CSTR-Edinburgh / merlin

This is now the official location of the Merlin project.
http://www.cstr.ed.ac.uk/projects/merlin/

Poor performance of Merlin on GPU #78

Open pglushkov opened 7 years ago

pglushkov commented 7 years ago

Hi all! Recently I've been running experiments with Merlin and tried to speed up the training process by using the GPU. Unfortunately I could only achieve about a 50% speed-up in training, which is a bit surprising, because all the Theano test scripts had shown a 10x to 30x performance increase.

Some details:
- OS: Ubuntu 14.04
- GPU: GeForce GTX 980
- Theano: 0.8.2
- NVIDIA driver: 361
- CUDA / CUDA Toolkit: 7.5
- cuDNN: 5.0
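(For reference, the Theano GPU test scripts mentioned above are along these lines; this is a minimal version adapted from the Theano documentation's GPU check, with an arbitrary vector length and iteration count:)

```python
# Minimal GPU sanity check, adapted from the Theano docs' "Testing the GPU" example.
# Run with e.g.: THEANO_FLAGS=device=gpu0,floatX=float32 python check_gpu.py
import time
import numpy
import theano
import theano.tensor as T

vlen = 10 * 30 * 768   # arbitrary vector length
iters = 1000

rng = numpy.random.RandomState(22)
x = theano.shared(numpy.asarray(rng.rand(vlen), theano.config.floatX))
f = theano.function([], T.exp(x))

t0 = time.time()
for _ in range(iters):
    f()
print('Looping %d times took %f seconds' % (iters, time.time() - t0))

# If any Elemwise op in the compiled graph is not a GPU op, Theano fell back to the CPU.
if numpy.any([isinstance(node.op, T.Elemwise) and
              'Gpu' not in type(node.op).__name__
              for node in f.maker.fgraph.toposort()]):
    print('Used the CPU')
else:
    print('Used the GPU')
```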

From the Merlin logs I conclude that the GPU / cuDNN are being used. This is also confirmed by system performance monitoring. The network I am training is pretty trivial: 4 layers - TANH, TANH, TANH, LSTM. Nothing fancy, just a training example.

Has anyone encountered issues like this? What other information can I provide to better describe the problem?

Any other comments and shared experience concerning Merlin + GPU are very welcome :)

dreamk73 commented 7 years ago

We currently have 2 machines set up to train models in Merlin. How many files are you training on, and how long does it take to train the acoustic model? And does it say it is using the GPU when you start the script?

For me, here are some recent stats:

Machine 1: gaming laptop
- OS: Ubuntu 14.04
- GPU: GeForce GTX 860
- Theano: 0.8.2
- numpy: 1.11.2
- NVIDIA driver: 352.39
- CUDA: 7.5
- cuDNN: 5.1
- slt_arctic_full acoustic db training: 39.39 minutes

This db uses 6 TANH layers with 1024 nodes per layer. I have not tried using an LSTM layer at the end yet.

Machine 2: new desktop
- OS: CentOS 7
- GPU: GeForce GTX 1080
- Theano: 0.8.2
- numpy: 1.11.2
- NVIDIA driver: 367.44
- CUDA: 8
- cuDNN: 5.1
- slt_arctic_full acoustic db training: 5.99 minutes


pglushkov commented 7 years ago

Hi, dreamk73, thanks for your reply!

Yes, the training log clearly states that the GPU is being used. It says: "... Running on GPU id=0 ... Using gpu device 0: GeForce GTX 980 (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 5005) ..."
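(As a side note, the device and float settings Theano actually picked up can also be checked at runtime; a minimal sketch, assuming the standard Theano 0.8 config attributes:)

```python
import theano

# With the setup shown in the log above, 'device' should report 'gpu0',
# and floatX should be 'float32' for the GPU to be used effectively.
print(theano.config.device)
print(theano.config.floatX)
```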

We use 200 files per epoch of acoustic model training; the duration of each file is about 3 seconds on average. The sample rate is 16000 Hz. The sizes of our layers are (512, 512, 512, 512).

One epoch of training takes about 40 seconds on the GPU. At the same time, when Theano uses the CPU + OpenMP, one epoch takes about 60 seconds. So the speed-up does not seem that impressive.

And how long does 1 epoch of acoustic model training take on your machines?

dreamk73 commented 7 years ago

The average time on the laptop is 85 seconds per epoch and on the new desktop only 10 seconds. I am not sure what all contributes to the processing speed. I haven't tried to run it with the CPU but I do know that it is a lot slower.

I know that the GTX 1080 is much faster than older models, but other factors can be at play here as well. What version of numpy do you use? One of the Merlin people told me that could also make a difference.

pglushkov commented 7 years ago

Dear dreamk73,

The current version of numpy in the environment used by Merlin on our PC is 1.8.2. I will try to update it and measure the performance again.

Could you please tell me how many files you process per epoch and what the mean duration of each file is?

spoofingchallenge commented 7 years ago

@pglushkov are you using a local disk or a network disk? Could you please check the I/O time and the CPU time?

pglushkov commented 7 years ago

We are using a local disk. Sorry, what exactly do you mean by I/O and CPU time?
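(For what it's worth, "I/O time vs. CPU time" here presumably means separating the time spent reading and preparing data from the time spent in the actual training call. A minimal sketch of such a breakdown, using hypothetical load_batch / file_list names rather than Merlin's real code, could be:)

```python
import time

io_time = 0.0
train_time = 0.0
for utt in file_list:                 # hypothetical list of training utterances
    t0 = time.time()
    x, y = load_batch(utt)            # hypothetical data-loading helper
    io_time += time.time() - t0

    t0 = time.time()
    err = train_fn(x, y)              # hypothetical training call for this sketch
    train_time += time.time() - t0

print('I/O: %.1f s, training: %.1f s per epoch' % (io_time, train_time))
```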

pglushkov commented 7 years ago

@dreamk73 Today I tried training with numpy 1.11.2 - no speed-up, unfortunately.

zhizhengwu commented 7 years ago

Could you please calculate the computation time for only this line of code?

this_train_error = train_fn(current_finetune_lr, current_momentum)

The other code is related to reading and processing the data, not the actual neural network training.
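(A minimal sketch of that measurement; the surrounding loop structure and counter names here are assumptions, only the quoted train_fn line is from Merlin:)

```python
import time

train_fn_time = 0.0
for minibatch_index in range(n_train_batches):   # assumed loop structure inside train_DNN()
    t0 = time.time()
    this_train_error = train_fn(current_finetune_lr, current_momentum)
    train_fn_time += time.time() - t0

print('Time spent inside train_fn this epoch: %.2f s' % train_fn_time)
```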

pglushkov commented 7 years ago

@zhizhengwu Hello! If you mean the line in the innermost loop of the train_DNN() function, then the average time for one batch is about 0.2 seconds (in our case 1 batch ~ 1 input wav file, which is about 3 seconds long on average).
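(For reference, that per-batch figure is consistent with the epoch time reported earlier in the thread; a rough back-of-the-envelope check:)

```python
# Rough consistency check using the numbers quoted in this thread.
batches_per_epoch = 200   # ~200 files per epoch, 1 batch per file
sec_per_batch = 0.2       # average time for one train_fn call
print(batches_per_epoch * sec_per_batch)   # ~40 s, i.e. roughly the whole 40 s epoch is spent in train_fn
```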