Issues to Run on GPU - Githubissues

boleamol commented 8 years ago

Hi, thanks for your support so far. We are now also trying to run on a GPU. We are using an entry-level NVIDIA GPU, a Quadro K420, which has 192 CUDA cores and 1024 MB of total memory. I installed all the dependencies mentioned in the README.md file, but I am getting the following error. After hitting the error I re-checked the dependencies, but nothing changed.

Training Epoch: 1
lua: /root/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 1 module of nn.Sequential:
/root/torch/install/share/lua/5.1/cudnn/init.lua:58: Error in CuDNN: CUDNN_STATUS_BAD_PARAM (cudnnSetFilterNdDescriptor)
stack traceback:
C: in function 'error'
/root/torch/install/share/lua/5.1/cudnn/init.lua:58: in function 'errcheck'
...h/install/share/lua/5.1/cudnn/SpatialConvolution.lua:45: in function 'resetWeightDescriptors'
...h/install/share/lua/5.1/cudnn/SpatialConvolution.lua:358: in function <...h/install/share/lua/5.1/cudnn/SpatialConvolution.lua:357>
(tail call): ?
C: in function 'xpcall'
/root/torch/install/share/lua/5.1/nn/Container.lua:58: in function 'rethrowErrors'
/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
(tail call): ?
C: in function 'xpcall'
/root/torch/install/share/lua/5.1/nn/Container.lua:58: in function 'rethrowErrors'
/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
(tail call): ?
./Network.lua:95: in function 'opfunc'
/root/torch/install/share/lua/5.1/optim/sgd.lua:44: in function 'sgd'
./Network.lua:111: in function 'trainNetwork'
AN4CTCTrain.lua:40: in main chunk

Please support ...

SeanNaren commented 8 years ago

Hm, this is strange, it is working fine on my end. Just as a test, in the Network.lua class could you replace all cudnn. with nn. in the createSpeechNetwork() method and try running again? That way we can find out whether it is just a cudnn problem or whether there is something wrong within the code.
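For anyone repeating this test, the swap can be done mechanically rather than by hand. The following is only a rough sketch: the function name and the output file name are made up for illustration, and it assumes every cudnn module used in the file has an nn module of the same name. Any cudnn-only module would still need to be handled by hand, and the substitution touches the whole file rather than just createSpeechNetwork().

```shell
#!/bin/sh
# Sketch: rewrite every "cudnn." prefix to "nn." in a Lua source file.
# The result goes to a separate file so the original stays untouched.
swap_cudnn_for_nn() {
    src="$1"   # input file, e.g. Network.lua
    dst="$2"   # output file (hypothetical name, pick your own)
    sed 's/cudnn\./nn./g' "$src" > "$dst"
}
```

Calling `swap_cudnn_for_nn Network.lua Network.nn.lua` and then running `diff Network.lua Network.nn.lua` shows exactly which lines changed before anything is overwritten.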

boleamol commented 8 years ago

As per your guidance I modified the createSpeechNetwork() method and now it runs, but the GPU does not have enough memory, so it gives this error:

Training Epoch: 1
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-2631/cutorch/lib/THC/generic/THCStorage.cu line=41 error=2 : out of memory
lua: .../speech/torch/install/share/lua/5.1/nn/Container.lua:69:
In 1 module of nn.Sequential:
In 5 module of nn.Sequential:
/home/speech/torch/install/share/lua/5.1/nn/THNN.lua:109: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-2631/cutorch/lib/THC/generic/THCStorage.cu:41
stack traceback:
[C]: in function 'v'
/home/speech/torch/install/share/lua/5.1/nn/THNN.lua:109: in function 'SpatialConvolutionMM_updateOutput'
...orch/install/share/lua/5.1/nn/SpatialConvolution.lua:104: in function <...orch/install/share/lua/5.1/nn/SpatialConvolution.lua:100>

I also watched the memory usage and it reached 97%. I am now waiting for a new GPU with a higher specification. Anyway, thanks for the support.
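One way to watch GPU memory usage during training, assuming the NVIDIA driver's `nvidia-smi` tool is installed, is to query used and total memory and turn them into a percentage. This is a sketch only; the two function names are invented for illustration and the thread does not say which tool was actually used.

```shell
#!/bin/sh
# Reads "used, total" pairs (MiB, one GPU per line) on stdin
# and prints the percentage of memory in use for each GPU.
gpu_mem_percent() {
    awk -F', *' '$2 > 0 { printf "%.0f\n", $1 / $2 * 100 }'
}

# Ask nvidia-smi for used and total memory as bare CSV numbers,
# then convert each line to a percentage.
report_gpu_mem() {
    nvidia-smi --query-gpu=memory.used,memory.total \
               --format=csv,noheader,nounits | gpu_mem_percent
}
```

Running `report_gpu_mem` in a second terminal while training is in progress gives a quick reading of how close the card is to its limit.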

SeanNaren commented 8 years ago

Ah, so it is a cudnn issue. Have you installed cudnn via the NVIDIA library (copying the .so files etc. to the /usr/local/cuda install location and adding it to the ~/.bashrc)?

And yeah, because of the batching it might use a fair bit of memory. I'm running a GTX 970 with 4 GB and it fits in memory alright.

Hopefully in the coming weeks I will completely redo the master branch with what's coming in the voxforge update branch, which will allow the minibatch size to be customised (by setting a max minibatch size) and so reduce memory overhead.

boleamol commented 8 years ago

Yes, I installed cudnn via the NVIDIA library, copied the .so files to the /usr/local/cuda install location, and added it to the ~/.bashrc, but the issue was still there. I will try again. If you are reducing the batch size, that is good for me. Thank you.

SeanNaren commented 8 years ago

Hopefully once I merge the branches it will allow you to run the model on your PC. I'll close the issue for now!

boleamol commented 8 years ago

Ok, fine Thank you sir...

slbinilkumar commented 8 years ago

The CUDNN_STATUS_BAD_PARAM issue can be solved by using cuDNN version 4 and adding it to the LD_LIBRARY_PATH.
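As a rough illustration of that suggestion, the lines below could go in ~/.bashrc. The directory is a hypothetical example of where a cuDNN v4 archive might have been unpacked; it is not taken from this thread and needs to be adjusted to the actual location of the libcudnn .so files.

```shell
# Hypothetical location of the unpacked cuDNN v4 libraries; adjust to your system.
CUDNN_LIB_DIR="$HOME/cudnn-v4/lib64"

# Prepend it to the dynamic linker search path, keeping any existing entries.
export LD_LIBRARY_PATH="$CUDNN_LIB_DIR${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
```

Prepending rather than appending means this copy is found before any other cuDNN version already on the path. A new shell (or `source ~/.bashrc`) is needed for the change to take effect.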

SeanNaren / deepspeech.torch

Issues to Run on GPU #4