SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC activation function.
MIT License

CUDA out of memory error while training on LibriSpeech dataset #60

Closed: suhaspillai closed this issue 7 years ago

suhaspillai commented 7 years ago

I am trying to train the model on the LibriSpeech dev-clean dataset, with a train split of 2503 and a val split of 200. I reduced my val split thinking that might be the issue. Based on the memory consumption (which I checked using nvidia-smi), I think all the training data is loaded at once, and so is the validation data, right? Did anyone face this issue? Following is the stack trace:

THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6130/cutorch/lib/THC/generic/THCStorage.cu line=40 error=2 : out of memory
/home/sbp3624/torch/install/bin/luajit: /home/sbp3624/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 2 module of nn.Sequential:
In 1 module of nn.Sequential:
In 5 module of cudnn.BatchBRNNReLU:
/home/sbp3624/torch/install/share/lua/5.1/cudnn/RNN.lua:308: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-6130/cutorch/lib/THC/generic/THCStorage.cu:40
stack traceback:
    [C]: in function 'resize'
    /home/sbp3624/torch/install/share/lua/5.1/cudnn/RNN.lua:308: in function </home/sbp3624/torch/install/share/lua/5.1/cudnn/RNN.lua:262>
    [C]: in function 'xpcall'
    /home/sbp3624/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /home/sbp3624/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </home/sbp3624/torch/install/share/lua/5.1/nn/Sequential.lua:41>
    [C]: in function 'xpcall'
    /home/sbp3624/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /home/sbp3624/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </home/sbp3624/torch/install/share/lua/5.1/nn/Sequential.lua:41>
    [C]: in function 'xpcall'
    /home/sbp3624/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /home/sbp3624/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./ModelEvaluator.lua:70: in function 'runEvaluation'
    ./Network.lua:78: in function 'testNetwork'
    ./Network.lua:170: in function 'trainNetwork'
    Train.lua:42: in main chunk
    [C]: in function 'dofile'
    ...3624/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    /home/sbp3624/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    /home/sbp3624/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./ModelEvaluator.lua:70: in function 'runEvaluation'
    ./Network.lua:78: in function 'testNetwork'
    ./Network.lua:170: in function 'trainNetwork'
    Train.lua:42: in main chunk
    [C]: in function 'dofile'
    ...3624/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
    [C]: at 0x00406670
SeanNaren commented 7 years ago

Hey! Try turning down the batch size from the default. How much memory do you have on your GPU? You can turn it down via:

th Train.lua -batchSize xx (where xx is the batch size you want!)

nn-learner commented 7 years ago

Yes, turn down your batch size, and, as @shantanudev suggested, lower the number of hidden nodes as well.

suhaspillai commented 7 years ago

@SeanNaren I had tried reducing the batch size but it did not work. I think there was some issue with the GPU, because it's working now. Thanks for the quick reply.

iassael commented 7 years ago

Still getting the same error. The memory usage seems to keep increasing.

SeanNaren commented 7 years ago

What GPU are you running this on @iassael?

iassael commented 7 years ago

@SeanNaren I'm on a Tesla GP100. I added collectgarbage() every 100 batches, a little after optim.sgd, but the problem persists. Do you have any suggestions on where to look?
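For illustration, a minimal sketch of what that periodic collectgarbage() call looks like in a generic Torch/optim training loop (this is not the actual Train.lua code; the model, data, and 100-batch interval here are just placeholders):

-- Minimal sketch only: a generic Torch loop, not the real Train.lua.
-- Placeholders: nn.Linear model, random data, a hypothetical 100-batch interval.
require 'torch'
require 'nn'
require 'optim'

local model = nn.Linear(10, 2)                 -- stand-in for the real network
local criterion = nn.MSECriterion()
local params, gradParams = model:getParameters()
local optimState = { learningRate = 1e-3 }
local collectEvery = 100                       -- call collectgarbage() every 100 batches

for i = 1, 1000 do                             -- stand-in for iterating over real batches
    local input, target = torch.randn(10), torch.randn(2)
    local function feval(p)
        if p ~= params then params:copy(p) end
        gradParams:zero()
        local output = model:forward(input)
        local loss = criterion:forward(output, target)
        model:backward(input, criterion:backward(output, target))
        return loss, gradParams
    end
    optim.sgd(feval, params, optimState)       -- parameter update
    if i % collectEvery == 0 then
        collectgarbage()                       -- free unreferenced tensors periodically
    end
end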

SeanNaren commented 7 years ago

Excited to see my stuff running on a P100 :D

What batch size are you using? Have you made it through the first epoch? My assumption is that you need to turn the batch size down using the -batchSize flag, since some of the sequences towards the end of the epoch are fairly long and thus take more memory.

fanlamda commented 7 years ago

I ran into an out-of-memory problem when using two GPUs. During the second epoch, memory usage still keeps going up. Do you have any idea?

iassael commented 7 years ago

@SeanNaren hehe it's the perfect benchmark for these babies :)

My batch size is 20, but I only make it nearly to the end of the first epoch, as memory usage increases many times over during the first epoch.

SeanNaren commented 7 years ago

@fanlamda could you try the latest master branch? I've made permuting batches an option via the -permuteBatch flag in training, which defaults to false. I've noticed huge increases in memory due to permuting the batch order for all batches after the first epoch.

@iassael could you try reducing this to something like 15 and seeing if you get through? Also, I've just started the setup on my end to train this model on my own server, and I've got some advice for people. The current setup for the architecture is way overkill; I'd suggest using training params similar to the below:

th Train.lua -hiddenSize 600 -LSTM -nbOfHiddenLayers 5

This is a much smaller model (closer to the number of params in the DS2 architecture, but it uses LSTMs and no weight sharing between the RNNs, so it isn't as good), but our dataset is also much smaller. Hopefully all this helps!
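Putting the thread's suggestions together, a lower-memory invocation might look like the line below (the batch size of 15 is just the example value mentioned above; tune it to your GPU's memory):

th Train.lua -batchSize 15 -hiddenSize 600 -LSTM -nbOfHiddenLayers 5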

iassael commented 7 years ago

@SeanNaren you are right. I tried pulling from the latest branch and switching to a smaller architecture, but the memory is constantly increasing throughout the iterations of the first epoch. I thought it could be the warp_ctc implementation, so I switched to nGPU 1 and time-first, still without success. Do you have any intuition on where to look?

SeanNaren commented 7 years ago

The GPU memory will always increase as we move through the epoch, since the batches get larger and larger (and those RNNs take a lot of memory!). Because of this, the largest batch size per GPU with 12GB of VRAM is usually around 15 from what I've seen.

I know Baidu were able to use much larger batches, but they had their own internal software and were extremely efficient about how they used memory. Sadly, with Torch it's a little more difficult, but the model should still be trainable with smaller batch sizes! Hopefully this helps!

fanlamda commented 7 years ago

It works @SeanNaren

SeanNaren commented 7 years ago

I have added the command I use to start training here.

SeanNaren commented 7 years ago

If anyone still has issues feel free to open a new issue :)