SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC activation function.
MIT License
260 stars, 73 forks

Out of memory issue when running Train.lua #97

Closed byuns9334 closed 6 years ago

byuns9334 commented 6 years ago

(I have already read this issue: https://github.com/SeanNaren/deepspeech.torch/issues/60) I am also trying to train on the LibriSpeech dataset, but when I run the command 'th Train.lua -batchSize 7 -epochSave -learningRateAnnealing 1.1 -trainingSetLMDBPath prepare_datasets/libri_lmdb/train/ -validationSetLMDBPath prepare_datasets/libri_lmdb/test/', I get this error:

th Train.lua -batchSize 7 -epochSave -learningRateAnnealing 1.1 -trainingSetLMDBPath prepare_datasets/libri_lmdb/train/ -validationSetLMDBPath prepare_datasets/libri_lmdb/test/
Number of parameters: 108028317
[======================================== 387/387 ====================================>] Tot: 3m7s | Step: 847ms
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-2489/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/home/byuns9334/torch/install/bin/luajit: ...e/byuns9334/torch/install/share/lua/5.1/nn/Container.lua:67:
In 2 module of nn.Sequential:
In 1 module of nn.Sequential:
In 5 module of cudnn.BatchBRNNReLU:
/home/byuns9334/torch/install/share/lua/5.1/cudnn/init.lua:265: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-2489/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
    [C]: in function 'resize'
    /home/byuns9334/torch/install/share/lua/5.1/cudnn/init.lua:265: in function 'allocateStorage'
    /home/byuns9334/torch/install/share/lua/5.1/cudnn/init.lua:324: in function 'setSharedWorkspaceSize'
    /home/byuns9334/torch/install/share/lua/5.1/cudnn/RNN.lua:537: in function </home/byuns9334/torch/install/share/lua/5.1/cudnn/RNN.lua:404>
    [C]: in function 'xpcall'
    ...e/byuns9334/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    .../byuns9334/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function <.../byuns9334/torch/install/share/lua/5.1/nn/Sequential.lua:41>
    [C]: in function 'xpcall'
    ...e/byuns9334/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    .../byuns9334/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function <.../byuns9334/torch/install/share/lua/5.1/nn/Sequential.lua:41>
    [C]: in function 'xpcall'
    ...e/byuns9334/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    .../byuns9334/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./ModelEvaluator.lua:70: in function 'runEvaluation'
    ./Network.lua:75: in function 'testNetwork'
    ./Network.lua:168: in function 'trainNetwork'
    Train.lua:43: in main chunk
    [C]: in function 'dofile'
    ...9334/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    ...e/byuns9334/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    .../byuns9334/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./ModelEvaluator.lua:70: in function 'runEvaluation'
    ./Network.lua:75: in function 'testNetwork'
    ./Network.lua:168: in function 'trainNetwork'
    Train.lua:43: in main chunk
    [C]: in function 'dofile'
    ...9334/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x00405d50

I think a mini-batch size of 7 is already small enough, so why do I still get this OOM error? Any ideas on how to fix it?

byuns9334 commented 6 years ago

I still get the error even with a mini-batch size of 1.
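
A note on reading the trace: the progress bar reached 387/387 before the failure, and the project-local frames run through runEvaluation and testNetwork. This suggests the allocation failed during the evaluation pass that follows the training epoch, not during a training step. The sketch below is a small Python helper, not part of the repository, that pulls the project-local frames out of a Lua traceback to make that call path easier to see. The regular expression and the name `local_frames` are illustrative assumptions. It only matches frames whose path starts with `./`, as they appear in the trace above.

```python
import re

# Frames of interest look like "./Network.lua:75: in function 'testNetwork'".
# Matching only paths that start with "./" is an assumption based on how the
# project-local files appear in the trace posted in this issue.
FRAME = re.compile(r"\./(\w+\.lua):(\d+): in function '(\w+)'")


def local_frames(trace):
    """Return (file, line, function) for each project-local frame, innermost first."""
    return [(name, int(line), func) for name, line, func in FRAME.findall(trace)]


# Tail of the traceback from the error message, copied as posted.
trace = (
    ".../byuns9334/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward' "
    "./ModelEvaluator.lua:70: in function 'runEvaluation' "
    "./Network.lua:75: in function 'testNetwork' "
    "./Network.lua:168: in function 'trainNetwork' "
    "Train.lua:43: in main chunk"
)

for name, line, func in local_frames(trace):
    print(f"{name}:{line} in {func}")
```

The first frame printed is the innermost project-local call, which is where to start looking when deciding which part of the run needs less memory.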