Error in CuDNN: CUDNN_STATUS_BAD_PARAM (cudnnGetConvolutionNdForwardOutputDim)

dahlem commented 7 years ago

I'd like to run deepspeech.torch on a training dataset of 1M wav files using an AWS p2.8xlarge instance, and I'm running into the stack trace below. I installed torch and deepspeech.torch according to the installation instructions.

I run the training as follows:

    th Train.lua -epochSave \
      -learningRateAnnealing 1.1 \
      -trainingSetLMDBPath data_lmdb/train/ \
      -validationSetLMDBPath data_lmdb/test/ \
      -nGPU 8 \
      -logsTrainPath logs/deepspeech-big/TrainingLoss/ \
      -logsValidationPath logs/deepspeech-big/ValidationScores/ \
      -modelTrainingPath models/deepspeech-big/ \
      -epochs 500 \
      -learningRate 0.01 \
      -maxNorm 20 \
      -momentum 0.9 \
      -batchSize 32 \
      -validationBatchSize 32 \
      -permuteBatch

I have no problem with the 1000 hours of LibriSpeech data.

Any help is greatly appreciated. Dominik

    luajit: ...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:67:
    In 1 module of nn.Sequential:
    In 4 module of nn.Sequential:
    /home/ubuntu/torch/install/share/lua/5.1/cudnn/init.lua:162: Error in CuDNN: CUDNN_STATUS_BAD_PARAM (cudnnGetConvolutionNdForwardOutputDim)
    stack traceback:
        [C]: in function 'error'
        /home/ubuntu/torch/install/share/lua/5.1/cudnn/init.lua:162: in function 'errcheck'
        ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:140: in function 'createIODescriptors'
        ...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:188: in function <...torch/install/share/lua/5.1/cudnn/SpatialConvolution.lua:186>
        [C]: in function 'xpcall'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:41>
        [C]: in function 'xpcall'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </home/ubuntu/torch/install/share/lua/5.1/nn/Sequential.lua:41>
        [C]: in function 'xpcall'
        ...e/ubuntu/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
        /home/ubuntu/torch/install/share/lua/5.1/threads/queue.lua:65: in function </home/ubuntu/torch/install/share/lua/5.1/threads/queue.lua:41>
        [C]: in function 'pcall'
        /home/ubuntu/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
        [string "  local Queue = require 'threads.queue'..."]:13: in main chunk

SeanNaren commented 7 years ago

How long are the wav files (in seconds)? I wonder if it's running out of memory. Could you monitor the CUDA memory usage whilst training?

suhaspillai commented 7 years ago

I think it is because some of your wav files are too short (in seconds). Check the number of time steps after the first convolution: if it is less than the filter width of the second convolution, cuDNN will throw this error. At least, that was the case for me.
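The check described above can be sketched with the standard convolution output-length formula, out = floor((in + 2*pad - kernel) / stride) + 1. This is only an illustration: the `LAYERS` values (kernel, stride, padding along the time axis) and both function names are hypothetical and are not taken from the deepspeech.torch model definition, so substitute the real values from your network before relying on the numbers.

```python
def conv_out_len(n_in, kernel, stride=1, pad=0):
    """Output length of a convolution along one axis (floor convention)."""
    return (n_in + 2 * pad - kernel) // stride + 1


def survives_convs(n_frames, layers):
    """Return True if every layer still produces at least one time step.

    `layers` is a list of (kernel, stride, pad) tuples along the time axis.
    """
    n = n_frames
    for kernel, stride, pad in layers:
        # A padded input narrower than the filter has no valid output
        # position, which is the situation described in the comment above.
        if n + 2 * pad < kernel:
            return False
        n = conv_out_len(n, kernel, stride, pad)
    return True


# HYPOTHETICAL time-axis settings for two stacked convolutions.
# These are assumptions for illustration, not the repository's values.
LAYERS = [(11, 2, 0), (11, 1, 0)]

if __name__ == "__main__":
    for frames in (12, 30, 31, 100):
        print(frames, survives_convs(frames, LAYERS))
```

With these assumed settings, a clip needs at least 31 spectrogram frames to pass both layers; anything shorter leaves the second convolution with fewer time steps than its filter width.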

dahlem commented 7 years ago

@SeanNaren, the wav files are between 0.12 and 40 seconds long. Memory did not seem to be the issue. I looked into @suhaspillai's suggestion and cut out the wav files that were too short, and training is working now.

Thank you.

SeanNaren commented 7 years ago

Just to add to this, I think due to the convolutions, the minimum length that a clip can be is 0.5 seconds. I'd highly suggest cutting out anything that isn't 1 second or longer though.
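For anyone who wants to apply the 1-second cutoff before building the LMDB dataset, here is a minimal sketch using only the Python standard library. It is not part of deepspeech.torch; the function names and the `MIN_SECONDS` constant are made up for this example, and the `wave` module only reads uncompressed PCM wav files.

```python
import sys
import wave
from pathlib import Path

# Threshold suggested in the comment above; adjust to taste.
MIN_SECONDS = 1.0


def wav_duration_seconds(path):
    """Duration of a PCM wav file, computed from its header."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def split_by_duration(paths, min_seconds=MIN_SECONDS):
    """Partition paths into (keep, drop) by minimum duration."""
    keep, drop = [], []
    for path in paths:
        if wav_duration_seconds(path) >= min_seconds:
            keep.append(path)
        else:
            drop.append(path)
    return keep, drop


if __name__ == "__main__":
    # Usage: python filter_wavs.py <directory containing wav files>
    root = Path(sys.argv[1])
    keep, drop = split_by_duration(sorted(root.rglob("*.wav")))
    print("keeping %d files, dropping %d files" % (len(keep), len(drop)))
    for path in drop:
        print("too short:", path)
```

The script only reports which files fall below the threshold; moving or deleting them is left to the caller so nothing is removed by accident.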

SeanNaren / deepspeech.torch, issue #86