SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC activation function.

I get the same training error at every epoch #98

Open byuns9334 opened 6 years ago

byuns9334 commented 6 years ago

I'm running `th Train.lua -epochSave -learningRateAnnealing 1.1 -trainingSetLMDBPath prepare_datasets/libri_lmdb/train/ -validationSetLMDBPath prepare_datasets/libri_lmdb/test/ -LSTM -hiddenSize 500 -permuteBatch` on the LibriSpeech dataset, but I still get the same training error at every epoch, even though the loss keeps decreasing.

Here's what I get:

Number of parameters: 31576697
[==================== 136/136 ================>] Tot: 1m13s | Step: 646ms
Training Epoch: 1 Average Loss: nan Average Validation WER: 100.09 Average Validation CER: 62.14
Saving model..
[==================== 136/136 ================>] Tot: 1m18s | Step: 566ms
Training Epoch: 2 Average Loss: 7047724391721807312664730917666816.000000 Average Validation WER: 100.05 Average Validation CER: 61.98
Saving model..
[==================== 136/136 ================>] Tot: 1m18s | Step: 588ms
Training Epoch: 3 Average Loss: 3568794773768703940829837988462592.000000 Average Validation WER: 100.05 Average Validation CER: 62.00
Saving model..
[==================== 136/136 ================>] Tot: 1m19s | Step: 555ms
Training Epoch: 4 Average Loss: nan Average Validation WER: 100.05 Average Validation CER: 62.03
Saving model..

How should I resolve this?

SeanNaren commented 6 years ago

Something definitely looks wrong with the loss... Could you run the Torch tests in the warp-ctc repo and make sure the values it reports are not 0s or infs?
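
For reference, a plausible way to build warp-ctc and run its bundled tests; this is a sketch assuming the SeanNaren fork used by deepspeech.torch and the standard CMake layout described in its README, so paths may differ on your machine:

```sh
# Build warp-ctc from source (standard CMake layout of the warp-ctc repo).
git clone https://github.com/SeanNaren/warp-ctc.git
cd warp-ctc
mkdir build && cd build
cmake ..
make

# Run the bundled test binaries; every loss they report should be
# finite and non-zero (no 0s, infs, or nans).
./test_cpu
./test_gpu   # only present if built with CUDA support

# Install the Torch binding so it can be exercised from Lua as well.
cd ..
luarocks make torch_binding/rocks/warp-ctc-scm-1.rockspec
```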

byuns9334 commented 6 years ago

@SeanNaren What command should I use to run the tests in the warp-ctc repository?
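
For reference, the warp-ctc Torch binding's README documents a minimal check along these lines: a single uniform five-class frame with one target label should give a CTC loss of ln 5 ≈ 1.6094, so a 0, inf, or nan here points at a broken install. A sketch, assuming the binding is installed and loadable as `warp_ctc`:

```lua
require 'warp_ctc'

-- One time step, five classes with uniform (all-zero) activations,
-- and a single target label.
local acts = torch.Tensor({{0, 0, 0, 0, 0}}):float()
local grads = torch.zeros(acts:size()):float()
local labels = {{1}}
local sizes = {1}

-- Expect roughly 1.6094 (= ln 5); 0, inf, or nan indicates a broken build.
print(cpu_ctc(acts, grads, labels, sizes))
```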