SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC activation function.
MIT License

Using a Custom Dataset #39

Closed shantanudev closed 8 years ago

shantanudev commented 8 years ago

Hi Sean,

I was trying to tweak your code to incorporate the LibriSpeech dataset, which has ~1000 hours of data (http://www.openslr.org/12/). However, for some reason the loss shows 'nan' after a few epochs. It is most likely something to do with how I am processing the data. I was hoping you might have some advice.

walkers-mv commented 8 years ago

I am having/had the same problem. When setting up AN4 for phonemes, I went through multiple iterations to get the dictionary coding right. There are a lot of moving parts here, and I found at least three that mattered:

1) The dictionary must be formatted correctly: index 0 must be a "no output" (blank) token (a small sketch follows this list).
2) The output size must be set to the width of the dictionary.
3) The dictionary must be concise; the LibriSpeech lexicon and/or phones.txt are pretty complex relative to AN4.
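As a rough illustration of points 1) and 2), here is a minimal Lua sketch of a character dictionary laid out that way; the tiny alphabet and every name below are placeholders, not code from this repo:

local alphabet = { ' ', "'", 'a', 'b', 'c' }  -- placeholder; continue through 'z'
local char2index = {}
for i, c in ipairs(alphabet) do
    char2index[c] = i               -- characters map to 1..#alphabet
end

local outputSize = #alphabet + 1    -- dictionary width: all characters plus the blank reserved at index 0

local function encode(transcript)
    local labels = {}
    for c in transcript:lower():gmatch('.') do
        assert(char2index[c], 'character not in dictionary: ' .. c)
        table.insert(labels, char2index[c])
    end
    return labels
end

print(outputSize, table.concat(encode("abc ab"), ' '))

With 26 letters plus space and the blank this gives the output size of 28 mentioned later in the thread (29 once an apostrophe is added).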

Secondarily, my suspicion is that there are some edge cases where the gradient for the CTC loss function explodes, maybe when no single token is correct. These networks are best trained with incrementally longer sequences... I'm considering trying to pretrain with AN4. In Kaldi's ASR recipe for LibriSpeech, they use a super complicated staged training process that incrementally grows the complexity of the model. Though DS2 didn't really address that in detail, I presume they don't just go for the gold out of the gate.

Are you trying to predict characters, or phonemes? If the latter, how did you handle OOV words? LibriSpeech's lexicon allows a bunch of non-word tokens. I think that was breaking some stuff for me.

shantanudev commented 8 years ago

I am trying to build a character-based acoustic model. To deal with unusual characters, I subsetted the dataset to exclude any odd characters.

SeanNaren commented 8 years ago

It seems that multiple people have been having this issue with exploding gradients. A fix offered by a friend, @ccorfield, was to shift the predictions fed into CTC by their maximum, using

predictions:add(-torch.max(predictions))

Place this just after the module:forward(inputs) call. Could someone verify how successful this is?
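For context, a minimal runnable sketch of where that line would sit inside a typical Torch feval closure; the dummy model, criterion and data below are placeholders standing in for the real DeepSpeech network and CTC loss, not code from this repo:

require 'nn'

local model = nn.Sequential():add(nn.Linear(10, 5)):add(nn.LogSoftMax())
local criterion = nn.ClassNLLCriterion()
local inputs, targets = torch.randn(4, 10), torch.LongTensor{1, 2, 3, 4}
local parameters, gradParameters = model:getParameters()

local function feval()
    gradParameters:zero()

    local predictions = model:forward(inputs)
    -- the suggested stabilisation: shift the activations by their maximum
    -- before handing them to the loss
    predictions:add(-torch.max(predictions))

    local loss = criterion:forward(predictions, targets)
    model:backward(inputs, criterion:backward(predictions, targets))
    return loss, gradParameters
end

print(feval())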

fanskyer commented 8 years ago

I found that clamping the gradient also helps; you'll also need to gradually lower it every few epochs [Baidu divided by 1.2 each epoch, but their data is quite large].
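Reading "lower it" above as annealing the learning rate (which is what the quoted divide-by-1.2 schedule does in Baidu's setup), a minimal sketch of such a schedule might look like this; every name and value other than the 1.2 factor is a placeholder:

local learningRate = 3e-4   -- placeholder starting value
local annealFactor = 1.2    -- the per-epoch divisor quoted above for Baidu
local nEpochs      = 10     -- placeholder

for epoch = 1, nEpochs do
    -- ... run one full training epoch with `learningRate` here ...
    learningRate = learningRate / annealFactor
    print(('after epoch %d the learning rate becomes %.6f'):format(epoch, learningRate))
end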

SeanNaren commented 8 years ago

@fanskyer am I right in thinking that gradParameters:div(inputs:size(1)) has a similar behaviour?

seed93 commented 8 years ago

Gradient clamping is critically important. Here is my fork, and it works well for LibriSpeech. You can run prepare_an4/generateLMDB.lua to generate the LMDB dataset and then simply run LibriSpeech.sh. However, I hit some fatal problems with Chinese; there are still some bugs to fix. Sorry about that.

cricket1 commented 8 years ago

@seed93 did you use all the training sets, i.e. train-clean-100, train-clean-360 and train-other-500, together?

shantanudev commented 8 years ago

@seed93 Thank you. I will give this a try.

fanskyer commented 8 years ago

@SeanNaren I just clamp with gradParameters:clamp(-X, X). From my understanding, Baidu's paper uses 50 and Google's paper uses 1; I think it also depends on batch size.
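As a standalone demonstration of that clamp (a sketch, not code from this repo; in the real training closure the same call is applied to the flattened gradParameters right after model:backward and before the optimiser step):

require 'torch'

local gradParameters = torch.randn(6) * 200   -- stand-in for the real flattened gradients
gradParameters:clamp(-50, 50)                 -- every element now lies in [-50, 50]; 50 is the value quoted for Baidu
print(gradParameters)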

SeanNaren commented 8 years ago

Lots of good stuff here. Let's see what information we can standardise, and if need be I'll get in contact with the devs of DS2 for additional information.

I should definitely flesh out the documentation to explain how to implement your own dataset, though.

seed93 commented 8 years ago

@cricket1 I only use train-clean-100 and train-clean-360 for training as train-other-500 has too much noise.

cricket1 commented 8 years ago

@seed93 thanks for the info

SeanNaren commented 8 years ago

So going off the paper it seems that we might need to normalize the gradients (taken from the paper):

If the norm of the gradient exceeds a threshold of 400, it is rescaled to 400
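A minimal sketch of that norm-based rescaling in Torch might look like the following; gradParameters here is a stand-in tensor for the flattened gradients, and maxNorm mirrors the -maxNorm option mentioned later in this thread:

require 'torch'

local maxNorm = 400                             -- threshold quoted from the paper
local gradParameters = torch.randn(1000) * 100  -- stand-in for the real flattened gradients

local norm = gradParameters:norm()              -- L2 norm of the gradients
if norm > maxNorm then
    gradParameters:mul(maxNorm / norm)          -- rescale so the norm becomes maxNorm
end
print(('norm before: %.1f, after: %.1f'):format(norm, gradParameters:norm()))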

shantanudev commented 8 years ago

@seed93 Do you have the script that processes the audio files for LibriSpeech? It seems I am getting an error on some of my files when I convert them to LMDB.

seed93 commented 8 years ago

@shantanudev You can use this Python script (librispeech.txt) to generate an index file used by prepare_an4/generateLMDB.lua.

shantanudev commented 8 years ago

@seed93 Hey, so I am utilizing some of your network for other data. I ran into this LMDB issue. Did you face it?

Iter: [1][1]. Time 2.927 data 0.335 Ratio 0.114. Error: 7.815. Learning rate: 0.001000
Iter: [1][2]. Time 1.765 data 0.111 Ratio 0.063. Error: 7.389. Learning rate: 0.001000
Iter: [1][3]. Time 1.805 data 0.029 Ratio 0.016. Error: 7.284. Learning rate: 0.001000
Iter: [1][4]. Time 2.738 data 0.054 Ratio 0.020. Error: 3.921. Learning rate: 0.001000
Error in LMDB function mdb_get : MDB_NOTFOUND: No matching key/data pair found

seed93 commented 8 years ago

@shantanudev I am not sure. Could you please check Loader.lua line 191: is start incremented by 1? If not, that may be the key point.

shantanudev commented 8 years ago

@seed93 Yes, you are right. I think it has something to do with Loader.lua. I switched it from the randomized next batch to the default, and it ran for over ~1400 iterations and then threw the same MDB_NOTFOUND issue.

shantanudev commented 8 years ago

@SeanNaren Hey Sean, I did some gradient rescaling and clamping for the LibriSpeech dataset. I no longer face an exploding gradient issue, but now I can't get the loss down. Below is my output. Any advice?

Training Epoch: 1 Average Loss: 486.412546 Average Validation WER: 100.00% Average Validation CER: 76.99%
[======================================== 1359/1359 ==================================>] Tot: 31m35s | Step: 1s696ms
Training Epoch: 2 Average Loss: 474.358469 Average Validation WER: 100.00% Average Validation CER: 71.38%
[======================================== 1359/1359 ==================================>] Tot: 31m52s | Step: 1s701ms
Training Epoch: 3 Average Loss: 467.973357 Average Validation WER: 100.00% Average Validation CER: 72.36%
[======================================== 1359/1359 ==================================>] Tot: 31m54s | Step: 1s700ms
Training Epoch: 4 Average Loss: 462.354399 Average Validation WER: 100.00% Average Validation CER: 76.07%
[======================================== 1359/1359 ==================================>] Tot: 32m1s | Step: 1s712ms
Training Epoch: 5 Average Loss: 458.244134 Average Validation WER: 100.00% Average Validation CER: 74.55%
[======================================== 1359/1359 ==================================>] Tot: 32m9s | Step: 1s715ms
Training Epoch: 6 Average Loss: 455.192114 Average Validation WER: 100.00% Average Validation CER: 80.35%
[======================================== 1359/1359 ==================================>] Tot: 32m10s | Step: 1s719ms
Training Epoch: 7 Average Loss: 453.269946 Average Validation WER: 100.00% Average Validation CER: 80.44%
[======================================== 1359/1359 ==================================>] Tot: 32m13s | Step: 1s709ms
Training Epoch: 8 Average Loss: 451.999062 Average Validation WER: 100.00% Average Validation CER: 81.34%
[======================================== 1359/1359 ==================================>] Tot: 32m14s | Step: 1s722ms
Training Epoch: 9 Average Loss: 451.329876 Average Validation WER: 100.00% Average Validation CER: 86.85%
[======================================== 1359/1359 ==================================>] Tot: 32m15s | Step: 1s720ms
Training Epoch: 10 Average Loss: 450.899667 Average Validation WER: 99.69% Average Validation CER: 86.73%
[======================================== 1359/1359 ==================================>] Tot: 32m23s | Step: 1s723ms
Training Epoch: 11 Average Loss: 450.707016 Average Validation WER: 100.00% Average Validation CER: 77.37%
[======================================== 1359/1359 ==================================>] Tot: 32m20s | Step: 1s724ms
Training Epoch: 12 Average Loss: 450.444550 Average Validation WER: 100.00% Average Validation CER: 82.45%
[======================================== 1359/1359 ==================================>] Tot: 32m23s | Step: 1s717ms
Training Epoch: 13 Average Loss: 450.567040 Average Validation WER: 100.00% Average Validation CER: 82.50%
[======================================== 1359/1359 ==================================>] Tot: 32m28s | Step: 1s728ms
Training Epoch: 14 Average Loss: 450.746221 Average Validation WER: 100.00% Average Validation CER: 87.40%
[======================================== 1359/1359 ==================================>] Tot: 32m30s | Step: 1s725ms
Training Epoch: 15 Average Loss: 450.957603 Average Validation WER: 100.00% Average Validation CER: 80.61%
[======================================== 1359/1359 ==================================>] Tot: 32m32s | Step: 1s731ms
Training Epoch: 16 Average Loss: 451.233814 Average Validation WER: 100.00% Average Validation CER: 80.71%
[======================================== 1359/1359 ==================================>] Tot: 32m34s | Step: 1s731ms
Training Epoch: 17 Average Loss: 451.624416 Average Validation WER: 99.81% Average Validation CER: 79.73%
[======================================== 1359/1359 ==================================>] Tot: 32m35s | Step: 1s744ms
Training Epoch: 18 Average Loss: 451.924486 Average Validation WER: 100.00% Average Validation CER: 87.77%
[======================================== 1359/1359 ==================================>] Tot: 32m37s | Step: 1s740ms
Training Epoch: 19 Average Loss: 452.446792 Average Validation WER: 100.00% Average Validation CER: 83.93%
[======================================== 1359/1359 ==================================>] Tot: 32m46s | Step: 1s737ms
Training Epoch: 20 Average Loss: 452.826641 Average Validation WER: 100.00% Average Validation CER: 89.59%
[======================================== 1359/1359 ==================================>] Tot: 32m48s | Step: 1s746ms
Training Epoch: 21 Average Loss: 453.145179 Average Validation WER: 100.00% Average Validation CER: 84.36%
[======================================== 1359/1359 ==================================>] Tot: 32m54s | Step: 1s758ms
Training Epoch: 22 Average Loss: 453.498289 Average Validation WER: 100.00% Average Validation CER: 84.25%
[======================================== 1359/1359 ==================================>] Tot: 32m49s | Step: 1s746ms
Training Epoch: 23 Average Loss: 453.906065 Average Validation WER: 100.00% Average Validation CER: 83.26%
[======================================== 1359/1359 ==================================>] Tot: 32m51s | Step: 1s755ms
Training Epoch: 24 Average Loss: 454.523965 Average Validation WER: 100.00% Average Validation CER: 90.90%
[======================================== 1359/1359 ==================================>] Tot: 32m57s | Step: 1s747ms
Training Epoch: 25 Average Loss: 455.264382 Average Validation WER: 100.00% Average Validation CER: 88.92%
[======================================== 1359/1359 ==================================>] Tot: 32m55s | Step: 1s750ms
Training Epoch: 26 Average Loss: 456.236383 Average Validation WER: 99.87% Average Validation CER: 94.25%
[======================================== 1359/1359 ==================================>] Tot: 32m59s | Step: 1s756ms
Training Epoch: 27 Average Loss: 457.110262 Average Validation WER: 99.63% Average Validation CER: 97.46%
[======================================== 1359/1359 ==================================>] Tot: 33m1s | Step: 1s756ms
Training Epoch: 28 Average Loss: 457.201546 Average Validation WER: 99.86% Average Validation CER: 93.42%
[======================================== 1359/1359 ==================================>] Tot: 33m4s | Step: 1s762ms
Training Epoch: 29 Average Loss: 457.056914 Average Validation WER: 100.00% Average Validation CER: 97.75%
[======================================== 1359/1359 ==================================>] Tot: 33m3s | Step: 1s756ms
Training Epoch: 30 Average Loss: 457.061620 Average Validation WER: 99.82% Average Validation CER: 98.60%
[======================================== 1359/1359 ==================================>] Tot: 33m2s | Step: 1s759ms
Training Epoch: 31 Average Loss: 456.853254 Average Validation WER: 100.00% Average Validation CER: 91.57%
[======================================== 1359/1359 ==================================>] Tot: 33m8s | Step: 1s768ms
Training Epoch: 32 Average Loss: 456.809145 Average Validation WER: 100.00% Average Validation CER: 96.19%
[======================================== 1359/1359 ==================================>] Tot: 33m11s | Step: 1s751ms
Training Epoch: 33 Average Loss: 456.829869 Average Validation WER: 100.00% Average Validation CER: 92.55%
[======================================== 1359/1359 ==================================>] Tot: 33m9s | Step: 1s759ms
Training Epoch: 34 Average Loss: 456.849699 Average Validation WER: 100.00% Average Validation CER: 91.85%
[======================================== 1359/1359 ==================================>] Tot: 33m11s | Step: 1s759ms
Training Epoch: 35 Average Loss: 456.827805 Average Validation WER: 99.77% Average Validation CER: 95.27%
[======================================== 1359/1359 ==================================>] Tot: 33m3s | Step: 1s753ms
Training Epoch: 36 Average Loss: 456.540524 Average Validation WER: 99.82% Average Validation CER: 91.53%
[======================================== 1359/1359 ==================================>] Tot: 33m6s | Step: 1s754ms
Training Epoch: 37 Average Loss: 456.459262 Average Validation WER: 100.00% Average Validation CER: 84.61%
[======================================== 1359/1359 ==================================>] Tot: 33m5s | Step: 1s744ms
Training Epoch: 38 Average Loss: 456.627596 Average Validation WER: 100.00% Average Validation CER: 89.98%
[======================================== 1359/1359 ==================================>] Tot: 33m9s | Step: 1s766ms
Training Epoch: 39 Average Loss: 456.650785 Average Validation WER: 100.00% Average Validation CER: 85.57%
[======================================== 1359/1359 ==================================>] Tot: 33m13s | Step: 1s755ms
Training Epoch: 40 Average Loss: 456.761818 Average Validation WER: 99.91% Average Validation CER: 80.78%
[======================================== 1359/1359 ==================================>] Tot: 33m6s | Step: 1s762ms
Training Epoch: 41 Average Loss: 456.813289 Average Validation WER: 100.00% Average Validation CER: 89.45%
[======================================== 1359/1359 ==================================>] Tot: 33m16s | Step: 1s764ms
Training Epoch: 42 Average Loss: 456.791189 Average Validation WER: 100.00% Average Validation CER: 85.62%
[======================================== 1359/1359 ==================================>] Tot: 33m11s | Step: 1s750ms
Training Epoch: 43 Average Loss: 456.848691 Average Validation WER: 100.00% Average Validation CER: 88.37%
[======================================== 1359/1359 ==================================>] Tot: 33m16s | Step: 1s770ms
Training Epoch: 44 Average Loss: 456.819969 Average Validation WER: 100.00% Average Validation CER: 82.89%
[======================================== 1359/1359 ==================================>] Tot: 33m17s | Step: 1s774ms
Training Epoch: 45 Average Loss: 456.868943 Average Validation WER: 100.00% Average Validation CER: 86.05%
[======================================== 1359/1359 ==================================>] Tot: 33m21s | Step: 1s768ms
Training Epoch: 46 Average Loss: 456.795347 Average Validation WER: 100.00% Average Validation CER: 85.85%
[======================================== 1359/1359 ==================================>] Tot: 33m23s | Step: 1s773ms
Training Epoch: 47 Average Loss: 456.875730 Average Validation WER: 100.00% Average Validation CER: 87.32%
[======================================== 1359/1359 ==================================>] Tot: 33m26s | Step: 1s783ms
Training Epoch: 48 Average Loss: 456.884329 Average Validation WER: 100.00% Average Validation CER: 83.81%
[======================================== 1359/1359 ==================================>] Tot: 33m30s | Step: 1s786ms
Training Epoch: 49 Average Loss: 456.855200 Average Validation WER: 100.00% Average Validation CER: 87.49%
[======================================== 1359/1359 ==================================>] Tot: 33m29s | Step: 1s776ms
Training Epoch: 50 Average Loss: 456.842913 Average Validation WER: 100.00% Average Validation CER: 90.02%
[======================================== 1359/1359 ==================================>] Tot: 33m37s | Step: 1s778ms
Training Epoch: 51 Average Loss: 456.774143 Average Validation WER: 100.00% Average Validation CER: 85.38%
[======================================== 1359/1359 ==================================>] Tot: 33m36s | Step: 1s796ms
Training Epoch: 52 Average Loss: 456.641994 Average Validation WER: 100.00% Average Validation CER: 78.57%
[======================================== 1359/1359 ==================================>] Tot: 33m35s | Step: 1s790ms
Training Epoch: 53 Average Loss: 456.646329 Average Validation WER: 100.00% Average Validation CER: 80.25%
[======================================== 1359/1359 ==================================>] Tot: 33m48s | Step: 1s778ms
Training Epoch: 54 Average Loss: 456.638523 Average Validation WER: 100.00% Average Validation CER: 80.59%
[======================================== 1359/1359 ==================================>] Tot: 33m45s | Step: 1s780ms
Training Epoch: 55 Average Loss: 456.681671 Average Validation WER: 100.00% Average Validation CER: 80.81%
[======================================== 1359/1359 ==================================>] Tot: 33m48s | Step: 1s783ms
Training Epoch: 56 Average Loss: 456.619647 Average Validation WER: 100.00% Average Validation CER: 83.38%
[======================================== 1359/1359 ==================================>] Tot: 33m46s | Step: 1s804ms
Training Epoch: 57 Average Loss: 456.616817 Average Validation WER: 100.00% Average Validation CER: 87.04%
[======================================== 1359/1359 ==================================>] Tot: 33m52s | Step: 1s799ms
Training Epoch: 58 Average Loss: 456.694973 Average Validation WER: 100.00% Average Validation CER: 78.63%
[======================================== 1359/1359 ==================================>] Tot: 33m55s | Step: 1s804ms
Training Epoch: 59 Average Loss: 456.601417 Average Validation WER: 100.00% Average Validation CER: 73.40%

SeanNaren commented 8 years ago

What did you rescale to? 400? What is your output size (for example, for English in this repo it is 28)?

shantanudev commented 8 years ago

Yes, I tried rescaling to 400 but didn't see any improvement. By output size do you mean character output size? I added an apostrophe, which makes the output length 29.

shantanudev commented 8 years ago

@seed93 Question: how many epochs did you run before you got a decent prediction on the LibriSpeech data? I think I finally got the loss to go down without it exploding.

seed93 commented 8 years ago

@shantanudev Less than 70 epochs. Sorry, I don't remember the exact number.

SeanNaren commented 8 years ago

I'm going to be working on generalising the data loading system right now to make it easier to attach your own datasets. Also, there are improvements on the main branch; it might be worth checking them out!

SeanNaren commented 8 years ago

Alright, hopefully this will be useful!

I've added a standard format that the data needs to be in before it is converted to an LMDB, using concurrent processes to speed things up.

First you will need the lua---parallel library: luarocks install parallel

The directory layout of the raw dataset has to be:

<root>/<train or test>/<dataset name>/<filename>.wav with a matching <filename>.txt

An example would be:

dataset/train/an4/fash.wav and dataset/train/an4/fash.txt for the audio file and transcript file.
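For illustration, a minimal sketch (assuming Torch's paths library; not code from this repo) that walks one such directory and pairs every .wav file with its .txt transcript:

require 'paths'

-- Collect { wav, txt } pairs from <root>/<train or test>/<dataset name>/
local function collectSamples(root, split, datasetName)
    local dir = paths.concat(root, split, datasetName)
    local samples = {}
    for file in paths.files(dir) do
        if file:match('%.wav$') then
            local base = file:gsub('%.wav$', '')
            local wavPath = paths.concat(dir, file)
            local txtPath = paths.concat(dir, base .. '.txt')
            assert(paths.filep(txtPath), 'missing transcript for ' .. wavPath)
            table.insert(samples, { wav = wavPath, txt = txtPath })
        end
    end
    return samples
end

local trainSamples = collectSamples('dataset', 'train', 'an4')
print(#trainSamples .. ' training utterances found')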

Now for LibriSpeech and AN4, I've added scripts to handle formatting, plus step-by-step instructions for using Torch to format the data and put it into LMDB format here.

It might be quicker to use something like Hadoop or Spark to do this for larger datasets, but hopefully this helps people out! I haven't tested the scripts on the full 1000 hours (mainly because my internet is slow as hell; I'll download and set these up as soon as I can!).

Finally, someone mentioned that exploding gradients are an even bigger issue with the Chinese alphabet. A fix for this might be to reduce the maxNorm from 400 to something much smaller by passing -maxNorm 200 to the train script.

shantanudev commented 8 years ago

@SeanNaren Hi Sean, I just checked out your latest commit. The network explodes after the first epoch on the LibriSpeech clean-100 data. Do you have any suggestions?

Number of parameters: 108028317
[======================================= 3171/3171 =================================>] Tot: 1h22m | Step: 1s571ms
Training Epoch: 1 Average Loss: nan Average Validation WER: 100.00 Average Validation CER: 100.00

Update: I decided to drop the hidden size to 1000 and the number of hidden layers down to 5. I also changed the momentum to 0.7.

For the first time, I have seen the WER drop below 99% and the CER below 70%. Below are the first few sample outputs from validation. I will update the results after the model finishes training.

WER = 33.33 | CER = 0.00 | Text = "but anders cared nothing about that" | Predict = "but anders care nothing about batgq"
WER = 33.33 | CER = 1.82 | Text = "not at all you are on the contrary most agreeable to me" | Predict = "not at all you arer on the contry most tagretabl to myyq"
WER = 37.50 | CER = 5.00 | Text = "there are few changes in the old quarter" | Predict = "there re fe changes in the old quarter wgq"

shantanudev commented 8 years ago

@SeanNaren I've been running for around ~35 epochs. The training has been a bit weird. I have been training using 3 Nvidia GTX 980 Ti GPUs with a batch size of 18. Is the batch size small enough that it might be an issue? At random epochs the model explodes, but the next epoch it comes back. Also, the WER is jumping around quite a bit. It definitely does seem to be learning and picking up a representation of the speech. Do you have any suggestions for what I can try?

loss          WER           CER
inf           9.9246e+01    8.8079e+00
1.5150e+02    8.6333e+01    1.2404e+01
1.1644e+02    8.2497e+01    1.2417e+01
-inf          7.6916e+01    9.5236e+00
8.2299e+01    7.0035e+01    1.1457e+01
-inf          7.4302e+01    1.9821e+01
6.2308e+01    9.0021e+01    2.6702e+01
5.4905e+01    7.2844e+01    2.2266e+01
4.8426e+01    7.8774e+01    2.9040e+01
4.2758e+01    7.8627e+01    3.2055e+01
3.7930e+01    7.5979e+01    2.5179e+01
3.3511e+01    6.9580e+01    3.4965e+01
2.9752e+01    6.7460e+01    2.6842e+01
2.6208e+01    7.2656e+01    3.1384e+01
2.3027e+01    6.9195e+01    3.0568e+01
2.0228e+01    6.6890e+01    2.8090e+01
1.7778e+01    6.6287e+01    2.7254e+01
1.5631e+01    7.1556e+01    2.2019e+01
1.3852e+01    7.2263e+01    2.8995e+01
1.2132e+01    6.8911e+01    2.5129e+01
1.0720e+01    6.8442e+01    3.3482e+01
-inf          6.4400e+01    2.1416e+01
8.2620e+00    6.7654e+01    2.1184e+01
7.3628e+00    8.1044e+01    2.2727e+01
6.5372e+00    5.7026e+01    1.5250e+01
5.7744e+00    6.2810e+01    1.7085e+01
5.1574e+00    6.1446e+01    1.8662e+01
4.6241e+00    5.3268e+01    1.3991e+01
4.1585e+00    6.0462e+01    1.7198e+01
3.7132e+00    6.9269e+01    1.9125e+01
3.2708e+00    6.3004e+01    1.5439e+01
3.0832e+00    7.3801e+01    2.5169e+01
2.7768e+00    5.1521e+01    1.3513e+01
2.5497e+00    5.3252e+01    1.1575e+01
-inf          6.4034e+01    1.8057e+01
2.1054e+00    6.0191e+01    1.3515e+01
2.0199e+00    5.0612e+01    9.2022e+00
1.8904e+00    6.7527e+01    2.2219e+01

SeanNaren commented 8 years ago

Something that I haven't done is limit the sizes of the batches, i.e. it might not be a good idea to train on really short initial utterances, or anything longer than 15 seconds. I can't promise I'll get around to it soon, but when I can I'll try to get this to converge on LibriSpeech.

zssloth commented 8 years ago

@shantanudev, constraining the maxNorm to a smaller value (like 100) with a batch size of 15 seems to work for me. I also used the LibriSpeech clean-100 data for training.

Number of parameters: 108028317
Training Epoch: 1 Average Loss: 244.608092 Average Validation WER: 83.58% Average Validation CER: 18.01%
Training Epoch: 2 Average Loss: 154.738264 Average Validation WER: 67.37% Average Validation CER: 7.71%
Training Epoch: 3 Average Loss: 120.194711 Average Validation WER: 57.30% Average Validation CER:5.78%
Training Epoch: 4 Average Loss: 97.769768 Average Validation WER: 54.36% Average Validation CER: 5.66%
Training Epoch: 5 Average Loss: 80.730050 Average Validation WER: 50.82% Average Validation CER: 4.80%
Training Epoch: 6 Average Loss: 67.619897 Average Validation WER: 50.13% Average Validation CER: 5.23%
Training Epoch: 7 Average Loss: 56.514421 Average Validation WER: 47.05% Average Validation CER: 4.28%
Training Epoch: 8 Average Loss: 47.239716 Average Validation WER: 44.99% Average Validation CER: 4.13%

shantanudev commented 8 years ago

Thank you everyone for your suggestions. I will let you know the results.

nn-learner commented 8 years ago

@ZhishengWang Hi, I have recently been experimenting and am currently facing similar issues to @shantanudev. Was 15 your batch size per GPU, or was it across multiple GPUs?

zssloth commented 8 years ago

@nn-learner, I trained on just one GPU (a Tesla M40) with a training batch size of 15 (the reason I didn't use a larger batch size is mainly the limited memory available on the M40).

SeanNaren commented 8 years ago

I've opened a new issue tracking progress of a Librispeech model. A new method of adding custom datasets was added to the main branch in #44.