SeanNaren / deepspeech.pytorch

Speech Recognition using DeepSpeech2.
MIT License
2.11k stars · 620 forks

At test time, the model outputs only an empty string and the WER is 100% #505

Closed · raotnameh closed this issue 4 years ago

raotnameh commented 4 years ago

I'm training on my own data; the loss decreases during training, but at inference time both the WER and the CER come to 100%. Also, when I checked the output, it's always an empty string.

Additionally, I ran inference with a model trained on LibriSpeech and it works; I get a WER of 54. I'm not sure why a model trained on my own data gives an empty string and a WER of 100%.

Any comments on why it's happening?

FYI the dataset is in English.

xieyidi commented 4 years ago

When you train, are the elements in label.json letters or words?

raotnameh commented 4 years ago

@xieyidi It's a list of all the letters plus the special symbols, as described by SeanNaren in the repo.

atifemreyuksel commented 4 years ago

I encountered the same issue, @raotnameh. My labels.json is below: ["_", "h", "d", "a", "0", "g", "8", "n", "k", "x", "v", "r", "p", "o", "j", "c", "9", "i", "5", "4", "ş", "q", "b", "ü", "7", "6", "y", "s", "w", "u", "2", "3", "1", "e", "t", "l", "ç", "ı", "f", "z", "m", "ö", "ğ", " "]

My dataset includes Turkish special characters and digits, but no punctuation. After one epoch the loss turned NaN, and when I ran inference on my validation set it returned only empty strings, with both WER and CER at 100%.

@xieyidi, @SeanNaren, do you have an idea why we encounter this issue after one epoch on a non-English dataset?

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

tqslj2 commented 4 years ago

I think it may be caused by a sample-rate mismatch between the training data and the inference data.
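A quick way to check for this is to compare the rate in each WAV header against the rate the model was trained on. A minimal sketch using only the standard-library `wave` module — the 16 kHz target and the file names are assumptions for illustration, so check your own training config:

```python
import wave

def write_silence(path, rate):
    """Write one second of silent 16-bit mono PCM at `rate` (demo helper)."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * rate)

def sample_rate(path):
    """Read the sample rate stored in a WAV header."""
    with wave.open(path, "rb") as w:
        return w.getframerate()

# Demo: one clip at the assumed training rate, one that would mismatch.
write_silence("train_like.wav", 16000)
write_silence("inference.wav", 8000)

mismatched = [p for p in ("train_like.wav", "inference.wav")
              if sample_rate(p) != 16000]
print(mismatched)  # -> ['inference.wav']
```

Any file this flags should be resampled to the training rate before inference.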

raotnameh commented 4 years ago

@atifemreyuksel Hi, I was able to solve it. Do this: train the model with a batch size of 1 and log the CTC loss for every file. Then remove the files with a NaN loss from the training CSV and train on the remaining files. The reason for the empty string is that once the loss turns NaN, everything goes bananas.
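The filtering step above can be sketched like this. Note the three-column rows with a logged loss are an assumption — the stock manifest only lists the wav/transcript paths, so the loss column would come from your own batch-size-1 logging pass:

```python
import math

def drop_nan_loss_rows(rows):
    """Keep only manifest rows whose logged CTC loss is a finite number.

    `rows` are (wav_path, txt_path, loss) tuples; the loss column is
    assumed to have been recorded during a batch-size-1 training pass.
    """
    return [r for r in rows if math.isfinite(float(r[2]))]

rows = [
    ("a.wav", "a.txt", "12.3"),
    ("b.wav", "b.txt", "nan"),  # this file blew up the loss -> drop it
    ("c.wav", "c.txt", "8.7"),
]
clean = drop_nan_loss_rows(rows)
print([r[0] for r in clean])  # -> ['a.wav', 'c.wav']
```

The surviving rows can then be written back out as the new training CSV.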

atifemreyuksel commented 4 years ago

Thank you @raotnameh. I also cleaned the data to address this; the probability of getting a NaN loss decreased after that.

The second change was using the Adam optimizer instead of SGD. Interestingly, this got rid of the NaN losses and training crashes entirely.
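The swap itself is a one-line change in PyTorch. A minimal sketch on a toy model — the learning rate is illustrative, not the repo's default:

```python
import torch

model = torch.nn.Linear(4, 2)

# Plain SGD, as used originally:
# optimizer = torch.optim.SGD(model.parameters(), lr=3e-4, momentum=0.9)
# Adam, which handled the unstable loss better in this thread:
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

x = torch.randn(8, 4)
loss = model(x).pow(2).mean()  # stand-in for the CTC loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Adam's per-parameter adaptive step sizes tend to tolerate the occasional loss spike better than a fixed SGD step.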

raotnameh commented 4 years ago

@atifemreyuksel Yeah, Adam is preferred over vanilla SGD when the loss isn't stable. Thanks for pointing it out.

SeanNaren commented 4 years ago

Let's make Adam an option in deepspeech.pytorch; we've also seen better stability with it.

kouohhashi commented 4 years ago

@raotnameh Hi, I have a question: did you find any pattern in the wav files that caused the NaN loss problem? I got a NaN loss after 10+ epochs with SpecAugment and tempo/gain perturbation.

I'm trying to figure out exactly what causes the NaN loss, but so far no clue. My hunch is that some augmentation morphs a wav file in a way that triggers it.

raotnameh commented 4 years ago

@kouohhashi Hi, I did not manually check which files were causing the problem; I just removed them from training.

But from my experience, look out for files with a short duration (e.g., less than 0.5 seconds).
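Short clips can be flagged up front from the WAV headers. A sketch with the standard-library `wave` module — the 0.5 s cutoff is the rule of thumb from this thread, not a repo constant, so tune it for your data:

```python
import wave

def write_silence(path, seconds, rate=16000):
    """Write silent 16-bit mono PCM of the given length (demo helper)."""
    with wave.open(path, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * int(seconds * rate))

def duration_seconds(path):
    """Duration of a PCM WAV file, from its header."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / w.getframerate()

def keep_long_enough(paths, min_seconds=0.5):
    """Drop clips shorter than `min_seconds` before training."""
    return [p for p in paths if duration_seconds(p) >= min_seconds]

write_silence("short.wav", 0.2)
write_silence("long.wav", 1.0)
print(keep_long_enough(["short.wav", "long.wav"]))  # -> ['long.wav']
```

Very short clips can end up shorter than their transcripts after striding, which is one known way to push the CTC loss to NaN or infinity.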

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Ihebzayen commented 2 years ago

> I'm training on my own data, the loss decreases while training but at the time of inference, the WER comes to 100% and so does the CER. Also, when I checked the output, it's always an empty string.
>
> Additionally, I did inference using a model trained on librispeech and it works, I get a WER of 54. I'm not sure, why when I use a model trained on my own data it gives an empty string and a WER of 100%?
>
> Any comments on why it's happening?
>
> FYI the dataset is in English.

Hello @raotnameh, I am using your repo to train an end-to-end NER-from-speech model and I get the same issue. Can you tell me the solution, please? Thanks in advance.