SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC activation function.
MIT License

Wrong recognition results #77

Closed ismorphism closed 7 years ago

ismorphism commented 7 years ago

Hi all! I trained the DeepSpeech model with this command:

th Train.lua -nGPU 1 -epochSave -batchSize 25 -validationBatchSize 25 -permuteBatch -trainingSetLMDBPath /prepare_datasets/libri_lmdb/train/ -validationSetLMDBPath /prepare_datasets/libri_lmdb/test/ -modelTrainingPath /model_test/ -saveFileName NewOne.t7 -epochs 100

If I remember correctly, this is the pure DeepSpeech model. I got the following results:

Training Epoch: 83 Average Loss: 0.004611 Average Validation WER: 17.13 Average Validation CER: 3.96

For the training path I used train-clean-100.tar.gz, and for the validation path I used the union of test-clean.tar.gz and dev-clean.tar.gz. But when I tried to recognize any .wav or .flac files (including the training/testing files) I got meaningless output. This is very strange given the learning results above. Has anyone had a similar problem? Could someone explain what's wrong?

yfletberliac commented 7 years ago

Hi @morphism90, I suspect this is due to the way you recorded/converted your .wav and .flac files. Did you use a sound editor, e.g. Audacity, to take care of the sample rate and the extension? Also, if I remember correctly, using the same file extension as the one used in FormatLibriSpeech.lua helped.
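(A quick way to check this, if SoX is installed, is soxi sample.wav, which prints a file's sample rate, bit depth and channel count; soxi is the standard SoX info tool and the filename here is just a placeholder.)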

ismorphism commented 7 years ago

The weirdest thing is that for model testing I used the default .flac files from the test-clean and train-clean directories. None of them was recognized successfully.

yfletberliac commented 7 years ago

Ah yes, this sounds weird. Could you print the error logs you get when doing this, if you can reproduce them? You run the tests with Predict.lua, right?

ismorphism commented 7 years ago

Yes, I used Predict.lua. Do you mean error logs with CER and WER values?

yfletberliac commented 7 years ago

@morphism90 OK, I read too quickly: the run indeed succeeded but the transcription didn't :-) So if you used the same extension and sample rate for Train.lua and Predict.lua, I don't really see what you did wrong here...

ismorphism commented 7 years ago

For prediction I used the following command: th Predict.lua -modelPath deepspeech.torch/models/model_epoch_deepspeech.t7 -audioPath sample.wav

If in the original audio track we hear "I love you", it was recognized as "y o". Looks bad. Other tests show the same situation.

ismorphism commented 7 years ago

Also, when I used the AN4 dataset I got good results on the test and train sets, in contrast to the LibriSpeech dataset. Maybe the key difference is that in the AN4 case training used an4.dic rather than the default ./dictionary folder?

yfletberliac commented 7 years ago

It shouldn't be an issue; I used the default ./dictionary and it was fine. When you say

For prediction I used following command: th Predict.lua -modelPath deepspeech.torch/models/model_epoch_deepspeech.t7 -audioPath sample.wav

were there also transcription problems with .flac files instead of .wav?

ismorphism commented 7 years ago

There are no problems with transcription

yfletberliac commented 7 years ago

Ok, so I think that if you trained your model with -audioExtension flac in FormatLibriSpeech.lua, you also need to use .flac files when testing your model's performance. It seems that file extensions are critical here and they need to correspond between training and testing.

ismorphism commented 7 years ago

Yeah, I agree with you but my model doesn't even work for training examples in .flac format.

SeanNaren commented 7 years ago

@yfletberliac thanks for your help so far :)

So just to clear a few things up: from the log it seems like the model did train OK-ish. My questions are:

What audio are you currently giving the model to predict on? Is it in the same format as the audio the model was trained on (16-bit, 16 kHz)?

ismorphism commented 7 years ago

Yes. I used http://www.online-convert.com/ for online conversion to the desired .flac format.

suhaspillai commented 7 years ago

@morphism90 I don't think file extensions are critical, because when you use audio.spectrogram(...) you convert your speech file to a (frequency x time) matrix. I trained the model on .flac files and have also tested on .wav files; I never faced this issue.

What I would recommend is just loading the model in the torch terminal, doing a forward pass on sample.wav or sample.flac, and checking the output.
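Something along these lines from the th REPL should be enough for a sanity check. To be clear, the window/stride values, the input reshape and the cuda calls below are my assumptions rather than the repo's exact Predict.lua code, so match them to whatever your Train.lua run used:

require 'nn'
require 'cunn'    -- only needed if the checkpoint was saved from a GPU model
require 'cudnn'
require 'audio'   -- the lua 'audio' rock the repo already depends on

local model = torch.load('models/model_epoch_deepspeech.t7')
model:evaluate()

local wave = audio.load('sample.flac')                       -- raw waveform samples
local spect = audio.spectrogram(wave, 320, 'hamming', 160)   -- assumed ~20 ms window / 10 ms stride at 16 kHz
local input = spect:view(1, 1, spect:size(1), spect:size(2)):cuda()  -- assumed batch x channel x freq x time layout

local output = model:forward(input)
print(output:size())   -- per-timestep class scores, before any CTC decoding

If the raw network output already looks degenerate (for example the same class winning at every timestep), then the problem is upstream of the CTC decoding that Predict.lua does.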

Also, did you try SoX for doing the conversion of audio?
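For reference, an invocation like sox input.wav -r 16000 -b 16 -c 1 sample.flac gives you 16 kHz, 16-bit mono output; those are standard SoX output options and the filenames here are just placeholders.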

ismorphism commented 7 years ago

OK, I tried SoX for the conversion. I think there may be a problem with the dataset structure: when I trained on AN4 it showed fine recognition results on the training data, but when I used train-clean-100.tar.gz as libri_datasets/train, with dev-clean and test-clean for validation, I got the issue described above.

markmuir87 commented 7 years ago

Did you happen to train on AN4 first? If that's the case, your problem might be related to #78.

ismorphism commented 7 years ago

Thank you a lot @markmuir87! That helped, and now my program runs as it should.