NervanaSystems / deepspeech

DeepSpeech neon implementation
Apache License 2.0

How to improve WER and integrate with an LM? #40

Closed AlexAlgo closed 7 years ago

AlexAlgo commented 7 years ago

Hi,

I have a few questions after training this model for more than one week.

First, I trained the model for 2 days and the WER was 58%. Next, I continued training it for another 2 days and the WER was 57.55%. Then I trained it for 3 more days, but the WER stayed at 57.55%. After one week I have trained 63 epochs in total, and my batch_size parameter is 32. Could you please suggest how to improve the WER? (In your blog post, the WER could reach 32.5% without an LM. What batch_size did you use in that case, and how many days did it take to get that result?)

Second, could you please give some suggestions on how to integrate this model with a language model to make an end-to-end solution?

Thanks! Alex


dimatter commented 7 years ago

@AlexAlgo what hardware do you use to get 3 hours per epoch?

AlexAlgo commented 7 years ago

@dimatter The GPU card is NVIDIA Tesla K80.

Neuroschemata commented 7 years ago

Which dataset are you using to train the model?

AlexAlgo commented 7 years ago

@Neuroschemata I used the Librispeech data set.

Neuroschemata commented 7 years ago

Are you using the entire 960 hours of training data, or only a portion of it? If you use only a portion of the training data, then the WER will definitely be worse. Using the training script as is on the entire dataset should get you to a WER of 32.5%.

As for integrating a language model, there's not much more we can add to what's stated in #22. EESEN is a good example to follow since it leverages Kaldi's WFST decoders. The same technique can be adapted to work with any framework that uses CTC.
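
In case a concrete sketch helps, below is a much-simplified CTC prefix beam search with shallow LM fusion, which is a lighter-weight alternative to the WFST decoding route used by EESEN; it is not this repo's API. The lm_score callable (e.g. a KenLM wrapper), the assumption that output index 0 is the CTC blank, and the alpha/beta weights are all illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def ctc_beam_search(probs, alphabet, lm_score, beam_width=16, alpha=1.0, beta=1.5):
    """Decode a (T, V) matrix of per-frame character probabilities.

    Assumes output index 0 is the CTC blank and alphabet[i - 1] is the character
    for index i. lm_score is a hypothetical callable returning log P_LM(text);
    alpha and beta are the usual LM weight and insertion bonus. Probabilities are
    kept in linear space for brevity; a real decoder would work in log space.
    """
    def score(item):
        prefix, (p_b, p_nb) = item
        return np.log(p_b + p_nb + 1e-30) + alpha * lm_score(prefix) + beta * len(prefix)

    # Each beam maps a text prefix to (P(ending in blank), P(ending in non-blank)).
    beams = {'': (1.0, 0.0)}
    for t in range(probs.shape[0]):
        next_beams = defaultdict(lambda: (0.0, 0.0))
        for prefix, (p_b, p_nb) in beams.items():
            # Extend with blank: the text prefix is unchanged.
            b, nb = next_beams[prefix]
            next_beams[prefix] = (b + probs[t, 0] * (p_b + p_nb), nb)
            for i, c in enumerate(alphabet, start=1):
                p = probs[t, i]
                new_prefix = prefix + c
                b, nb = next_beams[new_prefix]
                if prefix and prefix[-1] == c:
                    # A repeated character only starts a new symbol if a blank separated the copies.
                    next_beams[new_prefix] = (b, nb + p * p_b)
                    # Otherwise it collapses into the existing prefix.
                    b2, nb2 = next_beams[prefix]
                    next_beams[prefix] = (b2, nb2 + p * p_nb)
                else:
                    next_beams[new_prefix] = (b, nb + p * (p_b + p_nb))
        # Prune, with the external LM folded into the score (shallow fusion).
        beams = dict(sorted(next_beams.items(), key=score, reverse=True)[:beam_width])
    return max(beams.items(), key=score)[0]
```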

AlexAlgo commented 7 years ago

I used the entire 960 hours of the Librispeech training data. The weird thing in my case is that the WER stopped at 57.55% and did not improve after training for 2 and then 4 more days. I thought it could be due to the batch_size parameter, since stochastic gradient descent, batch gradient descent, and mini-batch gradient descent can reach different accuracies. Do you think so? To reach a WER of 32.5%, what batch_size does that model use? Also, thank you for the info on the language model.

Neuroschemata commented 7 years ago

If you are using a single GPU, then a batch size of 32 should be OK. We have trained the same model using an effective batch size of 32 per GPU and obtained a WER under 35. What values are you using for the other hyperparameters? I assume that you're using batch norm. One "trick" that usually improves performance is to make sure that the mini-batches contain examples of similar length. Using the "sortagrad" recipe from the original Deep Speech 2 paper also improves things.
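
To make the similar-length-batches / sortagrad idea concrete, here is a small sketch. The manifest layout (no header row, duration in column 1) and the function name are assumptions for illustration, not something the repo's data loader exposes.

```python
import csv
import random

def sortagrad_batches(manifest_path, batch_size, first_epoch=True, duration_col=1):
    """Group manifest rows into length-homogeneous mini-batches.

    Assumes each CSV row carries the utterance duration in `duration_col` and
    that there is no header row (assumptions about the manifest format).
    """
    with open(manifest_path) as f:
        rows = [r for r in csv.reader(f) if r]
    # Sort shortest-to-longest so each batch contains utterances of similar length.
    rows.sort(key=lambda r: float(r[duration_col]))
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    if not first_epoch:
        # After the first (sortagrad) epoch, shuffle at the batch level so the
        # order is randomized while batches stay length-homogeneous.
        random.shuffle(batches)
    return batches
```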

AlexAlgo commented 7 years ago

Yes, I may need to try SGD; in that case batch_size is 1, right? Here is my last training command: python train.py --manifest train:/root/deepspeech/librispeech/train-clean-100/train-manifest.csv --manifest val:/root/deepspeech/librispeech/train-clean-100/val-manifest.csv -e63 -z32 -s /deepspeech/model_output.pkl --model_file /deepspeech/model_output.pkl

Neuroschemata commented 7 years ago

That last training command shows you're training on train-clean-100? Is that what you always train on?

AlexAlgo commented 7 years ago

Yes, it is. I trained on that data set. Is that the right set for getting a WER of 32.5%?

Neuroschemata commented 7 years ago

That's the problem. You're not going to get 32.5% with train-clean-100. That's only 100 hours worth of data. A WER of 58% for a model trained on train-clean-100 is expected.
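
In case it helps, one way to cover the full 960 hours is to concatenate the per-subset training manifests (train-clean-100, train-clean-360, train-other-500) into a single file and point --manifest train: at it. The paths below just mirror the layout in the command quoted earlier and are assumptions:

```python
# Hypothetical sketch: merge per-subset Librispeech manifests into one 960-hour manifest.
subsets = ["train-clean-100", "train-clean-360", "train-other-500"]
root = "/root/deepspeech/librispeech"

with open("{}/train-960-manifest.csv".format(root), "w") as out:
    for subset in subsets:
        with open("{}/{}/train-manifest.csv".format(root, subset)) as f:
            data = f.read()
            # Guard against a missing trailing newline so rows don't run together.
            out.write(data if data.endswith("\n") else data + "\n")
```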

AlexAlgo commented 7 years ago

I got it. Thank you! Lastly, regarding the training command python train.py --manifest train:<training manifest> --manifest val:<validation manifest> -e <num_epochs> -z <batch_size> -s </path/to/model_output.pkl> [-b <backend>], could you advise:

  1. What is the recommended value for the batch_size parameter to get the best accuracy? and
  2. What is the optional parameter [-b <backend>] used for?

Neuroschemata commented 7 years ago

I would suggest going with the largest batch size that will fit in memory. In the case of Librispeech, a batch size of 32 will give pretty good results. The backend parameter selects the device to run on; for this model, you want to pass -b gpu.
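
For example (the paths and epoch count here are just placeholders, mirroring the command quoted earlier in the thread), a full-dataset run on a single GPU might look like: python train.py --manifest train:/path/to/train-960-manifest.csv --manifest val:/path/to/val-manifest.csv -e 20 -z 32 -b gpu -s /path/to/model_output.pkl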