baidu-research / ba-dls-deepspeech

Apache License 2.0
486 stars 174 forks

Suggestions for improving dev-set performance. #4

Open Feynman27 opened 7 years ago

Feynman27 commented 7 years ago

(I apologize if this question is better suited for StackOverflow, but I figure posting it here will reach the right audience in a shorter amount of time.)

I'm training this CTC-cost model on the LibriSpeech "train-other-500" dataset, which contains 500 hours of speech audio plus transcripts. I'm using the "dev-other" set for validation, which is reportedly the more challenging partition to model.

I trained the model over 20 epochs and have provided the distribution of the costs below.

[image: distribution of CTC costs over training iterations]

The weights are updated according to Nesterov momentum.

Since the validation performance plateaus at around iter=25000, I decided to checkpoint the model there and resume training with an exponential learning-rate decay schedule, decreasing the learning rate after each epoch (starting from iter=25000). The CTC costs under this schedule are shown below after a few epochs:

[image: CTC costs after resuming with the learning-rate decay schedule]

Unfortunately, this strategy doesn't appear to improve the model performance. Does anyone have any suggestions on how to improve the model other than what I've described above?
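For reference, the update rule described above (Nesterov momentum plus a per-epoch exponential learning-rate decay) can be sketched as follows. This is a minimal NumPy sketch, not the repo's actual training loop; the base learning rate of 0.1 and decay factor of 0.99 are illustrative values, not values taken from this codebase.

```python
import numpy as np

def nesterov_update(w, v, grad_fn, lr, momentum=0.9):
    """One Nesterov-momentum step: evaluate the gradient at the
    look-ahead point w + momentum*v, then update velocity and weights."""
    g = grad_fn(w + momentum * v)
    v = momentum * v - lr * g
    return w + v, v

def decayed_lr(base_lr, epoch, decay=0.99):
    """Exponential schedule: multiply the learning rate by `decay`
    once per epoch."""
    return base_lr * decay ** epoch

# Toy usage: minimize f(w) = w**2, whose gradient is 2*w.
w, v = 5.0, 0.0
for epoch in range(100):
    lr = decayed_lr(0.1, epoch)
    w, v = nesterov_update(w, v, lambda x: 2 * x, lr)
```

One thing to note with this kind of schedule: if the decay factor is too aggressive, the learning rate collapses before the model escapes the plateau, which can make the curve look flat for a different reason than high variance.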

srvinay commented 7 years ago

From the looks of it, your model seems to have high variance. You should try reducing the initial learning rate, adding regularization (dropout, or augmenting the audio with noise), or playing with the model architecture if those ideas don't work.
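The noise-augmentation idea can be sketched like this: mix white noise into each training waveform at a target signal-to-noise ratio. This is a hypothetical helper, not part of this repo's data pipeline, and the 20 dB SNR is just an illustrative default.

```python
import numpy as np

def add_noise(audio, snr_db=20.0, rng=None):
    """Mix white Gaussian noise into a waveform at a target SNR (dB).

    `audio` is a 1-D float array of samples. The noise power is scaled
    so that 10*log10(signal_power / noise_power) == snr_db.
    """
    if rng is None:
        rng = np.random.default_rng()
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise
```

In practice you would apply this (ideally with a randomly sampled SNR per utterance) inside the data generator, so each epoch sees a different noisy version of the same audio.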

dylanbfox commented 7 years ago

You can also try increasing the amount of data you're training on. By default the max wav length is set to 10 seconds (https://github.com/baidu-research/ba-dls-deepspeech/blob/master/data_generator.py#L53-L54), which excludes a good portion of the LibriSpeech corpus. Longer utterances will most likely increase memory usage, though.
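The effect of that cap is easy to measure on your own manifest before retraining. A minimal sketch, assuming each utterance is represented as a `(path, duration_seconds, transcript)` tuple (this layout is an assumption for illustration, not the repo's exact manifest schema):

```python
def filter_by_duration(utterances, max_duration=10.0):
    """Keep only utterances whose duration is at most max_duration seconds.

    `utterances` is assumed to be a list of (path, duration, text) tuples.
    Raising max_duration admits more training data at the cost of memory.
    """
    return [u for u in utterances if u[1] <= max_duration]

# Hypothetical usage: compare how much data survives different caps.
utts = [("a.wav", 3.2, "hi"), ("b.wav", 12.5, "longer one"), ("c.wav", 10.0, "edge")]
kept = filter_by_duration(utts, max_duration=10.0)
```

Running a count like this over the real manifest tells you how many hours a larger cap would recover, which helps weigh the extra memory cost.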