SeanNaren / deepspeech.torch

Speech Recognition using DeepSpeech2 network and the CTC activation function.
MIT License

What is the possible improvement on the performance of this project? #51

Closed chanil1218 closed 7 years ago

chanil1218 commented 7 years ago

I've trained on the LibriSpeech train-clean-100 dataset for 20 epochs and got these results:

Training Epoch: 20 Average Loss: 2.504252 Average Validation WER: 40.84 Average Validation CER: 3.60

This doesn't seem as well optimized as in the paper (they report WER 29.23 using roughly 1% of their dataset, about 120 hours).

But the result seems encouraging; we could catch up to their performance by adopting a few techniques, like SortaGrad or language models.

Could you suggest which parts of the paper aren't implemented here, or other possible improvements? Then I could work on improving this project's performance!
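(For reference, the WER/CER numbers in this thread are edit distances against the reference transcript, normalized by reference length: word-level for WER, character-level for CER. A minimal sketch in Python, purely for illustration since the repo itself is Lua/Torch:)

```python
def edit_distance(ref, hyp):
    # Classic Levenshtein distance via dynamic programming, one row at a time.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # deletion
                                   d[j - 1] + 1,      # insertion
                                   prev + (r != h))   # substitution
    return d[-1]

def wer(reference, hypothesis):
    # Word Error Rate: word-level edit distance over reference word count.
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # Character Error Rate: the same idea at character level.
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)
```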

SeanNaren commented 7 years ago

I'll hopefully be able to answer this soon once I train a model on the full 1k-hour Libri dataset, but I wonder if Baidu's result was the WER with a language model... that would give a nice reduction in WER!

shantanudev commented 7 years ago

I actually just got an EC2 instance with more powerful GPUs, so I can use larger batch sizes, and I will be experimenting with parameters. As for the question of whether a language model is used, check out this resource from one of the Baidu engineers: https://svail.github.io/mandarin/

The key part of the blog: "Deep Speech predicts characters directly, it is learning a language model of its own. In theory it can model the long-term dependencies of language and doesn’t necessarily need a language model."

chanil1218 commented 7 years ago

I just found that this project implements SortaGrad via the initially sorted LMDB data and permuteBatchOrder.

And while watching this lecture, Deep Learning for Speech Recognition (Adam Coates, Baidu), around 1:15:10: normalizing utterances so that batches have similar lengths, by padding/packing short utterances, would improve performance after the first sorted epoch.

What do you think?

And I will work on adding a language model and train again with the other settings unchanged.

SeanNaren commented 7 years ago

@chanil1218 thanks for your input! Yeah, currently I use permuteBatchOrder to implement SortaGrad for the first epoch. The mini-batches are also of similar lengths, since they are batched in order of size; is that what you meant?
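(For anyone following along, the SortaGrad schedule described in the DS2 paper amounts to something like this rough Python sketch: train on length-sorted batches for the first epoch so the CTC loss sees short, easy utterances first, then shuffle the batch order afterwards. The helper name is hypothetical, not this repo's Lua API.)

```python
import random

def epoch_batches(batches, epoch):
    # batches: list of mini-batches, pre-grouped in ascending length order.
    # Epoch 1 (SortaGrad): keep the sorted order so short utterances come
    # first and early gradients stay well-behaved.
    if epoch == 1:
        return batches
    # Later epochs: shuffle the *order of batches* (like permuteBatchOrder),
    # keeping each batch's similar-length grouping intact.
    shuffled = batches[:]
    random.shuffle(shuffled)
    return shuffled
```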

chanil1218 commented 7 years ago

I'm not sure, but the idea is a little different. When we choose the batch size with un-normalized inputs, we have to plan around the longest inputs to avoid exceeding memory, so a shorter piece of audio occupies a whole batch slot with the leftover space filled with 0s.

They suggest packing smaller audio files together into one piece, normalizing all the audios to take a similar amount of memory, so none of it is wasted.

An illustration, for batch size 3. Before normalizing, each short audio is padded out with 0s:

-------- 0s
--------------- 0s

After normalizing (packing smaller audios into one slot):

-------- -------------- (two smaller audios packed into one slot)
--------------- ------

This utilizes memory more efficiently.

The first sorted epoch has the same issue, because the memory upper bound is set by the longest audio times the batch size (e.g. 12000MB / 20 batches). The early batches hold only short audios:

-------
---------

while the last ones hold long audios:

---------------------
-----------------------

So I'm not talking only about the initial SortaGrad epoch.
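(In other words, a greedy first-fit packing, roughly like this Python sketch; the names are hypothetical, not from this repo:)

```python
def pack_utterances(durations, max_len):
    # durations: utterance lengths; max_len: the time budget per batch slot,
    # set by the longest utterance. Greedy first-fit: append each utterance
    # to the first slot it still fits in.
    slots = []  # each slot is a list of utterance indices
    fill = []   # current filled length of each slot
    for idx, dur in enumerate(durations):
        for s, used in enumerate(fill):
            if used + dur <= max_len:
                slots[s].append(idx)
                fill[s] += dur
                break
        else:
            slots.append([idx])
            fill.append(dur)
    # Concatenate each slot's audio and pad only the small remainder with 0s.
    # (Labels/CTC alignment would need extra care; this only shows the memory idea.)
    return slots
```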


nn-learner commented 7 years ago

@chanil1218 Can you report what hyperparameters you used or tried? Did you modify the model at all, or the maxNorm?

chanil1218 commented 7 years ago

@nn-learner I just modified maxNorm to 100

SeanNaren commented 7 years ago

I also have to set the maxNorm to 100 to prevent exploding gradients. I'm going to revise the structure, maybe incorporating LSTMs instead of RNNs, which have proven easier to train, but this departs from the DS2 structure, so it isn't set in stone!
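(For anyone reproducing this: maxNorm here is a gradient-norm clipping threshold. The equivalent mechanism in modern PyTorch, shown purely as an illustration since this repo is Lua Torch, looks like the sketch below; the model, sizes, and loss are stand-ins.)

```python
import torch
import torch.nn as nn

model = nn.Linear(161, 29)      # stand-in model (e.g. spectrogram bins -> characters)
optimizer = torch.optim.SGD(model.parameters(), lr=3e-4)
max_norm = 100.0                # the maxNorm value discussed above

x, y = torch.randn(8, 161), torch.randn(8, 29)
loss = nn.functional.mse_loss(model(x), y)  # stand-in loss; the repo uses CTC
loss.backward()
# Rescale gradients so their global L2 norm is at most max_norm, which stops
# a single bad batch from blowing up the weights.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
optimizer.step()
```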

zssloth commented 7 years ago

Revising the structure may work: I changed the network from BRNNs to BLSTMs, which resulted in better performance, with 28.37% WER after 30~50 epochs.

SeanNaren commented 7 years ago

I'll run tests using LSTMs instead of RNNs for AN4. If it gets better results (with roughly the same parameters) then I'll replace the RNNs with LSTMs on the main branch.
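(As a rough illustration of the swap under discussion, in PyTorch syntax rather than this repo's Lua Torch: a bidirectional vanilla RNN layer versus a bidirectional LSTM with the same hidden size. The LSTM's gating tends to train more stably, at roughly 4x the recurrent parameters. The hidden size is hypothetical.)

```python
import torch.nn as nn

hidden = 768  # hypothetical hidden size

# Default architecture discussed in this thread: bidirectional vanilla RNN.
brnn = nn.RNN(input_size=hidden, hidden_size=hidden,
              bidirectional=True, batch_first=True)

# Proposed swap: bidirectional LSTM with the same dimensions.
blstm = nn.LSTM(input_size=hidden, hidden_size=hidden,
                bidirectional=True, batch_first=True)
```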

nn-learner commented 7 years ago

@ZhishengWang I don't have a powerful set of GPUs, so it takes forever to get the WER down. Do you mind sharing the weights from the 100-hour training?

zssloth commented 7 years ago

@nn-learner Currently I only have a trained model with default architecture (BRNNs) and it gets about 42.15% WER on the LibriSpeech train-clean-100 dataset. If it's helpful to you, I will be glad to share the trained model.

SeanNaren commented 7 years ago

It's the current issue I'm dealing with (not enough compute power). Once I've nailed down an architecture that doesn't blow up as easily (gradient-wise), I'll spin up a server and train on the 1k dataset.

nn-learner commented 7 years ago

@ZhishengWang Yes, I would actually appreciate even those weights. The best I have been able to achieve is 50% WER on a single 12GB GPU.

nn-learner commented 7 years ago

@SeanNaren Yes, I will probably try launching those new Amazon K80 instances as well.

maciejkorzepa commented 7 years ago

I have a couple of questions regarding training performance. First of all, I observed that the time and GPU memory needed to process each batch steadily increase during the first epoch, but I assume that's expected behaviour since the first epoch uses SortaGrad, right? What bothers me is that after going through all the training batches, memory usage is around 70-85% (depending on the batch size) on a Titan X, and then it suddenly jumps and the script crashes with an out-of-memory error (produced by cutorch). Can an increase of 2-4GB in GPU usage after processing all batches be explained somehow? Does validation consume that many resources?

The second issue with the Titan X was that I often got inf losses, after which further training was impossible. I didn't change the maxNorm, though; maybe that could be the problem?

I have also run training on 2x K80, and it seems that a batch size of 20 is around the maximum a single K80 can handle. Training time was around 1h17m per epoch for 100h LibriSpeech, which seems quite long, doesn't it? @SeanNaren You are trying to train the model on the full 1k Libri dataset, so you have probably got decent training performance for smaller sets like 100 hours. Can you share what setup you use and whether you changed any hyperparameters other than maxNorm?

And one last question: how can I save the model during training? "-saveModelInTraining true" results in an error...

zssloth commented 7 years ago

Hi @nn-learner, try this link and see if it's available. (@nn-learner, I use the file-hosting service offered by Baidu Cloud, and it seems you can't get access to the link. Try this one instead: https://pan.baidu.com/s/1dF7LTjN If that doesn't work, you can email me and I'll send the model to you through email.)

nn-learner commented 7 years ago

@ZhishengWang Hi, I don't think it is the correct link. It says the page does not exist.

fanskyer commented 7 years ago

Do you guys have training results for the 1k-hour Libri data? I can reach something like 19% CER and 50% WER on the clean test set. Thanks.

shantanudev commented 7 years ago

For those of you who have been training on the 100-hour LibriSpeech data: what validation dataset are you using? Is it the "clean" test set (http://www.openslr.org/resources/12/test-clean.tar.gz)?

SeanNaren commented 7 years ago

Have a look here for WER/CERs as well as the pre-trained models: https://github.com/SeanNaren/deepspeech.torch#pre-trained-networks

I'll close this issue for now, but there is a pretty blatant piece missing from the package (and it's not simple!): using a language model and a CTC beam decoder would help drastically, since the training data is fairly small. But this does take work and time!
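(For context, greedy best-path CTC decoding, roughly what you get without a beam decoder, looks like the Python sketch below: take the argmax character per frame, collapse repeats, drop blanks. A beam search decoder instead keeps the top-N prefixes per frame and folds in a language-model score, which is where the big WER gains would come from. All names here are illustrative only.)

```python
def greedy_ctc_decode(frame_log_probs, alphabet, blank=0):
    # frame_log_probs: list of per-frame score vectors over the alphabet.
    # Best-path decode: argmax per frame, collapse repeats, remove blanks.
    best = [max(range(len(frame)), key=frame.__getitem__)
            for frame in frame_log_probs]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(alphabet[idx])
        prev = idx
    return "".join(out)

# A beam-search decoder keeps the top-N prefixes per frame instead of one,
# and rescores each prefix as  log P_ctc + alpha * log P_lm + beta * length,
# which is how an external language model gets folded in.
```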

AdolfVonKleist commented 7 years ago

What hardware configuration did you end up using for the LibriSpeech training? Did you use the EC2 K80? Roughly how long did it take?


SeanNaren commented 7 years ago

@AdolfVonKleist I used the p2.xlarge instance, and it took around a week and a bit to train (that's a rough estimate, because I had to restart training a few times from the saved models).