Alexander-H-Liu / End-to-end-ASR-Pytorch

This is an open-source project (formerly named Listen, Attend and Spell - PyTorch Implementation) for end-to-end ASR implemented with PyTorch, the well-known deep learning toolkit.
MIT License

What's the WER if training on full librispeech train set? #35

Open · iamxiaoyubei opened this issue 5 years ago

iamxiaoyubei commented 5 years ago

Has anyone trained it on the full LibriSpeech training set (train-clean-100, train-clean-360, train-other-500)? Could you share the WER you got when training on all of them? Thank you!

Youyoun commented 5 years ago

Hi @iamxiaoyubei,

I have tried training on LibriSpeech 960h using libri960h_example.yaml (with a much smaller batch size, since I didn't have enough memory) and without an RNN language model, and I got around 24% WER on dev-clean and test-clean.

By tweaking some of the parameters (especially the sampling rate and the number of LSTM cells), I got it down to 14%.

Note that I'm clearly not training the network for as many epochs as I should (I trained for ~15 epochs instead of 80 or 100), so maybe that's why my WER is so high.

Small erratum: while the training did not last as long as it should have, the curve shows that the model has largely stagnated, improving by less than 0.05 WER per epoch.
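
To make that "stagnation" criterion concrete, here is a minimal standalone sketch (not code from this repo; the function name, threshold, and patience value are just illustrative):

```python
# Illustrative sketch only: decide whether the dev-set WER curve has stagnated,
# i.e. improves by less than some threshold (here 0.05 absolute WER) per epoch.

def has_stagnated(wer_per_epoch, threshold=0.05, patience=3):
    """Return True if the last `patience` epoch-to-epoch WER improvements
    are all smaller than `threshold`."""
    if len(wer_per_epoch) < patience + 1:
        return False
    recent = wer_per_epoch[-(patience + 1):]
    improvements = [prev - curr for prev, curr in zip(recent, recent[1:])]
    return all(imp < threshold for imp in improvements)

# Example: WER drops quickly at first, then barely moves.
history = [0.60, 0.45, 0.33, 0.27, 0.25, 0.243, 0.238]
print(has_stagnated(history))  # True: the last few improvements are all < 0.05
```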

miraodasilva commented 5 years ago

Hello @Youyoun,

Can you share the specific tweaks you made to the sampling rate and the number of LSTM cells? I would really appreciate it, since I am about to train the model myself. Thanks a lot in advance!

Youyoun commented 4 years ago

Hey @miraodasilva!

Sorry for the delay. I basically tried to follow the model introduced in the SpecAugment paper: 4 LSTM layers with 1024 units each in the encoder, and 1 LSTM layer with 1024 units in the decoder.

If you're trying to follow the pyramidal structure of the encoder, then use 1 2 2 1 for the per-layer sampling rates.
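
To illustrate what that looks like, here is a standalone PyTorch sketch (not this repo's actual code; the class and argument names are mine) of a pyramidal BLSTM encoder with 4 layers of 1024 units and per-layer subsampling of 1 2 2 1, plus a 1-layer 1024-unit LSTM decoder cell:

```python
# Standalone sketch (not this repo's implementation): pyramidal BLSTM encoder
# with 4 x 1024-unit layers and time subsampling of 1, 2, 2, 1, plus a
# 1-layer 1024-unit LSTM decoder cell (attention omitted for brevity).
import torch
import torch.nn as nn

class PyramidalBLSTMEncoder(nn.Module):
    def __init__(self, input_dim=80, hidden_dim=1024, sample_rates=(1, 2, 2, 1)):
        super().__init__()
        self.sample_rates = sample_rates
        layers = []
        in_dim = input_dim
        for _ in sample_rates:
            # Each layer is a bidirectional LSTM; its output (2 * hidden_dim)
            # feeds the next layer after optional time subsampling.
            layers.append(nn.LSTM(in_dim, hidden_dim, batch_first=True, bidirectional=True))
            in_dim = 2 * hidden_dim
        self.layers = nn.ModuleList(layers)

    def forward(self, x):  # x: (batch, time, input_dim)
        for lstm, rate in zip(self.layers, self.sample_rates):
            x, _ = lstm(x)
            if rate > 1:
                # "drop"-style subsampling: keep every `rate`-th frame.
                x = x[:, ::rate, :]
        return x  # (batch, time / 4, 2 * hidden_dim) for rates 1, 2, 2, 1

class DecoderCell(nn.Module):
    """Single 1024-unit LSTM decoder cell; a full LAS model would concatenate
    the attention context to the step input."""
    def __init__(self, input_dim, hidden_dim=1024):
        super().__init__()
        self.cell = nn.LSTMCell(input_dim, hidden_dim)

    def forward(self, step_input, state=None):
        return self.cell(step_input, state)

enc = PyramidalBLSTMEncoder()
feats = torch.randn(2, 400, 80)  # (batch, frames, filterbank dim)
enc_out = enc(feats)
print(enc_out.shape)             # torch.Size([2, 100, 2048])
```

The subsampling factors multiply to 4, so the encoder emits one frame for every four input frames, which is what the 1 2 2 1 pyramid gives you.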

miraodasilva commented 4 years ago

Hello @Youyoun,

Ok, thanks a lot for the info!

tsxce commented 4 years ago

Hi, can you share what sampling rate you used?