NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

[Question] Expected WER on train-clean data set using different models? #392

Closed: shiv6146 closed this issue 5 years ago

shiv6146 commented 5 years ago

I have trained 3 different models on a single 12GB Titan XP GPU with mixed precision, but without the CTC beam search decoder, on the LibriSpeech train-clean dataset. My results are as follows:

  1. Deepspeech2 (100 epochs) => Eval WER: 32.78%
  2. Jasper10x5 (200 epochs) => Eval WER: 15.88%
  3. Wave2Letter++ (200 epochs) => Eval WER: 15.83%

One epoch takes roughly 30 minutes to complete. Is this expected? Are there any recommendations that could help me tune my training to achieve better results?
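For reference, the WER I am reporting is the standard word-level edit distance divided by the number of reference words. A minimal sketch of the metric itself (not OpenSeq2Seq's own implementation):

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ~ 0.333
```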

borisgin commented 5 years ago

What optimizer did you use? What learning rate policy and other training parameters?

blisc commented 5 years ago

Please retrain with float32 and let us know your results. Mixed Precision will not work properly on a Titan XP.

Can you clarify which subsets you are training and evaluating on?

shiv6146 commented 5 years ago

> What optimizer did you use? What learning rate policy and other training parameters?

@borisgin: Here are the hyperparameters I used for training (a sketch of how they map into an OpenSeq2Seq-style config follows the three lists below):

DS2 Hyperparams

  1. batch_size_per_gpu => 16
  2. optimizer => Adam
  3. lr_policy => exp_decay
  4. learning_rate => 0.0001
  5. dtype => mixed
  6. loss_scaling => Backoff
  7. regularizer => l2
  8. initializer => xavier
  9. activation_fn => ReLU
  10. encoder => DeepSpeech2Encoder

Jasper10x5 Hyperparams

  1. batch_size_per_gpu => 16
  2. optimizer => Momentum
  3. lr_policy => poly_decay
  4. learning_rate => 0.01
  5. dtype => mixed
  6. loss_scaling => Backoff
  7. regularizer => l2
  8. initializer => xavier
  9. activation_fn => lambda x: tf.minimum(tf.nn.relu(x), 20.0)
  10. encoder => TDNNEncoder

Wave2Letter++ Hyperparams

  1. batch_size_per_gpu => 16
  2. optimizer => Momentum
  3. lr_policy => poly_decay
  4. learning_rate => 0.05
  5. dtype => mixed
  6. loss_scaling => Backoff
  7. regularizer => l2
  8. initializer => xavier
  9. activation_fn => lambda x: tf.minimum(tf.nn.relu(x), 20.0)
  10. encoder => TDNNEncoder
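
For context, this is roughly how the DS2 settings above sit in an OpenSeq2Seq-style config file. This is only a sketch, modeled loosely on the configs under example_configs/speech2text/; the decay settings, the omitted encoder/decoder/loss params, and other housekeeping fields are placeholders rather than my exact values:

```python
# Sketch only: loosely modeled on example_configs/speech2text/ds2_*.py.
# Field names follow the OpenSeq2Seq config convention; decay settings and
# the omitted encoder/decoder/loss params are placeholders.
import tensorflow as tf
from open_seq2seq.models import Speech2Text
from open_seq2seq.encoders import DeepSpeech2Encoder
from open_seq2seq.optimizers.lr_policies import exp_decay

base_model = Speech2Text

base_params = {
    "num_epochs": 100,
    "batch_size_per_gpu": 16,
    "optimizer": "Adam",
    "lr_policy": exp_decay,
    "lr_policy_params": {
        "learning_rate": 0.0001,
        "begin_decay_at": 0,       # placeholder decay schedule
        "decay_steps": 5000,
        "decay_rate": 0.9,
        "min_lr": 0.0,
    },
    "dtype": "mixed",              # float32 is recommended on GPUs without Tensor Cores
    "loss_scaling": "Backoff",
    "regularizer": tf.contrib.layers.l2_regularizer,
    "regularizer_params": {"scale": 0.0005},
    "initializer": tf.contrib.layers.xavier_initializer,
    "encoder": DeepSpeech2Encoder,
    # encoder_params / decoder / loss omitted for brevity
}
```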

shiv6146 commented 5 years ago

> Please retrain with float32 and let us know your results. Mixed Precision will not work properly on a Titan XP.
>
> Can you clarify which subsets you are training and evaluating on?

@blisc: My training set is train-clean-100 and my validation set is dev-clean.

With reference to the WER numbers mentioned here, it says those results were obtained on the dev-clean subset. Does that refer to a training set or a validation set? The DS2 results are way off for me compared to the other models. Your thoughts on this would help me a lot! Thanks!
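
For completeness, the split is wired up roughly like this in my config (a sketch following the example configs; the CSV paths and the omitted feature settings are assumptions based on the LibriSpeech download scripts and may differ locally):

```python
# Sketch: training on train-clean-100, evaluating on dev-clean.
from open_seq2seq.data import Speech2TextDataLayer

train_params = {
    "data_layer": Speech2TextDataLayer,
    "data_layer_params": {
        "dataset_files": ["data/librispeech/librivox-train-clean-100.csv"],  # assumed path
        "shuffle": True,
        # num_audio_features / input_type / vocab_file omitted for brevity
    },
}

eval_params = {
    "data_layer": Speech2TextDataLayer,
    "data_layer_params": {
        "dataset_files": ["data/librispeech/librivox-dev-clean.csv"],  # assumed path
        "shuffle": False,
    },
}
```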

blisc commented 5 years ago

Those results are from models trained on all of LibriSpeech (train-clean-100 + train-clean-360 + train-other-500), which is roughly 10x larger than train-clean-100 alone.
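
In config terms (a sketch, assuming the CSV names produced by the download scripts and the train_params layout shown earlier in this thread), that means listing all three training subsets in the training data layer:

```python
# Sketch: extend the training data layer from the earlier train_params example.
train_params["data_layer_params"]["dataset_files"] = [
    "data/librispeech/librivox-train-clean-100.csv",   # assumed CSV paths
    "data/librispeech/librivox-train-clean-360.csv",
    "data/librispeech/librivox-train-other-500.csv",
]
```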

DS2 is an older model compared to the other two, so I'm not surprised it performs worse. You should also draw comparisons between models trained for the same number of epochs.

I will reiterate that you should not use mixed precision with a Titan XP, as it does not have Tensor Cores.
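
Concretely, the change is small (a sketch, assuming the base_params layout shown earlier in this thread):

```python
import tensorflow as tf

# Sketch: in the config's base_params, drop mixed precision entirely.
base_params["dtype"] = tf.float32       # instead of "mixed"
base_params.pop("loss_scaling", None)   # loss scaling is only needed for float16
```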