NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Jasper 5x3 configuration for published WERs #406

Closed by inchpunch 5 years ago

inchpunch commented 5 years ago

In the paper by Jason Li et al., "Jasper: An End-to-End Convolutional Neural Acoustic Model" (2019, https://arxiv.org/pdf/1904.03288.pdf), Jasper 5x3 has the following results:

[Image: table of published WERs for Jasper 5x3 from the paper]

I am trying to reproduce some of the WERs, but I wonder how many GPUs were used for these results and what the batch size per GPU was.

vsl9 commented 5 years ago

These experiments were done on 8x V100 GPUs with batch_size_per_gpu=64.
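
In the base_params of the config, those two settings look roughly like this (a minimal sketch; all other required parameters are omitted):

base_params = {
    # 8x V100 GPUs with 64 utterances per GPU -> effective batch size of 512
    "num_gpus": 8,
    "batch_size_per_gpu": 64,
    # (model, data-layer, optimizer, etc. parameters omitted)
}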

inchpunch commented 5 years ago

Thanks a lot for the information. I tried 4x V100 GPUs, also with batch_size_per_gpu=64, for 50 epochs, using batch norm and ReLU, and validation on dev-clean gave WER = 7.09% with the greedy decoder (no LM). Was that too good?

borisgin commented 5 years ago

This is a reasonable WER. For J-10x5, I got WER=5.7 for 50 epochs using NovoGrad with init LR=0.02 and weight decay 0.001.
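
If it helps, those two NovoGrad values sit in the optimizer and LR-policy sections of the config; a rough sketch showing only those values (the dict name below is illustrative, and the authoritative settings are in jasper10x5_LibriSpeech_nvgrad.py):

# Only the two values quoted above; see jasper10x5_LibriSpeech_nvgrad.py
# for the complete optimizer block and the exact NovoGrad import.
novograd_overrides = {
    "optimizer_params": {
        "weight_decay": 0.001,   # weight decay mentioned above
    },
    "lr_policy_params": {
        "learning_rate": 0.02,   # initial learning rate mentioned above
    },
}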

borisgin commented 5 years ago

These are the numbers for J-10x5:

Train (epochs) | Greedy WER, %
100            | 4.87
200            | 4.44
400            | 4.23
600            | 4.06

inchpunch commented 5 years ago

I see. I am curious what made the difference between the published WER of 8.82% and my test WER of 7.09% for J-5x3. I would appreciate it if you could point it out. I used these settings:

"optimizer": "Momentum",
"optimizer_params": {
    "momentum": 0.90,
},
"lr_policy": poly_decay,
"lr_policy_params": {
    "learning_rate": 0.01,
    "min_lr": 1e-5,
    "power": 2.0,
},
"larc_params": {
    "larc_eta": 0.001,
},

"regularizer": tf.contrib.layers.l2_regularizer,
"regularizer_params": {
    'scale': 0.001
},

"dtype": "mixed",
"loss_scaling": "Backoff",

vsl9 commented 5 years ago

We did the activation/normalization experiments using a slightly older configuration of Jasper (compared with the latest jasper10x5_LibriSpeech_nvgrad.py). The two most important differences are:

  1. Optimizer (SGD with Momentum vs NovoGrad)
  2. Augmentation (no augmentation vs speed perturbation)

The optimizer parameters in your snippet look good. Have you applied any data augmentation?
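
For context, speed perturbation would be turned on in the training data-layer params; the key name below is an assumption, so check the Speech2Text data-layer documentation for the exact parameter:

# Assumed key name, shown for illustration only; verify against the
# Speech2TextDataLayer documentation before using.
train_data_augmentation = {
    "speed_perturbation_ratio": [0.9, 1.0, 1.1],   # assumed 3-way speed perturbation
}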

inchpunch commented 5 years ago

No, I did not apply data augmentation. I only modified jasper_10x3_8gpus_mp.py from the previous git version to remove the repetition of blocks B1-B5, then changed num_gpus to 4 and num_epochs to 50. Training took 2 days to finish. Could you try that and see if you get the same result?
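
For concreteness, removing the repetition means deleting the duplicated block entries in convnet_layers; a rough sketch assuming the layer-dict layout of the Jasper configs (the numbers below are placeholders, not the real B1 values):

# Each Jasper block is one dict in convnet_layers; in the 10x3 config every
# block appears twice in a row, so keeping one copy of each gives 5x3.
convnet_layers = [
    {
        "type": "conv1d", "repeat": 3,
        "kernel_size": [11], "stride": [1],
        "num_channels": 256, "padding": "SAME",
        "dilation": [1], "dropout_keep_prob": 0.8,
    },
    # ... one entry per remaining block (B2-B5), duplicates removed ...
]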

inchpunch commented 5 years ago

BTW, although I set num_gpus to 4, most of the time I saw only one GPU actively used. The usage looks like this:

[Image: GPU utilization screenshot showing only one active GPU]

I ran the command inside the TensorFlow Docker container like this:

python run.py --config_file=example_configs/speech2text/jasper_5x3_4gpus_mp.py --mode=train_eval

Are there any parameters I missed that are needed to enable all 4 GPUs?

blisc commented 5 years ago

The tests reported in the paper were conducted on a slightly different version of Jasper. The changes from the provided config are as follows:

In terms of GPU utilization, have you disabled Horovod as well? Horovod is meant to be used with mpi4py (mpiexec --allow-run-as-root --np <num_gpus> python run.py ...). The num_gpus parameter has no effect if Horovod is enabled.
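
In other words, there are two ways to run on 4 GPUs (a sketch; use_horovod is the relevant flag in base_params):

# With Horovod: the config sets "use_horovod": True, num_gpus is ignored,
# and the GPU count comes from mpiexec, e.g.
#   mpiexec --allow-run-as-root -np 4 python run.py --config_file=... --mode=train_eval
# Without Horovod: set "use_horovod": False and let num_gpus pick the GPU count:
#   python run.py --config_file=... --mode=train_eval
horovod_params = {"use_horovod": True}
single_process_params = {"use_horovod": False, "num_gpus": 4}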

inchpunch commented 5 years ago

Thanks for your explanation. Yes, I had Horovod enabled in the previous tests. Thanks a lot for your answers.