NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Jasper 5x3 configuration for published WERs #406

Closed by inchpunch 5 years ago

inchpunch commented 5 years ago

In the paper by Jason Li et al., "Jasper: An End-to-End Convolutional Neural Acoustic Model" (2019, https://arxiv.org/pdf/1904.03288.pdf), Jasper 5x3 has the following results:

[Image: table of published WERs for Jasper 5x3 from the paper]

I am trying to reproduce some of the WERs, but I wonder how many GPUs were used for these results and what the batch size per GPU was.

vsl9 commented 5 years ago

These experiments were done on 8x V100 GPUs with batch_size_per_gpu=64.
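
In the base_params of the config, those two settings look roughly like this (a minimal sketch; all other required parameters are omitted):

base_params = {
    # 8x V100 GPUs with 64 utterances per GPU -> effective batch size of 512
    "num_gpus": 8,
    "batch_size_per_gpu": 64,
    # (model, data-layer, optimizer, etc. parameters omitted)
}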

inchpunch commented 5 years ago

Thanks a lot for the information. I tried 4x V100 GPUs, also with batch_size_per_gpu=64, for 50 epochs, using batch norm and ReLU, and validation on dev-clean gave WER = 7.09% with the greedy decoder (no LM). Was that too good?

borisgin commented 5 years ago

This is a reasonable WER. For J-10x5, I got WER=5.7 for 50 epochs using NovoGrad with init LR=0.02 and weight decay 0.001.
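
If it helps, those two NovoGrad values sit in the optimizer and LR-policy sections of the config; a rough sketch showing only those values (the dict name below is illustrative, and the authoritative settings are in jasper10x5_LibriSpeech_nvgrad.py):

# Only the two values quoted above; see jasper10x5_LibriSpeech_nvgrad.py
# for the complete optimizer block and the exact NovoGrad import.
novograd_overrides = {
    "optimizer_params": {
        "weight_decay": 0.001,   # weight decay mentioned above
    },
    "lr_policy_params": {
        "learning_rate": 0.02,   # initial learning rate mentioned above
    },
}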

borisgin commented 5 years ago

These are the numbers for J-10x5:

Train (epochs) | Greedy WER, %
100            | 4.87
200            | 4.44
400            | 4.23
600            | 4.06

inchpunch commented 5 years ago

I see. I am curious what made the difference between the published WER of 8.82% and my test WER of 7.09% for J-5x3. I would appreciate it if you could point it out. I used these settings:

"optimizer": "Momentum",
"optimizer_params": {
    "momentum": 0.90,
},
"lr_policy": poly_decay,
"lr_policy_params": {
    "learning_rate": 0.01,
    "min_lr": 1e-5,
    "power": 2.0,
},
"larc_params": {
    "larc_eta": 0.001,
},

"regularizer": tf.contrib.layers.l2_regularizer,
"regularizer_params": {
    'scale': 0.001
},

"dtype": "mixed",
"loss_scaling": "Backoff",

vsl9 commented 5 years ago

We did the activation/normalization experiments using a slightly older configuration of Jasper (compared with the latest jasper10x5_LibriSpeech_nvgrad.py). The two most important differences are:

  1. Optimizer (SGD with Momentum vs NovoGrad)
  2. Augmentation (no augmentation vs speed perturbation)

The optimizer parameters in your snippet look good. Have you applied any data augmentation?
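
For context, speed perturbation would be turned on in the training data-layer params; the key name below is an assumption, so check the Speech2Text data-layer documentation for the exact parameter:

# Assumed key name, shown for illustration only; verify against the
# Speech2TextDataLayer documentation before using.
train_data_augmentation = {
    "speed_perturbation_ratio": [0.9, 1.0, 1.1],   # assumed 3-way speed perturbation
}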

inchpunch commented 5 years ago

No, I did not apply data augmentation. I only modified jasper_10x3_8gpus_mp.py from the previous git version to remove the repetition of blocks B1-B5, then changed num_gpus to 4 and num_epochs to 50. Training took 2 days to finish. Could you try that and see if you get the same result?
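
For concreteness, removing the repetition means deleting the duplicated block entries in convnet_layers; a rough sketch assuming the layer-dict layout of the Jasper configs (the numbers below are placeholders, not the real B1 values):

# Each Jasper block is one dict in convnet_layers; in the 10x3 config every
# block appears twice in a row, so keeping one copy of each gives 5x3.
convnet_layers = [
    {
        "type": "conv1d", "repeat": 3,
        "kernel_size": [11], "stride": [1],
        "num_channels": 256, "padding": "SAME",
        "dilation": [1], "dropout_keep_prob": 0.8,
    },
    # ... one entry per remaining block (B2-B5), duplicates removed ...
]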

inchpunch commented 5 years ago

BTW, although I set num_gpus to 4, most of the time I saw only one GPU actively used. The usage looks like this:

[Image: GPU utilization screenshot showing only one active GPU]

I ran the command inside the TensorFlow Docker container like this:

python run.py --config_file=example_configs/speech2text/jasper_5x3_4gpus_mp.py --mode=train_eval

Are there any parameters I missed that are needed to enable all 4 GPUs?

blisc commented 5 years ago

The tests reported in the paper were conducted on a slightly different version of Jasper. The changes from the provided config are as follows:

In terms of GPU utilization, have you disabled Horovod as well? Horovod is meant to be used with mpi4py (mpiexec --allow-run-as-root --np <num_gpus> python run.py ...). The num_gpus parameter has no effect if Horovod is enabled.
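
In other words, there are two ways to run on 4 GPUs (a sketch; use_horovod is the relevant flag in base_params):

# With Horovod: the config sets "use_horovod": True, num_gpus is ignored,
# and the GPU count comes from mpiexec, e.g.
#   mpiexec --allow-run-as-root -np 4 python run.py --config_file=... --mode=train_eval
# Without Horovod: set "use_horovod": False and let num_gpus pick the GPU count:
#   python run.py --config_file=... --mode=train_eval
horovod_params = {"use_horovod": True}
single_process_params = {"use_horovod": False, "num_gpus": 4}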

inchpunch commented 5 years ago

Thanks for your explanation. Yes, I had Horovod enabled in the previous tests. Thanks a lot for your answers.