These experiments were done on 8x V100 GPUs with batch_size_per_gpu=64.
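For comparison with other setups, the effective global batch size is simply the per-GPU batch size times the number of GPUs (a quick check, assuming synchronous data-parallel training):

```python
num_gpus = 8
batch_size_per_gpu = 64
global_batch_size = num_gpus * batch_size_per_gpu
print(global_batch_size)  # 512
```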
Thanks a lot for the information. I tried with 4x V100 GPUs, also with batch size per GPU = 64, 50 epochs, using batch norm and ReLU, and validation on dev-clean got WER = 7.09% with a greedy decoder and no LM. Was that too good?
This is a reasonable WER. For J-10x5, I got WER = 5.7% after 50 epochs using NovoGrad with init LR = 0.02 and weight decay 0.001.
These are the numbers for J-10x5:

| Train (epochs) | Greedy WER, % |
|---|---|
| 100 | 4.87 |
| 200 | 4.44 |
| 400 | 4.23 |
| 600 | 4.06 |
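For reference, greedy WER here is the word-level edit distance between the greedy decoder's transcript and the reference transcript, divided by the number of reference words. A minimal sketch of the metric (not the OpenSeq2Seq implementation):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:  # degenerate case: empty reference
        return float(len(hyp) > 0)
    # dp[j] = edit distance between ref[:i] and hyp[:j], updated row by row
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                            # deletion
                        dp[j - 1] + 1,                        # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))    # substitution
            prev = cur
    return dp[-1] / len(ref)

# 1 substitution ("sat" -> "sit") + 1 deletion ("the"): 2 errors / 6 words
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```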
I see. I am curious what made the difference between the published WER = 8.82% and my test WER = 7.09% for J-5x3. I would appreciate it if you could point it out. I used these settings:
"optimizer": "Momentum",
"optimizer_params": {
"momentum": 0.90,
},
"lr_policy": poly_decay,
"lr_policy_params": {
"learning_rate": 0.01,
"min_lr": 1e-5,
"power": 2.0,
},
"larc_params": {
"larc_eta": 0.001,
},
"regularizer": tf.contrib.layers.l2_regularizer,
"regularizer_params": {
'scale': 0.001
},
"dtype": "mixed",
"loss_scaling": "Backoff",
We did the activation/normalization experiments using a slightly older configuration of Jasper (in comparison with the latest jasper10x5_LibriSpeech_nvgrad.py). The two most important differences are:
The optimizer's parameters from your snippet look good. Have you applied any data augmentation?
No, I did not apply data augmentation. Actually, I only modified the previous git version's jasper_10x3_8gpus_mp.py to remove the repetition of B1-B5, then changed num_gpus to 4 and num_epochs to 50. Training took 2 days to finish. Could you try that and see if you get the same result?
BTW, although I set num_gpus to 4, most of the time I saw that only one GPU was actively used. The usage looks like this:
I ran the command inside the TensorFlow Docker container like this:

```shell
python run.py --config_file=example_configs/speech2text/jasper_5x3_4gpus_mp.py --mode=train_eval
```

Are there any parameters that I missed to enable all 4 GPUs?
The experiments in the paper were conducted on a slightly different version of Jasper. The changes from the provided config are as follows:
In terms of GPU utilization, have you disabled Horovod? Horovod is meant to be used with mpi4py (`mpiexec --allow-run-as-root --np <num_gpus> python run.py ...`). The num_gpus parameter has no effect if Horovod is enabled.
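For example, to train on all 4 GPUs with Horovod enabled, the launch would look something like this (same config file and mode as the run above; one process is spawned per GPU):

```shell
# Horovod/MPI launch: mpiexec starts 4 training processes, one per GPU.
# The num_gpus setting in the config is ignored in this mode.
mpiexec --allow-run-as-root --np 4 \
    python run.py \
    --config_file=example_configs/speech2text/jasper_5x3_4gpus_mp.py \
    --mode=train_eval
```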
Thanks for the explanation. Yes, I had Horovod enabled in the previous tests. Thanks a lot for your answers.
In the paper by Jason Li et al., "Jasper: An End-to-End Convolutional Neural Acoustic Model", 2019 (https://arxiv.org/pdf/1904.03288.pdf), Jasper 5x3 has the following results:
I am trying to reproduce some of the WERs, but I wonder how many GPUs were used for these results and what the batch size per GPU was?