Bobrosoft98 opened this issue 5 years ago
@okhonko
Hi,
I'm having similar results on 1 GPU for a different dataset. Could you share with us the parameters you used to improve the results?
Thank you
Hi, I was having similar issues, but I was able to do better with the default settings on one GPU by simulating a larger batch size with --update-freq 16.
@alexbie98 I actually used this parameter when training on 1 GPU, and it didn't help. Can you elaborate on "do better"? Did you replicate the paper's WER?
@carlosep93 My parameters were: --optimizer adam --lr 5e-4 --fp16 --memory-efficient-fp16 --warmup-updates 2500 --update-freq 4
I also changed the batching logic to pack as much data onto each GPU as possible, resulting in an average batch size of 670 across all 8 GPUs. Only after that did it start training properly.
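For reference, a rough sketch of how those overrides might be appended to the repo's default training command. The data and save paths are placeholders, and the remaining default flags (task, architecture, criterion, lr scheduler) are assumed to come from the example in the repo; the batching-logic change is a code modification and is not reflected here:

```bash
# Sketch only: the flag values are the ones quoted in this thread; the paths and
# the omitted task/model/criterion/scheduler flags are assumed to match the
# repo's default speech_recognition example.
fairseq-train "$DATA_DIR" \
    --save-dir "$SAVE_DIR" \
    --max-tokens 5000 \
    --optimizer adam --lr 5e-4 \
    --fp16 --memory-efficient-fp16 \
    --warmup-updates 2500 \
    --update-freq 4
```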
Right now it's at 96% train accuracy / 91.7% valid accuracy after training for 5 days (epoch 31). I have not yet matched the reported WER; I'm getting 9.9 with the current checkpoint. The loss/accuracy plateaus for a while before the loss drops quite low.
Wow, that looks nice! What batch size do you have? Also, could you share the accuracy plot?
https://i.imgur.com/dKadcXq.png
The effective batch size is 80k. My training command is the same as the one in the repo with --update-freq 16
Thanks for providing the plot! Are you sure about 80k? I think the whole LibriSpeech train set has around 200k utterances, which would mean only 3 batches per epoch in your case.
Sorry, 80k tokens*. Using the default command's --max-tokens 5000 with --update-freq 16, the average number of sentences per batch is around 60.
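(For anyone puzzled by the numbers: the 80k figure is back-of-the-envelope arithmetic, roughly --max-tokens per GPU times the number of GPUs times --update-freq, not something fairseq prints itself.)

```bash
# Illustration only: effective tokens per optimizer step
# is roughly --max-tokens per GPU x number of GPUs x --update-freq.
echo $(( 5000 * 1 * 16 ))   # -> 80000 tokens per update on a single GPU
```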
Sorry for the off-topic reply, but could you share how you plot the training accuracy?
If I recall correctly, passing a directory to --tensorboard-logdir will generate these plots, viewable in TensorBoard. I haven't used this in a while, though.
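A minimal sketch of what that looks like, assuming the training command from earlier in the thread; the log directory name is an arbitrary placeholder, and only --tensorboard-logdir and the tensorboard CLI are being relied on here:

```bash
# Append TensorBoard logging to whatever training command you are already using.
# "$SAVE_DIR/tb" is a placeholder; requires the tensorboard/tensorboardX package.
fairseq-train "$DATA_DIR" \
    --save-dir "$SAVE_DIR" \
    --max-tokens 5000 --update-freq 16 \
    --tensorboard-logdir "$SAVE_DIR/tb"

# Then, in another shell, point TensorBoard at the same directory:
tensorboard --logdir "$SAVE_DIR/tb"
```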
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
Hi,
I am having trouble reproducing the speech recognition results. With the default settings, the model stagnates at 25% train accuracy. By employing a different optimizer, increasing the batch size, and tuning the learning rate, I was able to reach 8% WER, but that is still far from the claimed 5% without tuning.
Could you please provide additional information about your configuration (the model, number of GPUs, and total batch size), or, even better, logs and/or model checkpoints?
Thank you.