Closed kthworks closed 1 year ago
DDP works under the principle of summing gradients across ranks in the simplest case, so yes: 32 GPUs * a per-GPU batch size of 32 gives a global batch size of 1024 per step.
You can mimic this with gradient accumulation on 1 GPU. However, note that you would need vastly more steps. Roughly 32 times more.
For example, in the first case each optimizer step is an update with batch size 1024. In the second case each forward pass sees a batch of 32, so you would need to accumulate 32 forward passes (32 * 32 = 1024 samples) to match one step of the first case.
We have not tried such large gradient accumulation and would not recommend it. You can try to reach a global batch size of 256 on one GPU; it might be enough.
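The equivalence above can be sketched without any framework: for a loss that is a mean over samples, the full-batch gradient equals the average of the micro-batch gradients, which is exactly what gradient accumulation computes. A minimal plain-Python illustration (the model, data, and micro-batch size here are arbitrary choices for the demo):

```python
# Toy model: y_hat = w * x, loss = mean((w*x - y)^2) over a batch.
# dL/dw = mean(2 * x * (w*x - y)).

def grad(w, xs, ys):
    """Gradient of the mean squared error w.r.t. w over one batch."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [float(i) for i in range(1, 33)]   # 32 samples
ys = [2.0 * x for x in xs]              # targets generated with w = 2

# One "large batch" gradient over all 32 samples at once.
full = grad(w, xs, ys)

# Same data as 8 micro-batches of 4, gradients averaged (accumulation).
micro = [grad(w, xs[i:i + 4], ys[i:i + 4]) for i in range(0, 32, 4)]
accumulated = sum(micro) / len(micro)

print(full, accumulated)  # the two gradients match
```

In a real training loop the accumulated update is applied once per N micro-batches, which is why the number of optimizer steps per epoch drops by the accumulation factor.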
For the Korean language I'd suggest using Conformer; it is more stable. If you do need to use Citrinet, you should use a pretrained model for the encoder and attach a new decoder to smooth training (see the CTC finetuning tutorial for the method), and train for longer. Citrinet-384 may also not be large enough to learn such a large Korean vocabulary.
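For reference, the "pretrained encoder, new decoder" setup is usually expressed through the training config rather than code. A hypothetical NeMo-style config fragment (key names and the pretrained model name are assumptions; check the citrinet_384.yaml and CTC finetuning tutorial shipped with your NeMo version for the exact paths):

```yaml
# Hypothetical fragment; verify key names against your NeMo release.
trainer:
  devices: 1
  accumulate_grad_batches: 8   # emulate a larger global batch on one GPU

model:
  optim:
    name: novograd
    lr: 0.05                   # the paper's LR assumes the full 1024 global batch
    betas: [0.8, 0.25]
    weight_decay: 0.001

# Initialize weights from a pretrained checkpoint before attaching the
# new-vocabulary decoder (model name here is an assumption):
init_from_pretrained_model: stt_en_citrinet_512
```

Note that if the global batch is much smaller than in the paper, the learning rate typically needs to be scaled down as well, which matches the lr=0.01 experiment mentioned below.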
Thank you for your comments. They were a great help to me!
Hello,
The Citrinet paper says they used 32 V100 GPUs with a batch size of 32 per GPU. So the gradients from each GPU are accumulated and the update is applied jointly?
If I train the Citrinet-384 model on a single GPU with a batch size of 32, will grad_accumulation=32 reproduce the conditions of the paper?
Also, I'm training a multilingual ASR model that includes English (LibriSpeech 960h, wpe 256) and Korean (KsponSpeech 1000h, bpe 1641) with Citrinet-384. I used the same hyperparameters as the paper (NovoGrad optimizer with learning rate (LR) of 0.05, β1 = 0.8, β2 = 0.25, and weight decay of 0.001). However, the validation loss does not seem to decrease through epoch 18.
Are there any tips for this? I'm testing with lr=0.01 now.
Thank you.