jalajthanaki closed this issue 2 years ago.
How are the nodes interconnected? If you're using AWS, then the standard Ethernet interconnect is quite slow (data has to travel GPU->CPU->Ethernet->CPU->GPU), so it's expected that 2 nodes will run at about the same speed as 1 node. Beyond that you should see a speedup; for example, 4 nodes should be 2x faster than 2 nodes.
You might want to try Amazon's EFA interface, which is much faster. Here's a benchmark I did a while back on BERT, but the trends are similar for translation.
Here's another good thread about this: https://github.com/aws/aws-ofi-nccl/issues/19
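If you do switch to EFA, one way to confirm that NCCL is actually using it (rather than falling back to plain TCP sockets) is to turn on NCCL's startup logging. The sketch below uses standard NCCL/libfabric environment variables and assumes the EFA drivers plus the aws-ofi-nccl plugin from the thread above are installed:

```bash
# A minimal sketch: set these on every node before launching training
# (assumes EFA drivers and the aws-ofi-nccl plugin are installed).
export NCCL_DEBUG=INFO    # NCCL logs which network it selects at startup (e.g. NET/OFI vs. NET/Socket)
export FI_PROVIDER=efa    # ask libfabric for the EFA provider; has no effect without EFA hardware

# then launch fairseq-train via torch.distributed.launch as usual
# and check the first few lines of the log for the selected transport
```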
Thank you so much @myleott for the suggestion and for sharing the benchmark results. Yes, we are using the standard AWS Ethernet interconnect. We will try distributed training with AWS EFA.
This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!
Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!
❓ Questions and Help
What is your question?
We have explored distributed training for the English-German NMT task and are able to run it successfully. Here are the hardware-resource vs. training-time benchmarks we have obtained.
Training logs and commands
Single instance with MultiGPU
fairseq-train data-bin/iwslt14.tokenized.de-en --arch transformer_iwslt_de_en --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr 0.0005 --min-lr 1e-09 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 8000 --save-dir checkpoints/fconv &> expout.log
Multiple instances with MultiGPU
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr="10.7.6.21" --master_port=5000 $(which fairseq-train) data-bin/iwslt14.tokenized.de-en --arch transformer_iwslt_de_en --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0005 --min-lr 1e-09 --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 8000 --distributed-no-spawn &> expout.log
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr="10.7.6.21" --master_port=5000 $(which fairseq-train) data-bin/iwslt14.tokenized.de-en --arch transformer_iwslt_de_en --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0005 --min-lr 1e-09 --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 8000 --distributed-no-spawn &> expout.log
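For reference, the same launch generalizes to more nodes by changing only --nnodes and --node_rank on each host. A sketch for a hypothetical 4-node run (rank 2 shown; the interface name eth0 is a placeholder and the remaining fairseq-train arguments are unchanged from above):

```bash
# Hypothetical host with node_rank=2 out of 4; every other argument is identical across hosts.
export NCCL_SOCKET_IFNAME=eth0   # optional: pin NCCL to a specific NIC (interface name is machine-specific)
python -m torch.distributed.launch --nproc_per_node=8 --nnodes=4 --node_rank=2 \
    --master_addr="10.7.6.21" --master_port=5000 $(which fairseq-train) \
    data-bin/iwslt14.tokenized.de-en --arch transformer_iwslt_de_en ... --distributed-no-spawn
```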
Attaching the logs and network utilization
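In case it helps with reproducing the numbers, the network utilization above can be captured with a standard tool such as sar from the sysstat package (a sketch; the 1-second interval is arbitrary):

```bash
# Sample per-interface network throughput every second while training runs (requires sysstat).
sar -n DEV 1
```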
What have you tried?
I'm referring to this fairseq documentation
What's your environment?
Thanks in advance