facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License
30.38k stars 6.4k forks

Is there any documentation on distributed training speed and resource-related benchmarks? #2487

Closed jalajthanaki closed 2 years ago

jalajthanaki commented 4 years ago

❓ Questions and Help

What is your question?

We have explored distributed training for the English-German NMT task and are able to run it successfully. Here are the hardware-resources-versus-training-time benchmarks we obtained:

| H/W resources (single node) | Training time | H/W resources (multi-node, 2 nodes) | Training time | Delta |
| --- | --- | --- | --- | --- |
| K80 GPU (8 GPUs on a single node) | 3 h 23 min | K80 GPU (16 GPUs across 2 nodes) | 2 h 52 min | 30 min speed gain |
| V100 GPU (8 GPUs on a single node) | 1 h 20 min | V100 GPU (16 GPUs across 2 nodes) | 1 h 36 min | 16 min slower |
| V100 GPU (8 GPUs with 32 GB each, single node) | 1 h 10 min | V100 GPU (16 GPUs with 32 GB each, across 2 nodes) | 59 min | 15 min speed gain |
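
As a back-of-the-envelope reading of these numbers (a rough sketch using the times above, not an official metric): ideal 2-node scaling would halve the single-node time, so a scaling efficiency of single-node time divided by twice the 2-node time would be 1.0 in the ideal case, and anything at or below 0.5 means the 2-node run is no faster than a single node.

```bash
# Scaling efficiency relative to ideal 2-node speedup, computed from the table above:
# efficiency = single_node_minutes / (2 * two_node_minutes)
echo "K80,  16 GPUs:       $(echo 'scale=2; 203/(2*172)' | bc)"  # 3h23m vs 2h52m -> .59
echo "V100, 16 GPUs:       $(echo 'scale=2;  80/(2*96)'  | bc)"  # 1h20m vs 1h36m -> .41
echo "V100 32GB, 16 GPUs:  $(echo 'scale=2;  70/(2*59)'  | bc)"  # 1h10m vs 59m   -> .59
```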

Training logs and commands

Single instance with multiple GPUs:

fairseq-train data-bin/iwslt14.tokenized.de-en --arch transformer_iwslt_de_en --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr 0.0005 --min-lr 1e-09 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 8000 --save-dir checkpoints/fconv &> expout.log

Multiple instances (2 nodes), each with multiple GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr="10.7.6.21" --master_port=5000 $(which fairseq-train) data-bin/iwslt14.tokenized.de-en --arch transformer_iwslt_de_en --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0005 --min-lr 1e-09 --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 8000 --distributed-no-spawn &> expout.log

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr="10.7.6.21" --master_port=5000 $(which fairseq-train) data-bin/iwslt14.tokenized.de-en --arch transformer_iwslt_de_en --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0005 --min-lr 1e-09 --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 8000 --distributed-no-spawn &> expout.log
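
To see which network path NCCL is actually using during these runs, a rough sketch (the interface name eth0 below is a placeholder, not taken from our setup; check `ip addr` on the instances for the real NIC name):

```bash
# Print NCCL's transport/interface selection for the inter-node all-reduce.
export NCCL_DEBUG=INFO            # NCCL logs which transport and NIC it picks
export NCCL_SOCKET_IFNAME=eth0    # placeholder: pin NCCL socket traffic to a specific NIC
# ...then run the same torch.distributed.launch commands as above on each node
# and look for the NET/Socket (plain Ethernet) vs. NET/IB or NET/OFI lines in the log.
```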

Attaching the logs and network utilization

What have you tried?

I'm referring to this fairseq documentation

What's your environment?

Thanks in advance

myleott commented 4 years ago

How are the nodes interconnected? If you're using AWS, the standard Ethernet interconnect is quite slow (traffic has to go GPU->CPU->Ethernet->CPU->GPU), so it's expected that 2 nodes run at about the same speed as 1 node. Beyond that you should see a speedup; for example, 4 nodes should be roughly 2x faster than 2 nodes.
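
If it helps to confirm the interconnect hypothesis, one rough (non-fairseq) check is to measure raw inter-node all-reduce bandwidth with NVIDIA's nccl-tests before timing training; the hostnames, MPI path, and CUDA path below are placeholders:

```bash
# Build nccl-tests with MPI support (adjust paths to your installation).
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi CUDA_HOME=/usr/local/cuda

# One MPI rank per GPU: 2 nodes x 8 GPUs, all-reduce sizes from 8 B to 256 MB.
# The "busbw" column is the effective bus bandwidth; over plain Ethernet it
# will be far below what 16 V100s can consume.
mpirun -np 16 -H node0:8,node1:8 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
```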

You might want to try Amazon's EFA interface, which is much faster. Here's a benchmark I did a while back on BERT, but the trends are similar for translation.

[image: bert_scaling_aws]

myleott commented 4 years ago

Here's another good thread about this: https://github.com/aws/aws-ofi-nccl/issues/19
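
For reference, a rough sketch of the environment typically involved when running NCCL over EFA via the aws-ofi-nccl plugin (the install paths below are placeholders; see the aws-ofi-nccl README and the AWS EFA docs for the authoritative setup):

```bash
# Assumes the EFA driver, libfabric, and the aws-ofi-nccl plugin are already installed.
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH
export FI_PROVIDER=efa             # select the EFA libfabric provider
export FI_EFA_USE_DEVICE_RDMA=1    # GPUDirect RDMA, on instance types that support it
export NCCL_DEBUG=INFO             # confirm in the log that NCCL picked the OFI/EFA transport
# ...then launch fairseq-train with torch.distributed.launch exactly as before.
```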

jalajthanaki commented 4 years ago

Thank you so much @myleott for the suggestion and for sharing the benchmark results. Yes, we are using the standard AWS Ethernet interconnect. We will try out distributed training using AWS EFA.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!