facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License
30.38k stars 6.4k forks

Is there any documentation on distributed training speed and resource-related benchmarks? #2487

Closed jalajthanaki closed 2 years ago

jalajthanaki commented 4 years ago

❓ Questions and Help

What is your question?

We have explored distributed training for the English-German NMT task and are able to run it successfully. Here are the hardware-resources-versus-training-time benchmarks we obtained:

| H/W resources (single node) | Training time | H/W resources (multi-node, 2 nodes) | Training time | Delta |
| --- | --- | --- | --- | --- |
| K80 GPU (8 GPUs on a single node) | 3 h 23 min | K80 GPU (16 GPUs across 2 nodes) | 2 h 52 min | 30 min speed gain |
| V100 GPU (8 GPUs on a single node) | 1 h 20 min | V100 GPU (16 GPUs across 2 nodes) | 1 h 36 min | 16 min slower |
| V100 GPU (8 GPUs with 32 GB each, single node) | 1 h 10 min | V100 GPU (16 GPUs with 32 GB each, across 2 nodes) | 59 min | 15 min speed gain |
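
As a back-of-the-envelope reading of these numbers (a rough sketch using the times above, not an official metric): ideal 2-node scaling would halve the single-node time, so a scaling efficiency of single-node time divided by twice the 2-node time would be 1.0 in the ideal case, and anything at or below 0.5 means the 2-node run is no faster than a single node.

```bash
# Scaling efficiency relative to ideal 2-node speedup, computed from the table above:
# efficiency = single_node_minutes / (2 * two_node_minutes)
echo "K80,  16 GPUs:       $(echo 'scale=2; 203/(2*172)' | bc)"  # 3h23m vs 2h52m -> .59
echo "V100, 16 GPUs:       $(echo 'scale=2;  80/(2*96)'  | bc)"  # 1h20m vs 1h36m -> .41
echo "V100 32GB, 16 GPUs:  $(echo 'scale=2;  70/(2*59)'  | bc)"  # 1h10m vs 59m   -> .59
```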

Training logs and commands

Single instance with multiple GPUs:

fairseq-train data-bin/iwslt14.tokenized.de-en --arch transformer_iwslt_de_en --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr 0.0005 --min-lr 1e-09 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 8000 --save-dir checkpoints/fconv &> expout.log

Multiple instances (2 nodes), each with multiple GPUs:

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr="10.7.6.21" --master_port=5000 $(which fairseq-train) data-bin/iwslt14.tokenized.de-en --arch transformer_iwslt_de_en --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0005 --min-lr 1e-09 --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 8000 --distributed-no-spawn &> expout.log

python -m torch.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr="10.7.6.21" --master_port=5000 $(which fairseq-train) data-bin/iwslt14.tokenized.de-en --arch transformer_iwslt_de_en --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 --lr 0.0005 --min-lr 1e-09 --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens 8000 --distributed-no-spawn &> expout.log
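
To see which network path NCCL is actually using during these runs, a rough sketch (the interface name eth0 below is a placeholder, not taken from our setup; check `ip addr` on the instances for the real NIC name):

```bash
# Print NCCL's transport/interface selection for the inter-node all-reduce.
export NCCL_DEBUG=INFO            # NCCL logs which transport and NIC it picks
export NCCL_SOCKET_IFNAME=eth0    # placeholder: pin NCCL socket traffic to a specific NIC
# ...then run the same torch.distributed.launch commands as above on each node
# and look for the NET/Socket (plain Ethernet) vs. NET/IB or NET/OFI lines in the log.
```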

Attaching the logs and network utilization

What have you tried?

I'm referring to this fairseq documentation

What's your environment?

Thanks in advance

myleott commented 4 years ago

How are the nodes interconnected? If you're using AWS, the standard Ethernet interconnect is quite slow (traffic has to go GPU->CPU->Ethernet->CPU->GPU), so it's expected that 2 nodes run at about the same speed as 1 node. Beyond that you should see a speedup; for example, 4 nodes should be roughly 2x faster than 2 nodes.
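
If it helps to confirm the interconnect hypothesis, one rough (non-fairseq) check is to measure raw inter-node all-reduce bandwidth with NVIDIA's nccl-tests before timing training; the hostnames, MPI path, and CUDA path below are placeholders:

```bash
# Build nccl-tests with MPI support (adjust paths to your installation).
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi CUDA_HOME=/usr/local/cuda

# One MPI rank per GPU: 2 nodes x 8 GPUs, all-reduce sizes from 8 B to 256 MB.
# The "busbw" column is the effective bus bandwidth; over plain Ethernet it
# will be far below what 16 V100s can consume.
mpirun -np 16 -H node0:8,node1:8 ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
```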

You might want to try Amazon's EFA interface, which is much faster. Here's a benchmark I did a while back on BERT, but the trends are similar for translation.

[image: bert_scaling_aws]

myleott commented 4 years ago

Here's another good thread about this: https://github.com/aws/aws-ofi-nccl/issues/19
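
For reference, a rough sketch of the environment typically involved when running NCCL over EFA via the aws-ofi-nccl plugin (the install paths below are placeholders; see the aws-ofi-nccl README and the AWS EFA docs for the authoritative setup):

```bash
# Assumes the EFA driver, libfabric, and the aws-ofi-nccl plugin are already installed.
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:/opt/aws-ofi-nccl/lib:$LD_LIBRARY_PATH
export FI_PROVIDER=efa             # select the EFA libfabric provider
export FI_EFA_USE_DEVICE_RDMA=1    # GPUDirect RDMA, on instance types that support it
export NCCL_DEBUG=INFO             # confirm in the log that NCCL picked the OFI/EFA transport
# ...then launch fairseq-train with torch.distributed.launch exactly as before.
```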

jalajthanaki commented 4 years ago

Thank you so much @myleott for the suggestion and for sharing the benchmark results. Yes, we are using the standard AWS Ethernet interconnect. We will try out distributed training using AWS EFA.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale. If this issue is still affecting you, please leave any comment (for example, "bump"), and we'll keep it open. We are sorry that we haven't been able to prioritize it yet. If you have any new additional information, please include it with your comment!

stale[bot] commented 2 years ago

Closing this issue after a prolonged period of inactivity. If this issue is still present in the latest release, please create a new issue with up-to-date information. Thank you!