facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

NCCL error when using multi-node distributed training with fairseq-hydra-train and torch.distributed.run #3704

Open hannw opened 3 years ago

hannw commented 3 years ago

🐛 Bug

I was trying to launch a distributed job across 2 nodes, each with 4 GPUs, using fairseq-hydra-train. Single-node multi-GPU training with fairseq-hydra-train (without torch.distributed.run) runs successfully. However, for the multi-node job, RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8 was thrown on every node right at the start (full error trace below). It is raised inside distributed_init, while it verifies that all nodes have started correctly.
I searched past issues but did not find one that uses torch.distributed.run together with fairseq-hydra-train.
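To help narrow this down, here is a minimal standalone script (my own sketch, not fairseq code) that goes through the same rendezvous and the same kind of all_reduce that fairseq's distributed_init performs. It can be launched on every node with the same torch.distributed.run arguments as the command in the "To Reproduce" section below, in place of fairseq-hydra-train:

```python
# Minimal multi-node NCCL connectivity check (not part of fairseq). It mirrors
# the dist.all_reduce(torch.zeros(1).cuda()) call that fairseq's
# distributed_init runs when the job starts.
import os

import torch
import torch.distributed as dist


def main():
    # torch.distributed.run exports LOCAL_RANK, RANK, WORLD_SIZE,
    # MASTER_ADDR and MASTER_PORT, so the default env:// init works.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Same kind of sanity all_reduce that raises the NCCL error in fairseq.
    dist.all_reduce(torch.zeros(1).cuda())
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce OK")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this check also fails with ncclInvalidUsage, the problem is likely in the rendezvous/NCCL environment (ports, network interface, driver/NCCL versions) rather than in fairseq itself.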

To Reproduce

Steps to reproduce the behavior (always include the command you ran):

  1. Ran python -m torch.distributed.run --nproc_per_node=4 --nnodes=2 --rdzv_id=my_job --rdzv_backend=c10d --rdzv_endpoint=10.0.6.88:1234 <path-to>/fairseq-hydra-train ...
  2. The following error was thrown on every node:

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Traceback (most recent call last):
  File "/storage/home/ec2-user/workspaces/hippo-workspace/src/fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
  File "/storage/home/ec2-user/workspaces/hippo-workspace/src/fairseq/fairseq/distributed/utils.py", line 354, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/storage/home/ec2-user/workspaces/hippo-workspace/src/fairseq/fairseq/distributed/utils.py", line 322, in distributed_main
    cfg.distributed_training.distributed_rank = distributed_init(cfg)
  File "/storage/home/ec2-user/workspaces/hippo-workspace/src/fairseq/fairseq/distributed/utils.py", line 274, in distributed_init
    dist.all_reduce(torch.zeros(1).cuda())
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1206, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
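Since ncclInvalidUsage is fairly generic, NCCL's own log would probably help. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are standard NCCL environment variables (not fairseq options) and have to be set before the communicator is created, e.g. by exporting them in the shell on every node before launching, or via a small wrapper like this sketch (it assumes the fairseq_cli.hydra_train.cli_main entry point that the fairseq-hydra-train console script points to):

```python
# Hedged sketch of a wrapper that turns on NCCL debug logging and then hands
# control to fairseq's hydra entry point. Launch it through torch.distributed.run
# with the same arguments and hydra overrides as fairseq-hydra-train.
import os

# NCCL reads these when the communicator is initialized, so they must be set
# before distributed_init runs.
os.environ.setdefault("NCCL_DEBUG", "INFO")
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")

from fairseq_cli.hydra_train import cli_main  # target of fairseq-hydra-train

if __name__ == "__main__":
    cli_main()
```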

Code sample

Expected behavior

Distributed training should launch without a problem.

Environment

Additional context

LeeSureman commented 3 years ago

+1