🐛 Bug

Was trying to launch a distributed job on 2 nodes, each with 4 GPUs, using fairseq-hydra-train. Single-node multi-GPU training with fairseq-hydra-train (without torch.distributed.run) runs successfully. However, RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8 was thrown on every node at the beginning of the multi-node job (full error trace below). The error is raised while distributed_init is verifying that all nodes started correctly.

I searched past issues but did not find one that uses torch.distributed.run together with fairseq-hydra-train.
To Reproduce
Steps to reproduce the behavior (always include the command you ran):

python -m torch.distributed.run --nproc_per_node=4 --nnodes=2 --rdzv_id=my_job --rdzv_backend=c10d --rdzv_endpoint=10.0.6.88:1234 <path-to>/fairseq-hydra-train ...

The job fails with the traceback below (Hydra also prints "Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace."):
Traceback (most recent call last):
  File "/storage/home/ec2-user/workspaces/hippo-workspace/src/fairseq/fairseq_cli/hydra_train.py", line 45, in hydra_main
    distributed_utils.call_main(cfg, pre_main)
  File "/storage/home/ec2-user/workspaces/hippo-workspace/src/fairseq/fairseq/distributed/utils.py", line 354, in call_main
    distributed_main(cfg.distributed_training.device_id, main, cfg, kwargs)
  File "/storage/home/ec2-user/workspaces/hippo-workspace/src/fairseq/fairseq/distributed/utils.py", line 322, in distributed_main
    cfg.distributed_training.distributed_rank = distributed_init(cfg)
  File "/storage/home/ec2-user/workspaces/hippo-workspace/src/fairseq/fairseq/distributed/utils.py", line 274, in distributed_init
    dist.all_reduce(torch.zeros(1).cuda())
  File "/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1206, in all_reduce
    work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, invalid usage, NCCL version 2.7.8
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc).
Code sample
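For reference, a minimal standalone sketch that exercises the same code path outside fairseq: create an NCCL process group, then run the same single-tensor all_reduce that fails in the traceback above. The filename check_nccl.py is hypothetical, and the assumption is that it would be launched on both nodes with the same torch.distributed.run arguments as the command above.

# check_nccl.py -- hypothetical filename; minimal sketch, not from the original report.
# Assumes it is launched via torch.distributed.run, which exports RANK, LOCAL_RANK,
# WORLD_SIZE, MASTER_ADDR and MASTER_PORT, so the default env:// init works.
import os

import torch
import torch.distributed as dist


def main():
    # Bind this process to its GPU before creating the NCCL process group.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Same backend fairseq uses for multi-GPU training.
    dist.init_process_group(backend="nccl")

    # Same sanity-check collective that fails inside fairseq's distributed_init.
    t = torch.zeros(1).cuda()
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce ok ({t.item()})")


if __name__ == "__main__":
    main()

If this sketch also raises ncclInvalidUsage on the two hosts, the problem is likely in the multi-node rendezvous/NCCL setup rather than in fairseq-hydra-train itself.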
Expected behavior
Distributed training should launch without a problem.
Environment
fairseq Version (e.g., 1.0 or master): master (40f6c758b3)
PyTorch Version (e.g., 1.0): 1.9
OS (e.g., Linux): Linux
How you installed fairseq (pip, source): pip install -e .
Build command you used (if compiling from source):
Python version: 3.6
CUDA/cuDNN version: 10.0
GPU models and configuration: 4 NVIDIA Tesla V100 GPUs
Additional context
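A hedged debugging aside (an assumption, not something from the runs above): NCCL usually logs the concrete reason behind an ncclInvalidUsage failure when its own logging is enabled, either by exporting NCCL_DEBUG=INFO in the shell before torch.distributed.run, or programmatically before the process group is created:

# Hedged sketch: enable verbose NCCL logging so the "invalid usage" error comes with
# NCCL's own explanation. Must run before the NCCL communicator is created, i.e.
# before dist.init_process_group() is called in the failing process.
import os

os.environ.setdefault("NCCL_DEBUG", "INFO")         # verbose NCCL logging
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT")  # focus on communicator setup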