Open 18445864529 opened 2 years ago
Hi,
On Slurm you can do srun --nodes=${nnodes} --gpus-per-node=${ngpus_per_node} fairseq-hydra-train --args.
Unfortunately, Slurm is not installed on our cluster, and I don't have root privileges to set it up. Are there any other startup methods, e.g. torchrun, that work with fairseq-hydra-train?
I think it should be similar to running a usual PyTorch multi-node application:
torchrun --nnodes=${nnode} \
    --nproc_per_node=${nproc_per_node} \
    --rdzv_id=$JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$HOST_NODE_ADDR \
    fairseq-hydra-train --args
where you need to specify the other arguments such as HOST_NODE_ADDR.
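As a quick sanity check of a torchrun launch, a minimal script (hypothetical, not part of fairseq) can print the ranks that the rendezvous assigned; torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to every worker it spawns:

```python
import os

def torchrun_ranks(env=os.environ):
    """torchrun exports RANK, LOCAL_RANK and WORLD_SIZE to every worker;
    fall back to single-process defaults when launched directly."""
    return (int(env.get("RANK", 0)),
            int(env.get("LOCAL_RANK", 0)),
            int(env.get("WORLD_SIZE", 1)))

rank, local_rank, world_size = torchrun_ranks()
print(f"global rank {rank} / {world_size}, local rank {local_rank}")
```

Running one copy of this per node under torchrun should show every global rank exactly once before trying the full training job.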
Btw, when you override the distributed_training arguments in fairseq:
+override.distributed_training.distributed_world_size=${world_size} \
+override.distributed_training.nprocs_per_node=${nproc_per_node} \
+override.distributed_training.distributed_port=12356 \
It should be:
distributed_training.distributed_world_size=${world_size} \
distributed_training.nprocs_per_node=${nproc_per_node} \
distributed_training.distributed_port=12356 \
Thank you for the reply, it's very kind of you! I'll try again tomorrow. I thought it should be +override.*** when the argument already exists in the yaml, and without +override when it does not (as you suggested in another issue). Was I wrong?
If `key` is in the yaml, just do `key=...` on the command line. If `key` is not in the yaml, use `+key=...`.
`override` is one key we added in the decoding config, which is only used at test time.
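The `key=` vs `+key=` rule can be sketched with a plain dict (purely illustrative, not Hydra's actual implementation): an existing key may be overridden directly, while a key absent from the yaml must be added explicitly, which is what the `+` prefix signals.

```python
def apply_override(cfg: dict, key: str, value, allow_add: bool = False):
    """Mimics Hydra's rule: `key=value` may only touch keys already in the
    yaml; `+key=value` (allow_add=True) is required to introduce a new key."""
    *path, leaf = key.split(".")
    node = cfg
    for p in path:
        node = node[p]
    if leaf not in node and not allow_add:
        raise KeyError(f"'{key}' not in config; use +{key}=... to add it")
    node[leaf] = value
    return cfg

cfg = {"distributed_training": {"distributed_port": -1}}
# key exists in the yaml: plain override
apply_override(cfg, "distributed_training.distributed_port", 12356)
# key absent from the yaml: needs the + form
apply_override(cfg, "distributed_training.distributed_init_method",
               "tcp://127.0.0.1:12356", allow_add=True)
```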
Clear to me now, thanks again for the clarification. 👍 Will try distributed training again tomorrow; hopefully it will work.
Really frustrating, I've been working on this for a whole day and I just couldn't make it work. :-< Here is what I do (I put the port number 12356 in the YAML):
torchrun --nnodes=$nnodes --nproc_per_node=$nproc_per_node \
--rdzv_backend=c10d --rdzv_endpoint=${master_addr}:12356 --rdzv_id=$node_rank \
$(which fairseq-hydra-train) \
--config-dir ./conf/finetune/ --config-name base_lrs3_30h.yaml \
task.data=${data_dir} task.label_dir=${data_dir} \
task.tokenizer_bpe_model=/gs/hs0/tga-tslab/bowen/Dataset/LRS3/spm1000/spm_unigram1000.model \
model.w2v_path=/gs/hs0/tga-tslab/bowen/av_hubert/pretrain_model/base_lrs3_iter5.pt \
hydra.run.dir=./exps/finetune/dist common.user_dir=`pwd` \
distributed_training.distributed_world_size=${world_size} \
distributed_training.nprocs_per_node=${nproc_per_node} \
optimization.update_freq=[${update_freq}]
and also adding the line `cfg.distributed_training.device_id = int(os.environ["LOCAL_RANK"])` to `call_main()` in distributed/utils.py, as the project can no longer accept the `--local_rank` argument from torch.distributed.launch. (It turns out the same error occurs regardless of this line.)
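For reference, reading the local rank from the environment torchrun provides can be sketched as follows (a hypothetical helper, not fairseq's actual code):

```python
import os

def resolve_device_id(default: int = 0) -> int:
    """torchrun exports LOCAL_RANK per worker (torch.distributed.launch used
    to pass --local_rank instead); fall back for single-process runs."""
    return int(os.environ.get("LOCAL_RANK", default))

# e.g. inside call_main():
#   cfg.distributed_training.device_id = resolve_device_id()
```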
And then, this is what I got for the master node:
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{
"message": {
"message": "RendezvousTimeoutError: ",
"extraInfo": {
"py_callstack": "Traceback (most recent call last):\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 345, in wrapper\n return f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py\", line 719, in main\n run(args)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py\", line 710, in run\n elastic_launch(\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 131, in __call__\n return launch_agent(self._config, self._entrypoint, list(args))\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 252, in launch_agent\n result = agent.run()\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 709, in run\n result = self._invoke_run(role)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 837, in _invoke_run\n self._initialize_workers(self._worker_group)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 678, in _initialize_workers\n self._rendezvous(worker_group)\n File 
\"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 538, in _rendezvous\n store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 1024, in next_rendezvous\n self._op_executor.run(join_op, deadline)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 637, in run\n raise RendezvousTimeoutError()\ntorch.distributed.elastic.rendezvous.api.RendezvousTimeoutError\n",
"timestamp": "1645101017"
}
}
}
Traceback (most recent call last):
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
result = agent.run()
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 837, in _invoke_run
self._initialize_workers(self._worker_group)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
self._rendezvous(worker_group)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 538, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1024, in next_rendezvous
self._op_executor.run(join_op, deadline)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 637, in run
raise RendezvousTimeoutError()
torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError
Finished
and this is for the slave node:
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node '******' has failed to shutdown the rendezvous '1' due to an error of type RendezvousConnectionError.
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{
"message": {
"message": "RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.",
"extraInfo": {
"py_callstack": "Traceback (most recent call last):\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 113, in _call_store\n return getattr(self._store, store_op)(*args, **kwargs)\nRuntimeError: Connection reset by peer\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 345, in wrapper\n return f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py\", line 719, in main\n run(args)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py\", line 710, in run\n elastic_launch(\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 131, in __call__\n return launch_agent(self._config, self._entrypoint, list(args))\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 252, in launch_agent\n result = agent.run()\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 709, in run\n result = self._invoke_run(role)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 837, in _invoke_run\n self._initialize_workers(self._worker_group)\n File 
\"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 678, in _initialize_workers\n self._rendezvous(worker_group)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 538, in _rendezvous\n store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 1024, in next_rendezvous\n self._op_executor.run(join_op, deadline)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 606, in run\n has_set = self._state_holder.sync()\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 408, in sync\n get_response = self._backend.get_state()\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 73, in get_state\n base64_state: bytes = self._call_store(\"get\", self._key)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 115, in _call_store\n raise RendezvousConnectionError(\ntorch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. 
See inner exception for details.\n",
"timestamp": "1645101018"
}
}
}
Traceback (most recent call last):
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
result = agent.run()
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 837, in _invoke_run
self._initialize_workers(self._worker_group)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
self._rendezvous(worker_group)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 538, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1024, in next_rendezvous
self._op_executor.run(join_op, deadline)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 606, in run
has_set = self._state_holder.sync()
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 408, in sync
get_response = self._backend.get_state()
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
I googled every relevant question but still didn't find a clear solution. Now I'm not sure where to go next. >_<
Hi,
Several things here:
1. `rdzv_id` should be set to the job id, which is shared by all nodes.
2. `fairseq-hydra-train` should be replaced with the python file name, fairseq/fairseq_cli/hydra_train.py.
Btw, I don't think you need to change anything in distributed/utils.py.
I tested a multi-node setup on a single machine with two GPUs, and below is how I ran it:
Master:
CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=2 --nproc_per_node=1 --rdzv_id=avhubert --rdzv_backend=c10d --rdzv_endpoint="127.0.0.1:12345" `pwd`/../fairseq/fairseq_cli/hydra_train.py --args
Slave:
CUDA_VISIBLE_DEVICES=1 torchrun --nnodes=2 --nproc_per_node=1 --rdzv_id=avhubert --rdzv_backend=c10d --rdzv_endpoint="127.0.0.1:12345" `pwd`/../fairseq/fairseq_cli/hydra_train.py --args
`rdzv_endpoint` should be changed accordingly in your case.
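The two launch commands above differ only in CUDA_VISIBLE_DEVICES; a tiny helper (hypothetical, for illustration) can assemble the shared torchrun arguments so that master and slave stay in sync:

```python
def torchrun_args(nnodes: int, nproc_per_node: int, rdzv_id: str,
                  endpoint: str) -> list:
    """All nodes must agree on rdzv_id, rdzv_backend and rdzv_endpoint;
    only per-node settings like CUDA_VISIBLE_DEVICES may differ."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--rdzv_id={rdzv_id}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={endpoint}",
    ]

print(" ".join(torchrun_args(2, 1, "avhubert", "127.0.0.1:12345")))
```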
Yeah, the `rdzv_id` was the cause of that error; it should be the same for all nodes. I should've read the docs more carefully.
But I think the line `cfg.distributed_training.device_id = int(os.environ["LOCAL_RANK"])` is necessary when using torchrun: without it, `device_id` will always be 0, resulting in multiple processes being assigned to the same device. (`device_id` is supposed to be received from `--local_rank`, but torchrun no longer passes it, as mentioned here.)
Below is what happens if the local rank is not read from os.environ. (I think it worked in your test case because you had only one process per node and also specified CUDA_VISIBLE_DEVICES=1 for the second.)
RuntimeError: work = default_pg.allreduce([tensor], opts) NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc)
However, there are still several things here. torchrun somehow always misjudges the master and the slave, initializing the slave node as ranks 0,1,2,3 and the master as 4,5,6,7, finally leading to
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error: unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
I kinda gave up on using torchrun and instead let fairseq spawn the processes; to this end I just launch with
fairseq-hydra-train \
--config-dir ./conf/finetune/ --config-name base_lrs3_30h.yaml \
task.data=${data_dir} task.label_dir=${data_dir} \
task.tokenizer_bpe_model=/gs/hs0/tga-tslab/bowen/Dataset/LRS3/spm1000/spm_unigram1000.model \
model.w2v_path=/gs/hs0/tga-tslab/bowen/av_hubert/pretrain_model/base_lrs3_iter5.pt \
hydra.run.dir=./exps/finetune/dist common.user_dir=`pwd` \
distributed_training.distributed_world_size=${world_size} \
distributed_training.nprocs_per_node=${nproc_per_node} \
optimization.update_freq=[${update_freq}] \
+distributed_training.distributed_init_method="tcp://${master_addr}:12356" \
distributed_training.distributed_rank=$((node_rank * 4))
and finally all processes communicated successfully. In this case the added line should be removed, as the local ranks are assigned automatically.
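The rank arithmetic in that launch can be sketched as follows (illustrative only; the 4 in `$((node_rank * 4))` is the number of GPUs per node here):

```python
def global_rank(node_rank: int, nprocs_per_node: int, local_rank: int = 0) -> int:
    """fairseq spawns nprocs_per_node workers per node; distributed_rank on
    the command line is the rank of each node's first (local_rank=0) worker."""
    return node_rank * nprocs_per_node + local_rank

# two nodes with 4 GPUs each -> world_size 8
assert global_rank(0, 4) == 0      # master's first worker
assert global_rank(1, 4) == 4      # second node's first worker
assert global_rank(1, 4, 3) == 7   # second node's last worker
```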
Just as I was feeling very close to success, I got stuck...
After printing the following, no further messages are printed and the processes hang.
[2022-02-18 23:06:35,809][fairseq.utils][INFO] - ***********************CUDA enviroments for all 8 workers***********************
[2022-02-18 23:06:35,809][fairseq.utils][INFO] - rank 0: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,809][fairseq.utils][INFO] - rank 1: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - rank 2: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - rank 3: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - rank 4: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - rank 5: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - rank 6: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - rank 7: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - ***********************CUDA enviroments for all 8 workers***********************
[2022-02-18 23:06:35,810][fairseq_cli.train][INFO] - training on 8 devices (GPUs/TPUs)
[2022-02-18 23:06:35,811][fairseq_cli.train][INFO] - max tokens per device = 1000 and max sentences per device = None
[2022-02-18 23:06:35,812][fairseq.trainer][INFO] - Preparing to load checkpoint checkpoints/checkpoint_last.pt
[2022-02-18 23:06:35,812][fairseq.trainer][INFO] - No existing checkpoint found checkpoints/checkpoint_last.pt
[2022-02-18 23:06:35,812][fairseq.trainer][INFO] - loading train data for epoch 1
[2022-02-18 23:06:35,813][avhubert.hubert_pretraining][INFO] - Using tokenizer
[2022-02-18 23:06:36,097][avhubert.hubert_dataset][INFO] - max_keep=500, min_keep=None, loaded 30782, skipped 0 short and 0 long and 0 unaligned, longest-loaded=155, shortest-loaded=12
[2022-02-18 23:06:36,112][avhubert.hubert_dataset][INFO] - /gs/hs0/tga-tslab/bowen/Dataset/LRS3/30h_data/train.wrd is sequence label. skipped
[2022-02-18 23:06:36,112][avhubert.hubert_dataset][INFO] - image transform: Compose(
Normalize(mean=0.0, std=255.0)
RandomCrop(size=(88, 88))
<avhubert.utils.HorizontalFlip object at 0x2aadf80fcfd0>
Normalize(mean=0.421, std=0.165)
)
[2022-02-18 23:06:36,113][avhubert.hubert_dataset][INFO] - pad_audio=True, random_crop=False, normalize=True, max_sample_size=500, seqs2seq data=True,
[2022-02-18 23:06:36,113][avhubert.hubert_dataset][INFO] - Noise wav: None->0 wav, Prob: 0.0, SNR: 0, Number of mixture: 1
[2022-02-18 23:06:38,257][fairseq.trainer][INFO] - begin training epoch 1
[2022-02-18 23:06:38,258][fairseq_cli.train][INFO] - Start iterating over samples
Do you have any suggestion, my hero @chevalierNoir
This may be an issue related to pytorch. I suggest running a toy example of pytorch distributed data parallel, like the one here, using multiple nodes to check whether it works.
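A minimal all_reduce smoke test along those lines might look like this (a sketch assuming PyTorch with the gloo backend, so no GPUs/NCCL are needed; launch one copy per node via torchrun, or run it directly for a single-process check):

```python
import os
import torch
import torch.distributed as dist

def smoke_test() -> float:
    # torchrun sets these; the defaults make a single-process run possible
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend="gloo")  # CPU-only backend
    t = torch.ones(1)
    dist.all_reduce(t)  # sums the tensor across all workers
    result = t.item()
    print(f"rank {dist.get_rank()}: all_reduce -> {result}")
    dist.destroy_process_group()
    return result

if __name__ == "__main__":
    smoke_test()  # expect world_size * 1.0 once every worker joins
```

If this toy script hangs across nodes the way the fairseq job does, the problem is in the network/rendezvous setup rather than in fairseq itself.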
Was this problem solved? I am actually having the same issue.
Hi guys! I succeeded in using 2 4xGPU nodes with fairseq-hydra-train. Here's how I start the job:
python -m torch.distributed.launch --use_env --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=${ip_addr} --master_port=29500 \
$(which fairseq-hydra-train) --config-dir examples/roberta/config/pretraining --config-name base \
task.data=${data_dir} checkpoint.save_dir=${output_dir}
Hope it will be useful for anyone who is struggling in searching for the answer.
fairseq-hydra-train \
--config-dir ./conf/finetune/ --config-name base_lrs3_30h.yaml \
task.data=${data_dir} task.label_dir=${data_dir} \
task.tokenizer_bpe_model=/gs/hs0/tga-tslab/bowen/Dataset/LRS3/spm1000/spm_unigram1000.model \
model.w2v_path=/gs/hs0/tga-tslab/bowen/av_hubert/pretrain_model/base_lrs3_iter5.pt \
hydra.run.dir=./exps/finetune/dist common.user_dir=`pwd` \
distributed_training.distributed_world_size=${world_size} \
distributed_training.nprocs_per_node=${nproc_per_node} \
optimization.update_freq=[${update_freq}] \
+distributed_training.distributed_init_method="tcp://${master_addr}:12356" \
distributed_training.distributed_rank=$((node_rank * 4))
A variant of this worked for me for training hubert:
fairseq-hydra-train \
--config-dir ./pretrain --config-name hubert_base_2gpu \
task.data=${root}/data/tsv task.label_dir=${root}/data/labels \
task.labels='["km"]' \
model.label_rate=100 \
distributed_training.distributed_world_size=${WORLD_SIZE} \
distributed_training.nprocs_per_node=${NPROC_PER_NODE} \
+distributed_training.distributed_init_method="tcp://${MASTER_ADDR}:${MASTER_PORT}" \
distributed_training.distributed_rank=$((node_rank * 2))
Thank you @18445864529
@weiyx16 Does it work well if I use 2 GPUs on a single machine (server)? I use --nproc_per_node=2 --nnodes=2 for my training, and the process stops here without any further progress:
/home/sungwoo/anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[2024-04-09 18:26:30,963] torch.distributed.run: [WARNING]
[2024-04-09 18:26:30,963] torch.distributed.run: [WARNING] *****************************************
[2024-04-09 18:26:30,963] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-04-09 18:26:30,963] torch.distributed.run: [WARNING] *****************************************
Hi, is there any instruction on multi-node, multi-GPU distributed training with hydra train? https://fairseq.readthedocs.io/en/latest/getting_started.html#distributed-training The fairseq documentation seems to be out of date: hydra does not expect the `local_rank` argument passed by `torch.distributed.launch`. I tried replacing `torch.distributed.launch` with `torchrun`, which solved the `local_rank` issue, but it still didn't seem to make everything correct. Here is the command I tried, and got RuntimeError: Socket Timeout.