Open 18445864529 opened 2 years ago
Hi,
On Slurm you can do srun --nodes=${nnodes} --gpus-per-node=${ngpus_per_node} fairseq-hydra-train --args.
Unfortunately, Slurm is not installed on our cluster, and I don't have root privileges to set it up. Are there any other startup methods, e.g. torchrun, that work with fairseq-hydra-train?
I think it should be similar to running a usual PyTorch multi-node application:
torchrun --nnodes=${nnode} \
    --nproc_per_node=${nproc_per_node} \
    --rdzv_id=$JOB_ID \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$HOST_NODE_ADDR \
    fairseq-hydra-train --args
where you need to specify the other arguments such as HOST_NODE_ADDR.
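As a quick sanity check of a torchrun launch, a minimal script (hypothetical, not part of fairseq) can print the ranks that the rendezvous assigned; torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to every worker it spawns:

```python
import os

def torchrun_ranks(env=os.environ):
    """torchrun exports RANK, LOCAL_RANK and WORLD_SIZE to every worker;
    fall back to single-process defaults when launched directly."""
    return (int(env.get("RANK", 0)),
            int(env.get("LOCAL_RANK", 0)),
            int(env.get("WORLD_SIZE", 1)))

rank, local_rank, world_size = torchrun_ranks()
print(f"global rank {rank} / {world_size}, local rank {local_rank}")
```

Running one copy of this per node under torchrun should show every global rank exactly once before trying the full training job.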
Btw, when you override the distributed_training arguments in fairseq:
+override.distributed_training.distributed_world_size=${world_size} \
+override.distributed_training.nprocs_per_node=${nproc_per_node} \
+override.distributed_training.distributed_port=12356 \
It should be:
distributed_training.distributed_world_size=${world_size} \
distributed_training.nprocs_per_node=${nproc_per_node} \
distributed_training.distributed_port=12356 \
Thank you for the reply, it's very kind of you! I'll try again tomorrow. I thought it should be +override.*** when the argument already exists in the yaml, and without +override when it does not (as you suggested in another issue). Was I wrong?
If `key` is in the yaml, just do `key=...` on the command line. If `key` is not in the yaml, use `+key=...`.
`override` is one key we added in the decoding config, which is only used at test time.
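The `key=` vs `+key=` rule can be sketched with a plain dict (purely illustrative, not Hydra's actual implementation): an existing key may be overridden directly, while a key absent from the yaml must be added explicitly, which is what the `+` prefix signals.

```python
def apply_override(cfg: dict, key: str, value, allow_add: bool = False):
    """Mimics Hydra's rule: `key=value` may only touch keys already in the
    yaml; `+key=value` (allow_add=True) is required to introduce a new key."""
    *path, leaf = key.split(".")
    node = cfg
    for p in path:
        node = node[p]
    if leaf not in node and not allow_add:
        raise KeyError(f"'{key}' not in config; use +{key}=... to add it")
    node[leaf] = value
    return cfg

cfg = {"distributed_training": {"distributed_port": -1}}
# key exists in the yaml: plain override
apply_override(cfg, "distributed_training.distributed_port", 12356)
# key absent from the yaml: needs the + form
apply_override(cfg, "distributed_training.distributed_init_method",
               "tcp://127.0.0.1:12356", allow_add=True)
```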
Clear to me now, thanks again for the clarification. 👍 Will try distributed training again tomorrow; hopefully it will work.
Really frustrating, I've been working on this for a whole day and I just couldn't make it work. :-< Here is what I do (I put the port number 12356 in the YAML):
torchrun --nnodes=$nnodes --nproc_per_node=$nproc_per_node \
--rdzv_backend=c10d --rdzv_endpoint=${master_addr}:12356 --rdzv_id=$node_rank \
$(which fairseq-hydra-train) \
--config-dir ./conf/finetune/ --config-name base_lrs3_30h.yaml \
task.data=${data_dir} task.label_dir=${data_dir} \
task.tokenizer_bpe_model=/gs/hs0/tga-tslab/bowen/Dataset/LRS3/spm1000/spm_unigram1000.model \
model.w2v_path=/gs/hs0/tga-tslab/bowen/av_hubert/pretrain_model/base_lrs3_iter5.pt \
hydra.run.dir=./exps/finetune/dist common.user_dir=`pwd` \
distributed_training.distributed_world_size=${world_size} \
distributed_training.nprocs_per_node=${nproc_per_node} \
optimization.update_freq=[${update_freq}]
and also adding the line `cfg.distributed_training.device_id = int(os.environ["LOCAL_RANK"])` to `call_main()` in distributed/utils.py, as the project can no longer accept the `--local_rank` argument from torch.distributed.launch. (It turns out the same error occurs regardless of this line.)
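For reference, reading the local rank from the environment torchrun provides can be sketched as follows (a hypothetical helper, not fairseq's actual code):

```python
import os

def resolve_device_id(default: int = 0) -> int:
    """torchrun exports LOCAL_RANK per worker (torch.distributed.launch used
    to pass --local_rank instead); fall back for single-process runs."""
    return int(os.environ.get("LOCAL_RANK", default))

# e.g. inside call_main():
#   cfg.distributed_training.device_id = resolve_device_id()
```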
And then, this is what I got for the master node:
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{
"message": {
"message": "RendezvousTimeoutError: ",
"extraInfo": {
"py_callstack": "Traceback (most recent call last):\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 345, in wrapper\n return f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py\", line 719, in main\n run(args)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py\", line 710, in run\n elastic_launch(\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 131, in __call__\n return launch_agent(self._config, self._entrypoint, list(args))\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 252, in launch_agent\n result = agent.run()\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 709, in run\n result = self._invoke_run(role)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 837, in _invoke_run\n self._initialize_workers(self._worker_group)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 678, in _initialize_workers\n self._rendezvous(worker_group)\n File 
\"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 538, in _rendezvous\n store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 1024, in next_rendezvous\n self._op_executor.run(join_op, deadline)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 637, in run\n raise RendezvousTimeoutError()\ntorch.distributed.elastic.rendezvous.api.RendezvousTimeoutError\n",
"timestamp": "1645101017"
}
}
}
Traceback (most recent call last):
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
result = agent.run()
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 837, in _invoke_run
self._initialize_workers(self._worker_group)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
self._rendezvous(worker_group)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 538, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1024, in next_rendezvous
self._op_executor.run(join_op, deadline)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 637, in run
raise RendezvousTimeoutError()
torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError
Finished
and this is for the slave node:
WARNING:torch.distributed.elastic.rendezvous.dynamic_rendezvous:The node '******' has failed to shutdown the rendezvous '1' due to an error of type RendezvousConnectionError.
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{
"message": {
"message": "RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.",
"extraInfo": {
"py_callstack": "Traceback (most recent call last):\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 113, in _call_store\n return getattr(self._store, store_op)(*args, **kwargs)\nRuntimeError: Connection reset by peer\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 345, in wrapper\n return f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py\", line 719, in main\n run(args)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py\", line 710, in run\n elastic_launch(\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 131, in __call__\n return launch_agent(self._config, self._entrypoint, list(args))\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 252, in launch_agent\n result = agent.run()\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 709, in run\n result = self._invoke_run(role)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 837, in _invoke_run\n self._initialize_workers(self._worker_group)\n File 
\"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 678, in _initialize_workers\n self._rendezvous(worker_group)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py\", line 125, in wrapper\n result = f(*args, **kwargs)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py\", line 538, in _rendezvous\n store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 1024, in next_rendezvous\n self._op_executor.run(join_op, deadline)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 606, in run\n has_set = self._state_holder.sync()\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py\", line 408, in sync\n get_response = self._backend.get_state()\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 73, in get_state\n base64_state: bytes = self._call_store(\"get\", self._key)\n File \"/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 115, in _call_store\n raise RendezvousConnectionError(\ntorch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. 
See inner exception for details.\n",
"timestamp": "1645101018"
}
}
}
Traceback (most recent call last):
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 113, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
return f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
run(args)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
elastic_launch(
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 252, in launch_agent
result = agent.run()
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
result = self._invoke_run(role)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 837, in _invoke_run
self._initialize_workers(self._worker_group)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
self._rendezvous(worker_group)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
result = f(*args, **kwargs)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 538, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1024, in next_rendezvous
self._op_executor.run(join_op, deadline)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 606, in run
has_set = self._state_holder.sync()
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 408, in sync
get_response = self._backend.get_state()
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 73, in get_state
base64_state: bytes = self._call_store("get", self._key)
File "/gs/hs0/tga-tslab/bowen/Anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 115, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
I googled every relevant question but still didn't find a clear solution. Now I'm not sure where to go next. >_<
Hi,
Several things here:
1. `rdzv_id` should be set to the job id, which is shared by all nodes.
2. `fairseq-hydra-train` should be replaced with the python file name, fairseq/fairseq_cli/hydra_train.py.
Btw, I don't think you need to change anything in distributed/utils.py.
I tested a multi-node setup on a single machine with two GPUs, and below is how I ran it:
Master:
CUDA_VISIBLE_DEVICES=0 torchrun --nnodes=2 --nproc_per_node=1 --rdzv_id=avhubert --rdzv_backend=c10d --rdzv_endpoint="127.0.0.1:12345" `pwd`/../fairseq/fairseq_cli/hydra_train.py --args
Slave:
CUDA_VISIBLE_DEVICES=1 torchrun --nnodes=2 --nproc_per_node=1 --rdzv_id=avhubert --rdzv_backend=c10d --rdzv_endpoint="127.0.0.1:12345" `pwd`/../fairseq/fairseq_cli/hydra_train.py --args
`rdzv_endpoint` should be changed accordingly in your case.
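The two launch commands above differ only in CUDA_VISIBLE_DEVICES; a tiny helper (hypothetical, for illustration) can assemble the shared torchrun arguments so that master and slave stay in sync:

```python
def torchrun_args(nnodes: int, nproc_per_node: int, rdzv_id: str,
                  endpoint: str) -> list:
    """All nodes must agree on rdzv_id, rdzv_backend and rdzv_endpoint;
    only per-node settings like CUDA_VISIBLE_DEVICES may differ."""
    return [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc_per_node={nproc_per_node}",
        f"--rdzv_id={rdzv_id}",
        "--rdzv_backend=c10d",
        f"--rdzv_endpoint={endpoint}",
    ]

print(" ".join(torchrun_args(2, 1, "avhubert", "127.0.0.1:12345")))
```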
Yeah, the `rdzv_id` was the cause of that error; it should be the same for all nodes. I should've read the docs more carefully.
But I think the line `cfg.distributed_training.device_id = int(os.environ["LOCAL_RANK"])` is necessary when using torchrun: without it, `device_id` will always be 0, resulting in multiple processes being assigned to the same device. (`device_id` is supposed to be received from `--local_rank`, but torchrun no longer passes it, as mentioned here.)
Below is what happens if the local rank is not read from os.environ. (I think it worked in your test case because you had only one process per node and also specified CUDA_VISIBLE_DEVICES=1 for the second.)
RuntimeError: work = default_pg.allreduce([tensor], opts) NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:957, invalid usage, NCCL version 21.0.3
ncclInvalidUsage: This usually reflects invalid usage of NCCL library (such as too many async ops, too many collectives at once, mixing streams in a group, etc)
However, there are still several things here. torchrun somehow always misjudges the master and the slave, initializing the slave node as ranks 0,1,2,3 and the master as 4,5,6,7, finally leading to
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): NCCL error: unhandled system error, NCCL version 21.0.3
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
I kinda gave up on using torchrun and instead let fairseq spawn the processes; to this end I just launch with
fairseq-hydra-train \
--config-dir ./conf/finetune/ --config-name base_lrs3_30h.yaml \
task.data=${data_dir} task.label_dir=${data_dir} \
task.tokenizer_bpe_model=/gs/hs0/tga-tslab/bowen/Dataset/LRS3/spm1000/spm_unigram1000.model \
model.w2v_path=/gs/hs0/tga-tslab/bowen/av_hubert/pretrain_model/base_lrs3_iter5.pt \
hydra.run.dir=./exps/finetune/dist common.user_dir=`pwd` \
distributed_training.distributed_world_size=${world_size} \
distributed_training.nprocs_per_node=${nproc_per_node} \
optimization.update_freq=[${update_freq}] \
+distributed_training.distributed_init_method="tcp://${master_addr}:12356" \
distributed_training.distributed_rank=$((node_rank * 4))
and finally all processes communicated successfully. In this case the added line should be removed, as the local ranks are assigned automatically.
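The rank arithmetic in that launch can be sketched as follows (illustrative only; the 4 in `$((node_rank * 4))` is the number of GPUs per node here):

```python
def global_rank(node_rank: int, nprocs_per_node: int, local_rank: int = 0) -> int:
    """fairseq spawns nprocs_per_node workers per node; distributed_rank on
    the command line is the rank of each node's first (local_rank=0) worker."""
    return node_rank * nprocs_per_node + local_rank

# two nodes with 4 GPUs each -> world_size 8
assert global_rank(0, 4) == 0      # master's first worker
assert global_rank(1, 4) == 4      # second node's first worker
assert global_rank(1, 4, 3) == 7   # second node's last worker
```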
Just as I was feeling very close to success, I got stuck...
After printing the following, no further messages are printed and the processes hang.
[2022-02-18 23:06:35,809][fairseq.utils][INFO] - ***********************CUDA enviroments for all 8 workers***********************
[2022-02-18 23:06:35,809][fairseq.utils][INFO] - rank 0: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,809][fairseq.utils][INFO] - rank 1: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - rank 2: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - rank 3: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - rank 4: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - rank 5: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - rank 6: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - rank 7: capabilities = 6.0 ; total memory = 15.899 GB ; name = Tesla P100-SXM2-16GB
[2022-02-18 23:06:35,810][fairseq.utils][INFO] - ***********************CUDA enviroments for all 8 workers***********************
[2022-02-18 23:06:35,810][fairseq_cli.train][INFO] - training on 8 devices (GPUs/TPUs)
[2022-02-18 23:06:35,811][fairseq_cli.train][INFO] - max tokens per device = 1000 and max sentences per device = None
[2022-02-18 23:06:35,812][fairseq.trainer][INFO] - Preparing to load checkpoint checkpoints/checkpoint_last.pt
[2022-02-18 23:06:35,812][fairseq.trainer][INFO] - No existing checkpoint found checkpoints/checkpoint_last.pt
[2022-02-18 23:06:35,812][fairseq.trainer][INFO] - loading train data for epoch 1
[2022-02-18 23:06:35,813][avhubert.hubert_pretraining][INFO] - Using tokenizer
[2022-02-18 23:06:36,097][avhubert.hubert_dataset][INFO] - max_keep=500, min_keep=None, loaded 30782, skipped 0 short and 0 long and 0 unaligned, longest-loaded=155, shortest-loaded=12
[2022-02-18 23:06:36,112][avhubert.hubert_dataset][INFO] - /gs/hs0/tga-tslab/bowen/Dataset/LRS3/30h_data/train.wrd is sequence label. skipped
[2022-02-18 23:06:36,112][avhubert.hubert_dataset][INFO] - image transform: Compose(
Normalize(mean=0.0, std=255.0)
RandomCrop(size=(88, 88))
<avhubert.utils.HorizontalFlip object at 0x2aadf80fcfd0>
Normalize(mean=0.421, std=0.165)
)
[2022-02-18 23:06:36,113][avhubert.hubert_dataset][INFO] - pad_audio=True, random_crop=False, normalize=True, max_sample_size=500, seqs2seq data=True,
[2022-02-18 23:06:36,113][avhubert.hubert_dataset][INFO] - Noise wav: None->0 wav, Prob: 0.0, SNR: 0, Number of mixture: 1
[2022-02-18 23:06:38,257][fairseq.trainer][INFO] - begin training epoch 1
[2022-02-18 23:06:38,258][fairseq_cli.train][INFO] - Start iterating over samples
Do you have any suggestion, my hero @chevalierNoir
This may be an issue related to pytorch. I suggest running a toy example of pytorch distributed data parallel, like the one here, using multiple nodes to check whether it works.
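A minimal all_reduce smoke test along those lines might look like this (a sketch assuming PyTorch with the gloo backend, so no GPUs/NCCL are needed; launch one copy per node via torchrun, or run it directly for a single-process check):

```python
import os
import torch
import torch.distributed as dist

def smoke_test() -> float:
    # torchrun sets these; the defaults make a single-process run possible
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    os.environ.setdefault("RANK", "0")
    os.environ.setdefault("WORLD_SIZE", "1")
    dist.init_process_group(backend="gloo")  # CPU-only backend
    t = torch.ones(1)
    dist.all_reduce(t)  # sums the tensor across all workers
    result = t.item()
    print(f"rank {dist.get_rank()}: all_reduce -> {result}")
    dist.destroy_process_group()
    return result

if __name__ == "__main__":
    smoke_test()  # expect world_size * 1.0 once every worker joins
```

If this toy script hangs across nodes the way the fairseq job does, the problem is in the network/rendezvous setup rather than in fairseq itself.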
Was this problem solved? I am actually having the same issue.
Hi guys! I succeeded in using 2 4xGPU nodes with fairseq-hydra-train. Here's how I start the job:
python -m torch.distributed.launch --use_env --nproc_per_node=4 --nnodes=2 --node_rank=0 --master_addr=${ip_addr} --master_port=29500 \
$(which fairseq-hydra-train) --config-dir examples/roberta/config/pretraining --config-name base \
task.data=${data_dir} checkpoint.save_dir=${output_dir}
Hope it will be useful for anyone who is struggling in searching for the answer.
fairseq-hydra-train \
--config-dir ./conf/finetune/ --config-name base_lrs3_30h.yaml \
task.data=${data_dir} task.label_dir=${data_dir} \
task.tokenizer_bpe_model=/gs/hs0/tga-tslab/bowen/Dataset/LRS3/spm1000/spm_unigram1000.model \
model.w2v_path=/gs/hs0/tga-tslab/bowen/av_hubert/pretrain_model/base_lrs3_iter5.pt \
hydra.run.dir=./exps/finetune/dist common.user_dir=`pwd` \
distributed_training.distributed_world_size=${world_size} \
distributed_training.nprocs_per_node=${nproc_per_node} \
optimization.update_freq=[${update_freq}] \
+distributed_training.distributed_init_method="tcp://${master_addr}:12356" \
distributed_training.distributed_rank=$((node_rank * 4))
A variant of this worked for me for training hubert:
fairseq-hydra-train \
--config-dir ./pretrain --config-name hubert_base_2gpu \
task.data=${root}/data/tsv task.label_dir=${root}/data/labels \
task.labels='["km"]' \
model.label_rate=100 \
distributed_training.distributed_world_size=${WORLD_SIZE} \
distributed_training.nprocs_per_node=${NPROC_PER_NODE} \
+distributed_training.distributed_init_method="tcp://${MASTER_ADDR}:${MASTER_PORT}" \
distributed_training.distributed_rank=$((node_rank * 2))
Thank you @18445864529
@weiyx16 Does it work well if I use 2 GPUs on a single machine (server)? I use --nproc_per_node=2 --nnodes=2 for my training, and the process stops here without any further progress:
/home/sungwoo/anaconda3/envs/avhubert/lib/python3.8/site-packages/torch/distributed/launch.py:183: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
[2024-04-09 18:26:30,963] torch.distributed.run: [WARNING]
[2024-04-09 18:26:30,963] torch.distributed.run: [WARNING] *****************************************
[2024-04-09 18:26:30,963] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-04-09 18:26:30,963] torch.distributed.run: [WARNING] *****************************************
Hi, is there any instruction on multi-node, multi-GPU distributed training with hydra train? https://fairseq.readthedocs.io/en/latest/getting_started.html#distributed-training The fairseq documentation seems to be out of date: hydra does not expect the `local_rank` argument passed by `torch.distributed.launch`. I tried replacing `torch.distributed.launch` with `torchrun`, which solved the `local_rank` issue, but it still didn't seem to make everything correct. Here is the command I tried, and got RuntimeError: Socket Timeout.