I hit the following error when training the llm model on 2 GPU machines:
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=11.214.25.222 --master_port 9999 --rdzv_id=1986 --rdzv_backend=c10d cosyvoice/bin/train.py --train_engine torch_ddp --config conf/cosyvoice.llm.lfe.yaml --train_data data/zeroshot_large/parquet_phoneme/data.list --cv_data data/zeroshot_dev/parquet_phoneme/data.list --model llm --model_dir CosyVoice/examples/libritts/cosyvoice/exp/cosyvoice_lfe_zeroshot_large/llm/torch_ddp --tensorboard_dir CosyVoice/examples/libritts/cosyvoice/tensorboard/cosyvoice_lfe_zeroshot_large/llm/torch_ddp --ddp.dist_backend nccl --num_workers 2 --prefetch 100 --pin_memory --deepspeed_config ./conf/ds_stage2.json --deepspeed.save_states model+optimizer
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1028, in next_rendezvous
    self._op_executor.run(join_op, deadline)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 638, in run
    raise RendezvousTimeoutError()
torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError
How should I set the torchrun options when multiple GPU nodes are used?
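From the warning above, my guess is that with --rdzv_backend=c10d the rendezvous address has to be passed via --rdzv_endpoint rather than --master_addr/--master_port. Is a launch like the sketch below (run on every node, with only --node_rank changed per machine) the intended setup? Here 11.214.25.222:9999 is just node 0's address and a port I picked, not something from the docs:

torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
  --rdzv_id=1986 --rdzv_backend=c10d --rdzv_endpoint=11.214.25.222:9999 \
  cosyvoice/bin/train.py --train_engine torch_ddp \
  --config conf/cosyvoice.llm.lfe.yaml \
  (same remaining training arguments as in the command above)

Or is there something else I am missing, e.g. making sure the chosen port on node 0 is reachable from the other node?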