I hit the following error when training the llm model on 2 GPU machines:
torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --master_addr=11.214.25.222 --master_port 9999 --rdzv_id=1986 --rdzv_backend=c10d cosyvoice/bin/train.py --train_engine torch_ddp --config conf/cosyvoice.llm.lfe.yaml --train_data data/zeroshot_large/parquet_phoneme/data.list --cv_data data/zeroshot_dev/parquet_phoneme/data.list --model llm --model_dir CosyVoice/examples/libritts/cosyvoice/exp/cosyvoice_lfe_zeroshot_large/llm/torch_ddp --tensorboard_dir CosyVoice/examples/libritts/cosyvoice/tensorboard/cosyvoice_lfe_zeroshot_large/llm/torch_ddp --ddp.dist_backend nccl --num_workers 2 --prefetch 100 --pin_memory --deepspeed_config ./conf/ds_stage2.json --deepspeed.save_states model+optimizer
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 1028, in next_rendezvous
    self._op_executor.run(join_op, deadline)
  File "/usr/local/lib64/python3.8/site-packages/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py", line 638, in run
    raise RendezvousTimeoutError()
torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError
How should I set the torchrun options when multiple GPU nodes are used?
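From the warning above, my guess is that with --rdzv_backend=c10d the rendezvous address has to be passed via --rdzv_endpoint rather than --master_addr/--master_port. Is a launch like the sketch below (run on every node, with only --node_rank changed per machine) the intended setup? Here 11.214.25.222:9999 is just node 0's address and a port I picked, not something from the docs:

torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 \
  --rdzv_id=1986 --rdzv_backend=c10d --rdzv_endpoint=11.214.25.222:9999 \
  cosyvoice/bin/train.py --train_engine torch_ddp \
  --config conf/cosyvoice.llm.lfe.yaml \
  (same remaining training arguments as in the command above)

Or is there something else I am missing, e.g. making sure the chosen port on node 0 is reachable from the other node?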