Closed serizawa-04013958 closed 7 months ago
Thank you for bringing this to our attention. It's indeed unusual that ddp.py is being invoked, as our codebase does not currently support multi-GPU execution. Could you please check whether the --gpu parameter you've set specifies multiple GPUs? Our system is designed to run with a single GPU, and specifying more than one might lead to unexpected behavior, such as the timeout issue you've encountered.
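For reference, here is a minimal sketch of the check described above. It is illustrative only (the `parse_gpus` helper and the exact `--gpu` flag format are assumptions, not the project's actual code): a comma-separated `--gpu` value with more than one device id is what makes PyTorch Lightning launch DDP workers via ddp.py, which then hang waiting for peers that never start.

```python
# Illustrative sketch (not the project's code): parse a comma-separated
# --gpu value and warn when more than one device id is given, since this
# codebase only supports single-GPU training.
import argparse


def parse_gpus(value: str) -> list[int]:
    """Split a --gpu argument like "0" or "0,1,2,3" into device ids."""
    return [int(v) for v in value.split(",") if v.strip()]


parser = argparse.ArgumentParser()
parser.add_argument("--gpu", default="0")
args = parser.parse_args(["--gpu", "0,1,2,3"])  # simulate a multi-GPU request

gpus = parse_gpus(args.gpu)
if len(gpus) > 1:
    # Multiple ids would make Lightning select a DDP strategy (ddp.py),
    # matching the world_size=4 seen in the reported timeout.
    print(f"warning: {len(gpus)} GPUs requested; this code supports one")
```

With `--gpu 0` the warning is skipped and training runs in a single process, which is the fix confirmed below.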
Thank you for replying! I set a single GPU and the training code worked. Let me close this issue.
Hello, when I try train_repair.py, the process times out, even though I defined maxtime.
======================================== error ==================================

```
Traceback (most recent call last):
  File "/cig/common06nb/deserizk/GaussianObject/train_repair.py", line 189, in <module>
    main(args, extras)
  File "/cig/common06nb/deserizk/GaussianObject/train_repair.py", line 156, in main
    trainer.fit(system, datamodule=dm, ckpt_path=cfg.resume)
  File "/cig/common05nb/deserizk/miniconda3/envs/gs-object/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 545, in fit
    call._call_and_handle_interrupt(
  File "/cig/common05nb/deserizk/miniconda3/envs/gs-object/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/cig/common05nb/deserizk/miniconda3/envs/gs-object/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 102, in launch
    return function(*args, **kwargs)
  File "/cig/common05nb/deserizk/miniconda3/envs/gs-object/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 581, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/cig/common05nb/deserizk/miniconda3/envs/gs-object/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 948, in _run
    self.strategy.setup_environment()
  File "/cig/common05nb/deserizk/miniconda3/envs/gs-object/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 146, in setup_environment
    self.setup_distributed()
  File "/cig/common05nb/deserizk/miniconda3/envs/gs-object/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 197, in setup_distributed
    _init_dist_connection(self.cluster_environment, self._process_group_backend, timeout=self._timeout)
  File "/cig/common05nb/deserizk/miniconda3/envs/gs-object/lib/python3.9/site-packages/lightning_fabric/utilities/distributed.py", line 290, in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/cig/common05nb/deserizk/miniconda3/envs/gs-object/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 932, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/cig/common05nb/deserizk/miniconda3/envs/gs-object/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 469, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)
```