SimWdm opened 1 day ago
Hi @NKUashin,
let's continue our discussion here 🙂
DistStoreError: Timed out after 1801 seconds waiting for clients. 1/4 clients
It seems that something went wrong with "connecting" (?) the GPUs. But then it's strange that multi GPU fitting worked with the tutorial data.
Are you running DDW on a SLURM cluster or anything similar? If so, could you please share how you configured the job?
Best, Simon
[rank: 0] Global seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
warnings.warn(_create_warning_msg(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:85: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 5. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
  warning_cache.warn(
~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:85: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
  warning_cache.warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
| Name | Type | Params
0 | unet | Unet3D | 27.3 M
27.3 M    Trainable params
2         Non-trainable params
27.3 M    Total params
109.289   Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.
~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1555: PossibleUserWarning: The number of training batches (29) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Computing model-input normalization statistics: 100%|██████████| 29/29 [01:45<00:00, 3.63s/it]
.........
nvidia-smi:
cd ~/project/ddw_multigpu_test/
ddw fit-model --config ./config.yaml
Maybe cuda-11.8 has some bugs on our cluster?
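A quick way to check what this environment actually sees (a minimal check, assuming the ddw_env_multigpu environment is active):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available(), torch.cuda.device_count())"

This prints the installed PyTorch version, the CUDA version it was built against, and the number of GPUs PyTorch can see.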
Thanks, Young
Hi Young,
thanks for sharing these details!
Ricardo (who raised the previous issue) has now managed to get multi GPU fitting to work on his SLURM cluster. Aside from setting --ntasks-per-node to the number of GPUs you want to use (which you have already done 👍🏼), he mentions that he has to precede the ddw fit-model command with srun. Could you try that?
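For reference, a minimal sketch of what such a SLURM job script could look like (the job name and exact resource requests are placeholders, not settings Ricardo shared):

#!/bin/bash
#SBATCH --job-name=ddw_fit         # placeholder job name
#SBATCH --nodes=1                  # single node
#SBATCH --ntasks-per-node=4        # one task per GPU
#SBATCH --gres=gpu:4               # request 4 GPUs on the node

cd ~/project/ddw_multigpu_test/
srun ddw fit-model --config ./config.yaml

The key points are that --ntasks-per-node matches the number of GPUs and that the ddw fit-model call is launched via srun, so SLURM starts one process per GPU.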
Also, would it be possible for you to try multi GPU fitting directly on a server, without using SLURM? This could also potentially give us a more usable error trace.
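If you do run it directly on the node, turning on the PyTorch/NCCL debug output might make the failure point more visible; both variables below are standard PyTorch/NCCL environment variables, not DDW options:

NCCL_DEBUG=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL ddw fit-model --config ./config.yaml

Since the timeout happens while waiting for the other ranks to join the rendezvous (1/4 clients joined), it is also worth checking whether all four processes are actually being started.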
I have to apologise that I don't have much (any) experience with SLURM myself.
Best, Simon
@NKUashin reported the following problem: [...] I run into the same bug: multi-GPU fitting does not work on my own data, but it works well on the tutorial data. My config.yaml file is:
shared:
  project_dir: "./project"
  tomo0_files:

prepare_data:
  mask_files:
  min_nonzero_mask_fraction_in_subtomo: 0.3
  subtomo_extraction_strides: [60, 60, 60]
  val_fraction: 0.2

fit_model:
  unet_params_dict:
    chans: 64
    num_downsample_layers: 3
    drop_prob: 0.0
  adam_params_dict:
    lr: 0.0004
  num_epochs: 1000
  batch_size: 1
  update_subtomo_missing_wedges_every_n_epochs: 10
  check_val_every_n_epochs: 10
  save_n_models_with_lowest_val_loss: 5
  save_n_models_with_lowest_fitting_loss: 5
  save_model_every_n_epochs: 50
  logger: "csv"

refine_tomogram:
  model_checkpoint_file:
  subtomo_overlap: 32
  batch_size: 10
However, it works fine with 1 GPU on my own data. Here is the error report when I try multi-GPU (4 GPUs) fitting on my data.
[rank: 0] Global seed set to 42
Missing logger folder: project/logs
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Traceback (most recent call last):
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/ddw/fit_model.py:272 in fit_model
    trainer.validate(lit_unet, val_dataloader)
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:662 in validate
    return call._call_and_handle_interrupt(
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:38 in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:711 in _validate_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:994 in _run
    self.strategy.setup_environment()
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py:153 in setup_environment
    self.setup_distributed()
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py:204 in setup_distributed
    _init_dist_connection(self.cluster_environment, self.process...
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/lightning_lite/utilities/distributed.py:237 in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, ra...
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:86 in wrapper
    func_return = func(*args, **kwargs)
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1177 in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/torch/distributed/rendezvous.py:246 in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_s...
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/torch/distributed/rendezvous.py:174 in _create_c10d_store
    return TCPStore(
      hostname, port, world_size, start_daemon, timeout, multi_t...
    )
DistStoreError: Timed out after 1801 seconds waiting for clients. 1/4 clients joined.

Do you have any suggestions? Thanks!
Young
Originally posted by @NKUashin in https://github.com/MLI-lab/DeepDeWedge/issues/19#issuecomment-2513471518