MLI-lab / DeepDeWedge

Self-supervised deep learning for denoising and missing wedge reconstruction of cryo-ET tomograms
BSD 2-Clause "Simplified" License

DistStoreError when using 4 GPUs #20

Open SimWdm opened 1 day ago

SimWdm commented 1 day ago

@NKUashin reported the following problem: [...] I am running into the same bug: multi-GPU fitting does not work on my own data, but it works well on the tutorial data. My config.yaml file is:

```yaml
shared:
  project_dir: "./project"
  tomo0_files:

prepare_data:
  mask_files:
  min_nonzero_mask_fraction_in_subtomo: 0.3
  subtomo_extraction_strides: [60, 60, 60]
  val_fraction: 0.2

fit_model:
  unet_params_dict:
    chans: 64
    num_downsample_layers: 3
    drop_prob: 0.0
  adam_params_dict:
    lr: 0.0004
  num_epochs: 1000
  batch_size: 1
  update_subtomo_missing_wedges_every_n_epochs: 10
  check_val_every_n_epochs: 10
  save_n_models_with_lowest_val_loss: 5
  save_n_models_with_lowest_fitting_loss: 5
  save_model_every_n_epochs: 50
  logger: "csv"

refine_tomogram:
  model_checkpoint_file:
  subtomo_overlap: 32
  batch_size: 10
```

However, everything is fine when using 1 GPU on my own data. Here is the error reported when I try multi-GPU fitting (4 GPUs) on my data.

```
[rank: 0] Global seed set to 42
Missing logger folder: project/logs
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
[rank: 0] Global seed set to 42
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Traceback (most recent call last):
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/ddw/fit_model.py:272 in fit_model
    trainer.validate(lit_unet, val_dataloader)
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:662 in validate
    return call._call_and_handle_interrupt(
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py:38 in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:711 in _validate_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:994 in _run
    self.strategy.setup_environment()
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py:153 in setup_environment
    self.setup_distributed()
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/strategies/ddp.py:204 in setup_distributed
    _init_dist_connection(self.cluster_environment, self.process...
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/lightning_lite/utilities/distributed.py:237 in _init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, ra...
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/torch/distributed/c10d_logger.py:86 in wrapper
    func_return = func(*args, **kwargs)
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py:1177 in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/torch/distributed/rendezvous.py:246 in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_s...
  ~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/torch/distributed/rendezvous.py:174 in _create_c10d_store
    return TCPStore(
DistStoreError: Timed out after 1801 seconds waiting for clients. 1/4 clients joined.
```

Do you have any suggestions? Thanks! Young

Originally posted by @NKUashin in https://github.com/MLI-lab/DeepDeWedge/issues/19#issuecomment-2513471518

SimWdm commented 1 day ago

Hi @NKUashin,

let's continue our discussion here 🙂

> DistStoreError: Timed out after 1801 seconds waiting for clients. 1/4 clients joined.

It seems that something went wrong with "connecting" (?) the GPUs. But then it's strange that multi-GPU fitting worked with the tutorial data.
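As a quick sanity check, independent of DDW, you could try a minimal multi-process GPU test on the node. The snippet below is only a sketch of such a test (plain PyTorch plus `torchrun`, which ships with recent PyTorch; the file name `ddp_smoke_test.py` is just a placeholder):

```bash
# Minimal multi-process GPU smoke test, independent of DDW.
# Assumes an interactive shell on the node with all four GPUs visible
# and `torchrun` available in the environment (bundled with PyTorch >= 1.10).
cat > ddp_smoke_test.py <<'EOF'
import os
import torch
import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR and MASTER_PORT.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
print(f"rank {dist.get_rank()}/{dist.get_world_size()} is up on cuda:{local_rank}", flush=True)
dist.barrier()
dist.destroy_process_group()
EOF

torchrun --standalone --nproc_per_node=4 ddp_smoke_test.py
```

If this also hangs or times out, the problem is with the distributed setup on the node rather than with DDW itself.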

Are you running DDW on a SLURM cluster or anything similar? If so, could you please share how you configured the job?

Best, Simon

NKUashin commented 1 day ago

> Hi @NKUashin,
>
> let's continue our discussion here 🙂
>
> > DistStoreError: Timed out after 1801 seconds waiting for clients. 1/4 clients joined.
>
> It seems that something went wrong with "connecting" (?) the GPUs. But then it's strange that multi-GPU fitting worked with the tutorial data.
>
> Are you running DDW on a SLURM cluster or anything similar? If so, could you please share how you configured the job?
>
> Best, Simon

Yes. When I worked with the tutorial data, the output shown below appeared and fitting ran well on all 4 GPUs. I didn't change any parameters in config.yaml.

```
[rank: 0] Global seed set to 42
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/torch/utils/data/dataloader.py:558: UserWarning: This DataLoader will create 10 worker processes in total. Our suggested max number of worker in current system is 4, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:85: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 5. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
  warning_cache.warn(
~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/utilities/data.py:85: UserWarning: Trying to infer the batch_size from an ambiguous collection. The batch size we found is 1. To avoid any miscalculations, use self.log(..., batch_size=batch_size).
  warning_cache.warn(
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]

  | Name | Type   | Params
--------------------------------
0 | unet | Unet3D | 27.3 M
--------------------------------
27.3 M    Trainable params
2         Non-trainable params
27.3 M    Total params
109.289   Total estimated model params size (MB)
SLURM auto-requeueing enabled. Setting signal handlers.
~/.conda/envs/ddw_env_multigpu/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py:1555: PossibleUserWarning: The number of training batches (29) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
  rank_zero_warn(
Computing model-input normalization statistics: 100%|██████████| 29/29 [01:45<00:00,  3.63s/it]
.........
```


Besides, I'm using a SLURM cluster; my GPU info and SLURM submit script are below:

nvidia-smi: [screenshot attachment]

slurm script:

```bash
#!/bin/bash
#SBATCH --ntasks-per-node=4
#SBATCH --partition=gpu-c
#SBATCH --error=./fit_run.err
#SBATCH --output=./fit_run.out
#SBATCH --time=96:0:0
#SBATCH --gres=gpu:tesla:4

cd ~/project/ddw_multigpu_test/
ddw fit-model --config ./config.yaml
```


Could it be that CUDA 11.8 has some bugs on our cluster?
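For completeness, one way to print the CUDA and NCCL versions that this PyTorch build actually uses (plain PyTorch calls, nothing DDW-specific):

```bash
python -c "import torch; print('torch', torch.__version__, '| CUDA', torch.version.cuda, '| NCCL', torch.cuda.nccl.version())"
```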

Thanks, Young

SimWdm commented 1 day ago

Hi Young,

thanks for sharing these details!

Ricardo (who raised the previous issue) has now managed to get multi-GPU fitting to work on his SLURM cluster. Aside from setting --ntasks-per-node to the number of GPUs you want to use (which you have already done 👍🏼), he mentions that he had to precede the ddw fit-model command with srun. Could you try that?
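For reference, with that change your submit script would look roughly like this; the directives are exactly the ones from your script, and only the last line gains the `srun` prefix (standard SLURM, nothing DDW-specific):

```bash
#!/bin/bash
#SBATCH --ntasks-per-node=4
#SBATCH --partition=gpu-c
#SBATCH --error=./fit_run.err
#SBATCH --output=./fit_run.out
#SBATCH --time=96:0:0
#SBATCH --gres=gpu:tesla:4

cd ~/project/ddw_multigpu_test/
# srun launches one task per GPU; PyTorch Lightning then reads the rank and
# world size from the SLURM environment variables of each task.
srun ddw fit-model --config ./config.yaml
```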

Also, would it be possible for you to try multi-GPU fitting directly on a server, without using SLURM? This could also give us a more usable error trace.
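Concretely, that would just mean running the same command from an interactive shell on the GPU node, for example (CUDA_VISIBLE_DEVICES is a standard CUDA variable, not a DDW option):

```bash
cd ~/project/ddw_multigpu_test/
# Make all four GPUs visible to the process and run the same command
# as in the SLURM script, just without the scheduler in between.
CUDA_VISIBLE_DEVICES=0,1,2,3 ddw fit-model --config ./config.yaml
```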

I have to apologise that I don't have much (any) experience with SLURM.

Best, Simon