Project-MONAI / research-contributions

Implementations of recent research prototypes/demonstrations using MONAI.
https://monai.io/
Apache License 2.0

Problems loading SSL pretrained weights when training is distributed #169

Closed · constantinulrich closed this issue 1 year ago

constantinulrich commented 1 year ago

Hi,

If I train SwinUNETR on BTCV distributed, I cannot load the provided self-supervised weights. It works for training on a single GPU but not on multiple GPUs. Any ideas? Which CUDA and PyTorch versions do you recommend? Thank you!

Traceback (most recent call last):
  File "/cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/c306h/SwinUNETR/research-contributions/SwinUNETR/BTCV/main.py", line 156, in main_worker
    model_dict = torch.load(os.path.join(root_dir, "pretrained_models/model_swinvit.pt"))
  File "/cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 607, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 882, in _load
    result = unpickler.load()
  File "/cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 857, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "/cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 846, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "/cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 157, in _cuda_deserialize
    return obj.cuda(device)
  File "/cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/_utils.py", line 79, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File "/cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/cuda/__init__.py", line 528, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

tangy5 commented 1 year ago

Hi @constantinulrich, thanks for reporting this. Swin UNETR should be compatible with multi-GPU training. The pretrained weights are loaded into the model here:

https://github.com/Project-MONAI/research-contributions/blob/1dde7a112c69785b4fb5cd58d5b73f3fa5bdc577/SwinUNETR/BTCV/main.py#L145 and then pushed to each GPU by:

https://github.com/Project-MONAI/research-contributions/blob/1dde7a112c69785b4fb5cd58d5b73f3fa5bdc577/SwinUNETR/BTCV/main.py#L204
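
For reference, a minimal sketch (not the repository's exact code) of loading the checkpoint with an explicit map_location, so each spawned worker deserializes the weights onto the CPU or its own device rather than the GPU recorded in the checkpoint file; root_dir and gpu below are assumed stand-ins for the corresponding variables in main_worker:

```python
import os

import torch

# Assumed stand-ins for variables that main_worker already has in scope.
root_dir = "./"  # directory containing pretrained_models/
gpu = 0          # rank of this spawned worker

checkpoint_path = os.path.join(root_dir, "pretrained_models/model_swinvit.pt")

# map_location="cpu" keeps deserialization off the GPU entirely; the model is
# moved to the right device afterwards when it is pushed to each GPU.
model_dict = torch.load(checkpoint_path, map_location="cpu")

# Alternative: deserialize straight onto this worker's own device.
# model_dict = torch.load(checkpoint_path, map_location=f"cuda:{gpu}")
```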

Regarding "RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable": looking at the logs, it might be that some GPU devices are occupied by other threads. Can you double-check this? If the problem still exists, can you run with CUDA_LAUNCH_BLOCKING=1 and see whether there is more detailed information we can dig into?
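
As a quick standalone probe (not part of the repository), something like the following sets CUDA_LAUNCH_BLOCKING before any CUDA call and asks each visible device for its free memory. Note that torch.cuda.mem_get_info needs a reasonably recent PyTorch, and a device held by another job in exclusive-process mode will typically fail here with the same "busy or unavailable" error:

```python
import os

# Must be set before the first CUDA call in the process so kernel launches are
# synchronous and the failing call shows up at the right place in the trace.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

print("visible CUDA devices:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    # Touching the device initializes a CUDA context; on a GPU that another
    # job holds exclusively this raises the same
    # "all CUDA-capable devices are busy or unavailable" error.
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```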

Thanks.

constantinulrich commented 1 year ago

Hi,

thanks for your fast reply. I tested it locally and it works (also distributed). It must be something wrong with my cluster environment.

Running with CUDA_LAUNCH_BLOCKING=1 does not really help:

/cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "/home/c306h/SwinUNETR/research-contributions/SwinUNETR/BTCV/main.py", line 259, in <module>
    main()
  File "/home/c306h/SwinUNETR/research-contributions/SwinUNETR/BTCV/main.py", line 107, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args,))
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/c306h/SwinUNETR/research-contributions/SwinUNETR/BTCV/main.py", line 156, in main_worker
    model_dict = torch.load(os.path.join(root_dir, "pretrained_models/model_swinvit.pt"))
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 1049, in _load
    result = unpickler.load()
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 1019, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 1001, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 157, in _cuda_deserialize
    return obj.cuda(device)
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/_utils.py", line 78, in _cuda
    return torch.UntypedStorage(self.size(), device=torch.device('cuda')).copy_(self, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable