Hi @constantinulrich, thanks for reporting this. Swin UNETR should be compatible with multi-GPU training: the pretrained weights are loaded into the model here,
https://github.com/Project-MONAI/research-contributions/blob/1dde7a112c69785b4fb5cd58d5b73f3fa5bdc577/SwinUNETR/BTCV/main.py#L145
and the model is then pushed to each GPU inside each spawned worker.
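A rough sketch of that pattern (not the exact repository code; the constructor arguments, checkpoint path, and `init_method` address are placeholders), assuming one spawned process per GPU via `torch.multiprocessing.spawn`:

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
from monai.networks.nets import SwinUNETR


def main_worker(gpu, args):
    # One spawned process per GPU: bind this worker to its device and join the process group.
    torch.cuda.set_device(gpu)
    dist.init_process_group(
        backend="nccl",
        init_method="tcp://127.0.0.1:23456",  # placeholder address/port
        world_size=args.ngpus_per_node,
        rank=gpu,
    )

    model = SwinUNETR(img_size=(96, 96, 96), in_channels=1, out_channels=14, feature_size=48)

    # Load the self-supervised pretrained weights into the model ...
    weights = torch.load("pretrained_models/model_swinvit.pt")
    model.load_from(weights=weights)

    # ... then push the model to this worker's GPU and wrap it for distributed training.
    model.cuda(gpu)
    model = DistributedDataParallel(model, device_ids=[gpu])
    return model
```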
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
Looking at the logs, it might be that some of the GPU devices are occupied by other processes. Could you double-check this? If the problem still exists, could you run with CUDA_LAUNCH_BLOCKING=1 and see whether there is more detailed information we can dig into.
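As a quick sanity check (just a suggestion, not part of the repository), you can also verify that every GPU in the allocation is actually free and reachable from the training environment:

```python
import torch

# Try to touch every visible GPU; on a device that is busy or in
# exclusive-process compute mode this raises a RuntimeError similar to
# "all CUDA-capable devices are busy or unavailable".
for i in range(torch.cuda.device_count()):
    torch.zeros(1, device=f"cuda:{i}")
    print(f"cuda:{i} ({torch.cuda.get_device_name(i)}) is reachable")
```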
Thanks.
Hi,
thanks for your fast reply. I tested it locally and it works (also distributed), so something must be wrong with my cluster environment.
Running with CUDA_LAUNCH_BLOCKING=1 does not really help:
/cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/multiprocessing/semaphore_tracker.py:144: UserWarning: semaphore_tracker: There appear to be 1 leaked semaphores to clean up at shutdown
  len(cache))
Traceback (most recent call last):
  File "/home/c306h/SwinUNETR/research-contributions/SwinUNETR/BTCV/main.py", line 259, in <module>
    main()
  File "/home/c306h/SwinUNETR/research-contributions/SwinUNETR/BTCV/main.py", line 107, in main
    mp.spawn(main_worker, nprocs=args.ngpus_per_node, args=(args,))
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 240, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 198, in start_processes
    while not context.join():
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/c306h/SwinUNETR/research-contributions/SwinUNETR/BTCV/main.py", line 156, in main_worker
    model_dict = torch.load(os.path.join(root_dir, "pretrained_models/model_swinvit.pt"))
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 1049, in _load
    result = unpickler.load()
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 1019, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 1001, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/serialization.py", line 157, in _cuda_deserialize
    return obj.cuda(device)
  File "//cluster/gpu/data/OE0441/c306h/alternative_env/lib/python3.7/site-packages/torch/_utils.py", line 78, in _cuda
    return torch.UntypedStorage(self.size(), device=torch.device('cuda')).copy(self, non_blocking)
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
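For reference, the last frames show that torch.load is restoring the checkpoint tensors directly onto a CUDA device inside the spawned worker. A common workaround (just an assumption on my side, not necessarily the intended fix here) is to map the checkpoint to the CPU and move the model to the GPU afterwards, roughly:

```python
import torch
from monai.networks.nets import SwinUNETR


def load_pretrained(model: SwinUNETR, gpu: int) -> SwinUNETR:
    # Deserialize the checkpoint onto the CPU so torch.load does not try to
    # allocate memory on a GPU that is busy or in exclusive-process compute mode.
    model_dict = torch.load("pretrained_models/model_swinvit.pt", map_location="cpu")
    model.load_from(weights=model_dict)
    # Move the model to this worker's GPU only after the weights are loaded.
    return model.cuda(gpu)
```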
Hi,
if I train SwinUNETR on BTCV distributed (multi-GPU), I cannot load the provided self-supervised weights. It works for training on a single GPU but not on multiple GPUs. Any ideas? Which CUDA and PyTorch versions do you recommend? Thank you!