MIC-DKFZ / nnUNet

Apache License 2.0

Problems with CUDA memory when validation from raw_data. #1771

Closed Overflowu7 closed 6 months ago

Overflowu7 commented 1 year ago

When the training epochs finish and the validation starts, a "CUDA out of memory" error appears. I tried to use two GPUs at the same time, but it didn't seem to help. What can I do to make the validation go through? Can you reduce the amount of memory used during validation to save GPU memory?

2023-10-28 11:23:19.306118: Tried to trace <torch.torch.classes.c10d.ProcessGroup object at 0x557c5be39610> but it is not part of the active trace. Modules that are called during a trace must be registered as submodules of the thing being traced.
This random 80:20 split has 82 training and 21 validation cases.
predicting dataset6_CLINIC_0008_data
2023-10-28 11:23:19.829983: Training done.
2023-10-28 11:23:20.225806: Using splits from existing split file: /home/wu/wyc/nnUNet/DATA/nnUNet_preprocessed/Dataset011_4pelvis/splits_final.json
2023-10-28 11:23:20.226441: The split file contains 5 splits.
2023-10-28 11:23:20.226492: Desired fold for training: 46
2023-10-28 11:23:20.226524: INFO: You requested fold 46 for training but splits contain only 5 folds. I am now creating a random (but seeded) 80:20 split!
2023-10-28 11:23:20.228057: This random 80:20 split has 82 training and 21 validation cases.
2023-10-28 11:23:20.228576: predicting dataset6_CLINIC_0002_data
2023-10-28 11:24:06.188117: predicting dataset6_CLINIC_0009_data
2023-10-28 11:25:16.599443: predicting dataset6_CLINIC_0020_data
2023-10-28 11:26:21.311249: predicting dataset6_CLINIC_0027_data
2023-10-28 11:27:24.793235: predicting dataset6_CLINIC_0055_data
Traceback (most recent call last):
  File "/home/wu/.conda/envs/nnUNet/bin/nnUNetv2_train", line 33, in <module>
    sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
  File "/home/wu/wyc/nnUNet/nnunetv2/run/run_training.py", line 281, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/home/wu/wyc/nnUNet/nnunetv2/run/run_training.py", line 171, in run_training
    mp.spawn(run_ddp,
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 239, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
    while not context.join():
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 160, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/wu/wyc/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1343, in perform_actual_validation
    prediction = predictor.predict_sliding_window_return_logits(data)
  File "/home/wu/wyc/nnUNet/nnunetv2/inference/predict_from_raw_data.py", line 633, in predict_sliding_window_return_logits
    prediction = self._internal_maybe_mirror_and_predict(workon)[0].to(results_device)
  File "/home/wu/wyc/nnUNet/nnunetv2/inference/predict_from_raw_data.py", line 555, in _internal_maybe_mirror_and_predict
    prediction += torch.flip(self.network(torch.flip(x, (3,))), (3,))
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/wyc/nnUNet/nnunetv2/training/nnUNetTrainer/SwinMM/Neo/network_architecture/NexToU.py", line 157, in forward
    return self.decoder(final_tensor_list)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/wyc/nnUNet/nnunetv2/training/nnUNetTrainer/SwinMM/Neo/network_architecture/NexToU_Encoder_Decoder.py", line 321, in forward
    x = self.stages[s](x)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/dynamic_network_architectures/building_blocks/simple_conv_blocks.py", line 137, in forward
    return self.convs(x)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/dynamic_network_architectures/building_blocks/simple_conv_blocks.py", line 71, in forward
    return self.all_modules(x)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 741, in forward
    return F.batch_norm(
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/functional.py", line 2450, in batch_norm
    return torch.batch_norm(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 23.70 GiB total capacity; 19.05 GiB already allocated; 78.56 MiB free; 19.26 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 69, in _wrap
    fn(i, *args)
  File "/home/wu/wyc/nnUNet/nnunetv2/run/run_training.py", line 134, in run_ddp
    nnunet_trainer.perform_actual_validation(npz)
  File "/home/wu/wyc/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1346, in perform_actual_validation
    prediction = predictor.predict_sliding_window_return_logits(data)
  File "/home/wu/wyc/nnUNet/nnunetv2/inference/predict_from_raw_data.py", line 633, in predict_sliding_window_return_logits
    prediction = self._internal_maybe_mirror_and_predict(workon)[0].to(results_device)
  File "/home/wu/wyc/nnUNet/nnunetv2/inference/predict_from_raw_data.py", line 565, in _internal_maybe_mirror_and_predict
    prediction += torch.flip(self.network(torch.flip(x, (2, 3, 4))), (2, 3, 4))
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/wyc/nnUNet/nnunetv2/training/nnUNetTrainer/SwinMM/Neo/network_architecture/NexToU.py", line 157, in forward
    return self.decoder(final_tensor_list)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/wyc/nnUNet/nnunetv2/training/nnUNetTrainer/SwinMM/Neo/network_architecture/NexToU_Encoder_Decoder.py", line 321, in forward
    x = self.stages[s](x)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/dynamic_network_architectures/building_blocks/simple_conv_blocks.py", line 137, in forward
    return self.convs(x)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/dynamic_network_architectures/building_blocks/simple_conv_blocks.py", line 71, in forward
    return self.all_modules(x)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 741, in forward
    return F.batch_norm(
  File "/home/wu/.conda/envs/nnUNet/lib/python3.9/site-packages/torch/nn/functional.py", line 2450, in batch_norm
    return torch.batch_norm(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 114.00 MiB (GPU 0; 23.70 GiB total capacity; 18.61 GiB already allocated; 92.56 MiB free; 19.25 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

yonuyeung commented 1 year ago

I had the same problem. I initially tried using three GPUs, but nothing worked. I ended up switching to a GPU with more memory.

JinYangprominent1994 commented 1 year ago

I have the same problem. Only a few cases were validated before the error.

GregorKoehler commented 11 months ago

Hi @Overflowu7 @yonuyeung and @PromiNent-Jin,

sorry for the late reply to this issue! By default, nnUNet v2 tries to perform the sliding window prediction on the GPU. As this can be memory intensive, there's a try/except which should catch the OOM case. Are you working with the latest version of nnUNet v2?
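For reference, the pattern described here looks roughly like the sketch below. This is a simplified illustration, not the actual nnU-Net code: the helper run_sliding_window and its on_device flag are made-up stand-ins, while torch.cuda.OutOfMemoryError is the real exception type seen in the logs above.

```python
import torch

def predict_with_oom_fallback(run_sliding_window, network, data):
    """Illustrative only: try the all-on-GPU path first, then fall back.

    `run_sliding_window` and its `on_device` flag are hypothetical stand-ins
    for nnU-Net's internal sliding-window machinery.
    """
    try:
        # First attempt: keep patches and the accumulated logits on the GPU.
        return run_sliding_window(network, data, on_device=True)
    except torch.cuda.OutOfMemoryError:
        # If the accumulated logits don't fit, release cached blocks and retry
        # with intermediate results kept in CPU RAM instead.
        torch.cuda.empty_cache()
        return run_sliding_window(network, data, on_device=False)
```

Whether the custom NexToU trainer in the traceback above actually goes through such a fallback path is not clear from the logs alone.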

Overflowu7 commented 11 months ago

Hi @Overflowu7 @yonuyeung and @PromiNent-Jin,

sorry for the late reply to this issue! By default, nnUNet v2 tries to perform the sliding window prediction on the GPU. As this can be memory intensive, there's a try/except which should catch the OOM case. Are you working with the latest version of nnUNet v2?

Yes, I am using the latest version.

ancestor-mithril commented 10 months ago

Hi @Overflowu7 @yonuyeung and @PromiNent-Jin, sorry for the late reply to this issue! By default, nnUNet v2 tries to perform the sliding window prediction on the GPU. As this can be memory intensive, there's a try/except which should catch the OOM case. Are you working with the latest version of nnUNet v2?

Yes, I am using the latest version.

Your traceback doesn't look like it comes from v2.2.

File "/home/wu/wyc/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1346, in perform_actual_validation
prediction = predictor.predict_sliding_window_return_logits(data)

In v2.2, this is at line 1157: https://github.com/MIC-DKFZ/nnUNet/blob/8f92709af2dd8c4cc02d2ec3d70e861b005059da/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py#L1157

GregorKoehler commented 6 months ago

Hi @Overflowu7, can you confirm if the problem persists if you update to v2.2?

Overflowu7 commented 6 months ago

The problem was solved when I updated to the new version.


GregorKoehler commented 6 months ago

Thank you for commenting. I'm glad the issue got resolved!

Closing this issue then :)