MIC-DKFZ / nnDetection

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.
Apache License 2.0

ERROR while preprocessing #231

Closed karon999 closed 4 months ago

karon999 commented 7 months ago

I have trained the toy dataset successfully, but when I want to train my own dataset, some weird errors occur. I have already run full_check on the dataset and it is okay. Thanks a lot in advance for any help.

Here is the error that occurs while preprocessing:

```
detections_per_img: 100
score_thresh: 0
topk_candidates: 10000
remove_small_boxes: 0.01
nms_thresh: 0.6
2024-02-23 12:47:28.962 | INFO | nndet.planning.estimator:estimate:123 - Found available gpu memory: 16919691264 bytes / 16135.875 mb and estimating for 11511726080 bytes / 10978.4375
2024-02-23 12:47:29.058 | INFO | nndet.planning.estimator:_estimate_mem_available:154 - Estimating in memory.
2024-02-23 12:47:29.058 | INFO | nndet.planning.estimator:measure:193 - Estimating on cuda:0 with shape [1, 64, 224, 192] and batch size 4 and num_instances 5
2024-02-23 12:47:36.843 | INFO | nndet.planning.estimator:measure:242 - Caught error (If out of memory error do not worry): CUDA out of memory. Tried to allocate 488.00 MiB (GPU 0; 15.90 GiB total capacity; 13.86 GiB already allocated; 231.75 MiB free; 14.66 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
2024-02-23 12:47:37.252 | INFO | nndet.planning.estimator:measure:256 - Measured: 0.0 mb empty, inf mb fixed, inf mb dynamic
2024-02-23 12:47:37.394 | INFO | nndet.ptmodule.retinaunet.base:from_config_plan:362 - Architecture overwrites: {} Anchor overwrites: {}
2024-02-23 12:47:37.394 | INFO | nndet.ptmodule.retinaunet.base:from_config_plan:364 - Building architecture according to plan of not_found
2024-02-23 12:47:37.394 | INFO | nndet.ptmodule.retinaunet.base:from_config_plan:367 - Start channels: 32; head channels: 128; fpn channels: 128
2024-02-23 12:47:37.395 | INFO | nndet.core.boxes.anchors:init:288 - Discarding anchor generator kwargs {'stride': 1}
2024-02-23 12:47:37.395 | INFO | nndet.ptmodule.retinaunet.base:_build_encoder:464 - Building:: encoder Encoder: {}
2024-02-23 12:47:37.548 | INFO | nndet.ptmodule.retinaunet.base:_build_decoder:496 - Building:: decoder UFPNModular: {'min_out_channels': 8, 'upsampling_mode': 'transpose', 'num_lateral': 1, 'norm_lateral': False, 'activation_lateral': False, 'num_out': 1, 'norm_out': False, 'activation_out': False}
2024-02-23 12:47:37.575 | INFO | nndet.core.boxes.matcher.atss:init:45 - Running ATSS Matching with num_candidates=4 and center_in_gt False.
2024-02-23 12:47:37.575 | INFO | nndet.ptmodule.retinaunet.base:_build_head_classifier:530 - Building:: classifier BCECLassifier: {'num_convs': 1, 'norm_channels_per_group': 16, 'norm_affine': True, 'reduction': 'mean', 'loss_weight': 1.0, 'prior_prob': 0.01}
2024-02-23 12:47:37.585 | INFO | nndet.arch.heads.classifier:init_weights:215 - Init classifier weights: prior prob 0.01
2024-02-23 12:47:37.593 | INFO | nndet.ptmodule.retinaunet.base:_build_head_regressor:564 - Building:: regressor GIoURegressor: {'num_convs': 1, 'norm_channels_per_group': 16, 'norm_affine': True, 'reduction': 'sum', 'loss_weight': 1.0, 'learn_scale': True}
2024-02-23 12:47:37.604 | INFO | nndet.arch.heads.regressor:build_scales:150 - Learning level specific scalar in regressor
2024-02-23 12:47:37.605 | INFO | nndet.arch.heads.regressor:init_weights:196 - Overwriting regressor conv weight init
2024-02-23 12:47:37.615 | INFO | nndet.ptmodule.retinaunet.base:_build_head:602 - Building:: head DetectionHeadHNMNative: {} sampler HardNegativeSamplerBatched: {'batch_size_per_image': 32, 'positive_fraction': 0.33, 'pool_size': 20, 'min_neg': 1}
2024-02-23 12:47:37.615 | INFO | nndet.core.boxes.sampler:init:235 - Sampling hard negatives on a per batch basis
2024-02-23 12:47:37.615 | INFO | nndet.ptmodule.retinaunet.base:_build_segmenter:638 - Building:: segmenter DiCESegmenterFgBg {'dice_kwargs': {'batch_dice': True, 'smooth_nom': 1e-05, 'smooth_denom': 1e-05, 'do_bg': False}}
2024-02-23 12:47:37.616 | INFO | nndet.losses.segmentation:init:108 - Running batch dice True and do bg False in dice loss.
2024-02-23 12:47:37.616 | INFO | nndet.ptmodule.retinaunet.base:from_config_plan:421 - Model Inference Summary:
detections_per_img: 100
score_thresh: 0
topk_candidates: 10000
remove_small_boxes: 0.01
nms_thresh: 0.6
2024-02-23 12:47:37.724 | INFO | nndet.planning.estimator:estimate:123 - Found available gpu memory: 16919691264 bytes / 16135.875 mb and estimating for 11511726080 bytes / 10978.4375
2024-02-23 12:47:37.805 | INFO | nndet.planning.estimator:_estimate_mem_available:154 - Estimating in memory.
2024-02-23 12:47:37.805 | INFO | nndet.planning.estimator:measure:193 - Estimating on cuda:0 with shape [1, 64, 192, 192] and batch size 4 and num_instances 5
2024-02-23 12:48:07.135 | INFO | nndet.planning.estimator:measure:242 - Caught error (If out of memory error do not worry): cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
```

```python
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 32, 64, 192, 192], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(32, 32, kernel_size=[1, 3, 3], padding=[0, 1, 1], stride=[1, 1, 1], dilation=[1, 1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
```

```
ConvolutionParams
    data_type = CUDNN_DATA_HALF
    padding = [0, 1, 1]
    stride = [1, 1, 1]
    dilation = [1, 1, 1]
    groups = 1
    deterministic = true
    allow_tf32 = true
input: TensorDescriptor 0x7f1bf00f32f0
    type = CUDNN_DATA_HALF
    nbDims = 5
    dimA = 4, 32, 64, 192, 192,
    strideA = 75497472, 2359296, 36864, 192, 1,
output: TensorDescriptor 0x7f1bf00f53e0
    type = CUDNN_DATA_HALF
    nbDims = 5
    dimA = 4, 32, 64, 192, 192,
    strideA = 75497472, 2359296, 36864, 192, 1,
weight: FilterDescriptor 0x7f1bf00f5c10
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 5
    dimA = 32, 32, 1, 3, 3,
Pointer addresses:
    input: 0x7f1998e00000
    output: 0x7f17f8000000
    weight: 0x7f18d47fa000
Additional pointer addresses:
    grad_output: 0x7f17f8000000
    grad_input: 0x7f1998e00000
Backward data algorithm: 3
```

Here is the traceback information:

```
Traceback (most recent call last):
  File "/home/wangpeiyu/anaconda3/envs/nndetection/bin/nndet_prep", line 33, in <module>
    sys.exit(load_entry_point('nndet', 'console_scripts', 'nndet_prep')())
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/utils/check.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/wangpeiyu/nndetection/nnDetection-main/scripts/preprocess.py", line 418, in main
    run(OmegaConf.to_container(cfg, resolve=True),
  File "/home/wangpeiyu/nndetection/nnDetection-main/scripts/preprocess.py", line 347, in run
    run_planning_and_process(
  File "/home/wangpeiyu/nndetection/nnDetection-main/scripts/preprocess.py", line 174, in run_planning_and_process
    plan_identifiers = planner.plan_experiment(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/experiment/v001.py", line 43, in plan_experiment
    plan_3d = self.plan_base_stage(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/experiment/base.py", line 234, in plan_base_stage
    architecture_plan = architecture_planner.plan(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/architecture/boxes/c002.py", line 127, in plan
    res = super().plan(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/architecture/boxes/base.py", line 346, in plan
    patch_size = self._plan_architecture(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/architecture/boxes/c002.py", line 205, in _plan_architecture
    _, fits_in_mem = self.estimator.estimate(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/estimator.py", line 128, in estimate
    res = self._estimate_mem_available(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/estimator.py", line 155, in _estimate_mem_available
    fixed, dynamic = self.measure(shape=target_shape,
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/estimator.py", line 253, in measure
    network.cpu()
  File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 714, in cpu
    return self._apply(lambda t: t.cpu())
  File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 574, in _apply
    module._apply(fn)
  File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 574, in _apply
    module._apply(fn)
  File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 574, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 597, in _apply
    param_applied = fn(param)
  File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 714, in <lambda>
    return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
```

And the following is the nndet_env output:

```
----- PyTorch Information -----
PyTorch Version: 1.10.1+cu111
PyTorch Debug: False
PyTorch CUDA: 11.1
PyTorch Backend cudnn: 8005
PyTorch CUDA Arch List: ['sm_37', 'sm_50', 'sm_60', 'sm_70', 'sm_75', 'sm_80', 'sm_86']
PyTorch Current Device Capability: (6, 0)
PyTorch CUDA available: True

----- System Information -----
System NVCC: nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Tue_Sep_15_19:10:02_PDT_2020
Cuda compilation tools, release 11.1, V11.1.74
Build cuda_11.1.TC455_06.29069683_0

System Arch List: None
System OMP_NUM_THREADS: 1
System CUDA_HOME is None: True
System CPU Count: 32
Python Version: 3.9.0 (default, Nov 15 2020, 14:28:56) [GCC 7.3.0]

----- nnDetection Information -----
det_num_threads 6
det_data is set True
det_models is set True
```
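
For comparison, the same information can also be queried directly with standard PyTorch calls (a minimal sketch, assuming a single visible GPU):

```python
import sys
import torch

# Query the values that nndet_env reports, using only standard PyTorch APIs.
print("PyTorch version:", torch.__version__)
print("CUDA (PyTorch build):", torch.version.cuda)
print("cuDNN version:", torch.backends.cudnn.version())
print("Device:", torch.cuda.get_device_name(0))
print("Device capability:", torch.cuda.get_device_capability(0))
print("Total device memory (GB):", torch.cuda.get_device_properties(0).total_memory / 1024 ** 3)
print("Python:", sys.version)
```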

mibaumgartner commented 6 months ago

Dear @karon999 ,

Can you start it with CUDA_LAUNCH_BLOCKING=1 or check for any other inconsistencies? The error does not really show what is going wrong right now.
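
For reference, a minimal sketch of setting the variable programmatically; exporting it in the shell before launching nndet_prep works just as well, as long as it is set before the first CUDA call:

```python
import os

# Force synchronous kernel launches so CUDA errors are reported at the
# call that actually fails (must be set before CUDA is initialised).
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```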

Best, Michael

karon999 commented 6 months ago

After adding CUDA_LAUNCH_BLOCKING=1, this is now the error:

```
2024-02-29 03:40:25.625 | INFO | nndet.planning.estimator:measure:193 - Estimating on cuda:0 with shape [1, 64, 192, 192] and batch size 4 and num_instances 5
2024-02-29 03:40:55.235 | INFO | nndet.planning.estimator:measure:242 - Caught error (If out of memory error do not worry): cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.
```

```python
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 128, 64, 48, 48], dtype=torch.half, device='cuda', requires_grad=True)
net = torch.nn.Conv3d(128, 128, kernel_size=[3, 3, 3], padding=[1, 1, 1], stride=[1, 1, 1], dilation=[1, 1, 1], groups=1)
net = net.cuda().half()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
```

```
ConvolutionParams
    data_type = CUDNN_DATA_HALF
    padding = [1, 1, 1]
    stride = [1, 1, 1]
    dilation = [1, 1, 1]
    groups = 1
    deterministic = true
    allow_tf32 = true
input: TensorDescriptor 0x7ff0bc040170
    type = CUDNN_DATA_HALF
    nbDims = 5
    dimA = 4, 128, 64, 48, 48,
    strideA = 18874368, 147456, 2304, 48, 1,
output: TensorDescriptor 0x7ff0bc0088a0
    type = CUDNN_DATA_HALF
    nbDims = 5
    dimA = 4, 128, 64, 48, 48,
    strideA = 18874368, 147456, 2304, 48, 1,
weight: FilterDescriptor 0x564c610a2de0
    type = CUDNN_DATA_HALF
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 5
    dimA = 128, 128, 3, 3, 3,
Pointer addresses:
    input: 0x7fee27800000
    output: 0x7fee1e800000
    weight: 0x7ff034ed8000
Additional pointer addresses:
    grad_output: 0x7fee1e800000
    grad_weight: 0x7ff034ed8000
Backward filter algorithm: 1
```

And this is the traceback:

```
Traceback (most recent call last):
  File "/home/wangpeiyu/anaconda3/envs/nndetection/bin/nndet_prep", line 33, in <module>
    sys.exit(load_entry_point('nndet', 'console_scripts', 'nndet_prep')())
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/utils/check.py", line 62, in wrapper
    return func(*args, **kwargs)
  File "/home/wangpeiyu/nndetection/nnDetection-main/scripts/preprocess.py", line 423, in main
    run(OmegaConf.to_container(cfg, resolve=True),
  File "/home/wangpeiyu/nndetection/nnDetection-main/scripts/preprocess.py", line 352, in run
    run_planning_and_process(
  File "/home/wangpeiyu/nndetection/nnDetection-main/scripts/preprocess.py", line 179, in run_planning_and_process
    plan_identifiers = planner.plan_experiment(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/experiment/v001.py", line 43, in plan_experiment
    plan_3d = self.plan_base_stage(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/experiment/base.py", line 234, in plan_base_stage
    architecture_plan = architecture_planner.plan(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/architecture/boxes/c002.py", line 127, in plan
    res = super().plan(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/architecture/boxes/base.py", line 346, in plan
    patch_size = self._plan_architecture(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/architecture/boxes/c002.py", line 205, in _plan_architecture
    _, fits_in_mem = self.estimator.estimate(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/estimator.py", line 128, in estimate
    res = self._estimate_mem_available(
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/estimator.py", line 155, in _estimate_mem_available
    fixed, dynamic = self.measure(shape=target_shape,
  File "/home/wangpeiyu/nndetection/nnDetection-main/nndet/planning/estimator.py", line 253, in measure
    network.cpu()
  File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 714, in cpu
    return self._apply(lambda t: t.cpu())
  File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 574, in _apply
    module._apply(fn)
  File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 574, in _apply
    module._apply(fn)
  File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 574, in _apply
    module._apply(fn)
  [Previous line repeated 4 more times]
  File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 597, in _apply
    param_applied = fn(param)
  File "/home/wangpeiyu/anaconda3/envs/nndetection/lib/python3.9/site-packages/torch/nn/modules/module.py", line 714, in <lambda>
    return self._apply(lambda t: t.cpu())
RuntimeError: CUDA error: an illegal memory access was encountered
```

mibaumgartner commented 6 months ago

I tried to generalise the error handling a little bit. Can you please pull the latest GitHub version and try again? (Make sure that you have nnDetection installed with the -e option so the changes also take effect.)

karon999 commented 6 months ago

Okay, I will download the latest nnDetection and give it a try.

karon999 commented 6 months ago

Thanks for the suggestion, I've updated to the latest version. But something weird happens. When I select about 50 nii images from the entire dataset to put into ImagesTr, preprocessing works fine. However, when around 100 images are put in, the error (RuntimeError: CUDA error: an illegal memory access was encountered) still occurs. I've looked into it, and it could be because the batch size or image size is too big for the memory (I'm using a GPU with 16 GB of memory). Do you think this is a valid reason? If so, how can I modify it? Thanks again for your help; I would really appreciate it if you could help solve this problem. Best, Karon
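
For reference, a minimal sketch (assuming a single visible GPU) for checking how much of the card's memory is actually in use, using only standard torch.cuda calls:

```python
import torch

# Report total device memory and what PyTorch has currently allocated/reserved.
props = torch.cuda.get_device_properties(0)
total_gb = props.total_memory / 1024 ** 3
allocated_gb = torch.cuda.memory_allocated(0) / 1024 ** 3
reserved_gb = torch.cuda.memory_reserved(0) / 1024 ** 3
print(f"{props.name}: total {total_gb:.1f} GB, "
      f"allocated {allocated_gb:.2f} GB, reserved {reserved_gb:.2f} GB")
```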

karon999 commented 6 months ago

Or is there a possibility that I could preprocess a portion of the dataset at a time and combine them at the end, and if so, which files would I need to make changes to?

mibaumgartner commented 6 months ago

Sorry for the delay, I needed to take care of several deadlines.

Unfortunately it is not possible to run the preprocessing on parts of the dataset since the properties need to be extracted from the entire dataset first.

There is a loop which iteratively reduces the patch size until it fits into memory. For some reason, the error raised during this process (the Out of Memory error) is not caught correctly and thus the whole program crashes. One solution which comes to my mind would be to modify these lines https://github.com/MIC-DKFZ/nnDetection/blob/8032b8d2d4366ce0c3812ce96f747b443fa2f080/nndet/planning/architecture/boxes/c002.py#L299 manually and reduce the value to something slightly larger than the final patch size, as sketched below.
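
A hypothetical sketch of what such a manual change could look like; the exact variable names around the linked line in c002.py are assumptions here, and the idea is simply to start the estimation loop from a smaller patch size so that a 16 GB GPU does not hit the uncaught crash:

```python
import numpy as np

def cap_initial_patch_size(patch_size, cap=(64, 160, 160)):
    """Clamp the planner's starting patch size to a value known to fit in memory.

    `patch_size` is the spatial size the planner would otherwise start from;
    `cap` should be chosen slightly larger than the patch size you expect to
    end up with (the value here is only an illustrative example for a 16 GB GPU).
    """
    return np.minimum(np.asarray(patch_size), np.asarray(cap)).tolist()

# e.g. applied to the patch size right before the memory estimation loop:
# patch_size = cap_initial_patch_size(patch_size)
```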

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] commented 4 months ago

This issue was closed because it has been inactive for 14 days since being marked as stale.