RunTimeError - Githubissues

zzzyzh commented 1 year ago

Hi! I ran into a problem when training on my own dataset. The first time, training ran for about 33 steps and then the process was killed without any exception being thrown. When I restarted the training, I got the following error:

/opt/conda/conda-bld/pytorch_1656352465323/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [924,0,0], thread: [31,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.

Training (109 / 10000 Steps) (loss=1.36912): 5%|▌ | 110/2064 [01:40<29:45, 1.09it/s]
Traceback (most recent call last):
  File "main_train.py", line 261, in <module>
    global_step, dice_val_best, global_step_best = train(
  File "main_train.py", line 208, in train
    loss.backward()
  File "/root/miniconda3/envs/uxnet3d/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/miniconda3/envs/uxnet3d/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

zzzyzh commented 1 year ago

I am using an RTX 3090 with python = 3.8.16 and pytorch = 1.12.0 (py3.8_cuda11.3_cudnn8.3.2_0).

leeh43 commented 1 year ago

Hi, the error "RuntimeError: Unable to find a valid cuDNN algorithm to run convolution" is related to your GPU memory. The other message suggests something is off with your input data: "operator(): block: [924,0,0], thread: [31,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed". Please check whether your input data size is 96x96x96.
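CUDA kernels run asynchronously, so a device-side assert can surface later as an unrelated-looking error in a different call, which may be why the cuDNN message shows up in loss.backward(). One common way to get a traceback that points at the operation that actually failed is to set the CUDA_LAUNCH_BLOCKING environment variable before CUDA is initialised. A minimal sketch, assuming the lines are placed at the very top of the training script:

```python
import os

# Must be set before the first CUDA call; the safest place is
# before torch is imported at all.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # the rest of the training script would follow here
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

This slows training down, so it is only meant for a debugging run.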

zzzyzh commented 1 year ago

I have checked the dimensions of my data and they all match the network's input size.

zzzyzh commented 1 year ago

My dataset is a set of PNG images stacked into .nii.gz and then resampled to match the input requirements. I don't know whether there are any other requirements for the dataset, apart from the images being larger than (96,96,96).

leeh43 commented 1 year ago

I believe you may need to look at load_datasets_transforms.py and check whether the training transforms fit your dataset. The original training transform crops 96x96x96 patches from the original image, so you don't need to resample the input dimensions to 96x96x96. Ideally your input data would be something like 512x512x~100 (if it is CT).

zzzyzh commented 1 year ago

Yes, my input data is around 200x200x100. I changed crop_sample from 2 to 1 and that error disappeared, but another error now occurs:

/opt/conda/conda-bld/pytorch_1656352465323/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [7220,0,0], thread: [3,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Validate (0 / 10 Steps) (dice=0.28939): 6%|▋ | 28/435 [01:14<17:57, 2.65s/it]
Training (500 / 8000 Steps) (loss=0.90021): 24%|██▍ | 500/2064 [06:33<20:29, 1.27it/s]
Traceback (most recent call last):
  File "main_train.py", line 276, in <module>
    global_step, dice_val_best, global_step_best = train(
  File "main_train.py", line 236, in train
    dice_val = validation(epoch_iterator_val)
  File "main_train.py", line 171, in validation
    val_labels_convert = [
  File "main_train.py", line 172, in <listcomp>
    post_label(val_label_tensor) for val_label_tensor in val_labels_list
  File "/root/miniconda3/envs/uxnet3d/lib/python3.8/site-packages/monai/utils/deprecate_utils.py", line 217, in _wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/uxnet3d/lib/python3.8/site-packages/monai/utils/deprecate_utils.py", line 217, in _wrapper
    return func(*args, **kwargs)
  File "/root/miniconda3/envs/uxnet3d/lib/python3.8/site-packages/monai/utils/deprecate_utils.py", line 217, in _wrapper
    return func(*args, **kwargs)
  [Previous line repeated 1 more time]
  File "/root/miniconda3/envs/uxnet3d/lib/python3.8/site-packages/monai/transforms/post/array.py", line 242, in __call__
    img_t = one_hot(img_t, num_classes=to_onehot, dim=0)
  File "/root/miniconda3/envs/uxnet3d/lib/python3.8/site-packages/monai/networks/utils.py", line 97, in one_hot
    labels = o.scatter_(dim=dim, index=labels.long(), value=1)
RuntimeError: CUDA error: device-side assert triggered
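The last frames of this traceback are inside MONAI's one_hot, which scatters along the class dimension using the label values as indices. A scatter like that asserts when an index falls outside [0, num_classes), so one thing that may be worth checking (an assumption about the cause, not something confirmed in this thread) is whether the label volumes contain values that the configured number of classes does not cover, for example 255 left over from PNG masks. A minimal sketch in plain Python; find_out_of_range_labels and the sample values are hypothetical:

```python
from typing import Iterable, List


def find_out_of_range_labels(values: Iterable[float], num_classes: int) -> List[float]:
    """Return the sorted distinct label values that are not valid class indices.

    A valid index is a whole number k with 0 <= k < num_classes.
    """
    bad = set()
    for v in values:
        if v != int(v) or not (0 <= int(v) < num_classes):
            bad.add(v)
    return sorted(bad)


# Hypothetical label values from a binary mask saved as 0/255 PNG slices.
labels = [0, 0, 255, 255, 0, 255]
print(find_out_of_range_labels(labels, num_classes=2))  # prints [255]
```

For a real volume loaded as a NumPy array, passing np.unique(volume) instead of the full array keeps the check fast.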

leeh43 commented 1 year ago

Yes, it is in fact the same error as above. I am wondering whether you have resampled your data to 1.0 x 1.0 x 1.2 resolution. It seems the index goes out of bounds when we crop the samples for input.

zzzyzh commented 1 year ago

Do you mean to resample the existing dataset to the original (1, 1, 1.2)?

leeh43 commented 1 year ago

Yes, because my training transform includes a step that resamples the samples to (1, 1, 1.2) resolution, which ensures the samples have enough foreground to crop patches from.

zzzyzh commented 1 year ago

So you mean that the minimum input image size for the network is (x,y,z) = (96, 96, 96x1.2)?

leeh43 commented 1 year ago

Every image slice has a corresponding thickness from when it was reconstructed, so please look at your data and check the resolution of your image samples. For example, a CT image might have a size of 512 x 512 x 107 with a resolution of 0.8 x 0.8 x 1.5.

zzzyzh commented 1 year ago

I think one of my CT images has a size of 512 x 512 x 29 with a spacing of 0.4 x 0.4 x 5.
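The relation between voxel count, spacing and the 96x96x96 patch can be checked with simple arithmetic: the physical extent along each axis is size times spacing, and resampling to a target spacing gives roughly extent divided by target spacing voxels. The sketch below is an approximation under that assumption (the exact rounding used by the repository's resampling transform may differ), and the helper names resampled_shape and fits_patch are hypothetical:

```python
from typing import Sequence, Tuple


def resampled_shape(
    shape: Sequence[int],
    spacing: Sequence[float],
    target_spacing: Sequence[float] = (1.0, 1.0, 1.2),
) -> Tuple[int, ...]:
    """Approximate voxel counts after resampling to target_spacing."""
    return tuple(
        max(1, round(n * s / t)) for n, s, t in zip(shape, spacing, target_spacing)
    )


def fits_patch(shape: Sequence[int], patch: Sequence[int] = (96, 96, 96)) -> bool:
    """True if every axis is at least as large as the patch size."""
    return all(n >= p for n, p in zip(shape, patch))


# The volume described above: 512 x 512 x 29 at 0.4 x 0.4 x 5 spacing.
raw = (512, 512, 29)
new = resampled_shape(raw, (0.4, 0.4, 5.0))
print(new, fits_patch(new))  # prints (205, 205, 121) True
print(fits_patch(raw))       # prints False: only 29 slices before resampling
```

By this estimate the volume is large enough for a 96x96x96 crop only after it has been resampled to (1, 1, 1.2); at its native spacing the 29 slices are too few.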

zzzyzh commented 1 year ago

Thank you for your patience. I've managed to get it up and running.

OCEANOUXIN commented 9 months ago

Hi, I ran into the same problem. Could you tell me how you fixed the "RuntimeError: Unable to find a valid cuDNN algorithm to run convolution" error?

MASILab / 3DUX-Net

RunTimeError #19