Closed zzzyzh closed 1 year ago
I am using an RTX 3090 with python = 3.8.16 and pytorch = 1.12.0 (build py3.8_cuda11.3_cudnn8.3.2_0).
Hi, the error "RuntimeError: Unable to find a valid cuDNN algorithm to run convolution" is related to your GPU memory. Also, something looks off about your input data: "operator(): block: [924,0,0], thread: [31,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed". Please check whether your input data size is 96x96x96.
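As a general sanity check (an assumption, not something confirmed in this thread): this ScatterGatherKernel assertion fires when a scatter/gather index falls outside the target dimension, and one common way to hit it is a label volume whose values fall outside 0..num_classes-1, for example masks exported from PNG as 0/255. A minimal sketch in plain Python; `invalid_labels` and `num_classes` are hypothetical names, not part of this repository:

```python
def invalid_labels(label_values, num_classes):
    """Return the sorted label values that fall outside [0, num_classes)."""
    return sorted({int(v) for v in label_values if not 0 <= int(v) < num_classes})


# Hypothetical flattened mask: background 0, foreground stored as 255.
mask = [0, 0, 255, 255, 0, 1]
print(invalid_labels(mask, num_classes=2))  # [255]
```

An empty list means every label value is a valid class index; anything else would need remapping before training.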
I have checked the dimensions of my data, and they all match the network input.
My dataset is a set of PNG images stacked into .nii.gz volumes and then resampled to match the input requirements. I don't know whether there are any other requirements for the dataset, other than that the size of the images should be larger than (96,96,96).
I believe you may need to take a look at load_datasets_transforms.py and check whether the train transform fits your dataset. Originally, the training transform crops 96x96x96 patches from the original image, so you don't actually need to resample the input dimensions to 96x96x96. It would be great if your input data is something like 512x512x~100 (if it is CT).
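To make the patch requirement concrete, here is a minimal sketch of the size check implied above: a 96x96x96 patch can only be cropped without padding if every axis of the volume is at least 96 voxels. `fits_patch` is a hypothetical helper, not a function from load_datasets_transforms.py.

```python
def fits_patch(shape, patch=(96, 96, 96)):
    """True if a patch of size `patch` fits inside a volume of size `shape`."""
    if len(shape) != len(patch):
        raise ValueError("shape and patch must have the same number of axes")
    return all(s >= p for s, p in zip(shape, patch))


print(fits_patch((512, 512, 100)))  # True
print(fits_patch((512, 512, 29)))   # False: only 29 voxels along z
```

The check is meant to be run on the volume shape as it reaches the crop step, that is, after any resampling in the transform chain.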
Yes, my input data is about 200x200x100. I changed crop_sample from 2 to 1 and this error disappeared, but now another error occurs:
/opt/conda/conda-bld/pytorch_1656352465323/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [7220,0,0], thread: [3,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Validate (0 / 10 Steps) (dice=0.28939): 6%|▋ | 28/435 [01:14<17:57, 2.65s/it]
Training (500 / 8000 Steps) (loss=0.90021): 24%|██▍ | 500/2064 [06:33<20:29, 1.27it/s]
Traceback (most recent call last):
File "main_train.py", line 276, in
Yes, it is in fact the same error as above. I am wondering whether you have resampled your data to 1.0 x 1.0 x 1.2 resolution. It seems the index goes out of bounds when we crop the samples for input.
Do you mean to resample the existing dataset to the original (1, 1, 1.2)?
Yes, because my transform for the training dataset includes a function that resamples the samples to (1, 1, 1.2) resolution, which ensures that the samples have enough foreground to crop patches from.
So you mean that the minimum image size of the input network is (x,y,z) = (96, 96, 96x1.2)?
Every image slice has a corresponding thickness when it is reconstructed, so please take a look at your data and check the resolution of your image samples. For example, a CT image might have a size of 512 x 512 x 107 with a resolution of 0.8x0.8x1.5.
I think that one of my CT images has a size of 512 x 512 x 29 with a spacing of 0.4 x 0.4 x 5.
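For what it's worth, the effect of resampling to (1, 1, 1.2) can be estimated by hand: the physical extent along each axis is size x spacing, and dividing that by the target spacing gives the approximate new size. A rough sketch, assuming simple rounding (the actual resampling transform may differ by a voxel); `resampled_shape` is a hypothetical helper:

```python
def resampled_shape(size, spacing, target=(1.0, 1.0, 1.2)):
    """Estimate the voxel grid size after resampling to `target` spacing."""
    return tuple(round(n * s / t) for n, s, t in zip(size, spacing, target))


# The two volumes mentioned in this thread.
print(resampled_shape((512, 512, 107), (0.8, 0.8, 1.5)))  # (410, 410, 134)
print(resampled_shape((512, 512, 29), (0.4, 0.4, 5.0)))   # (205, 205, 121)
```

By this estimate both volumes end up larger than 96 on every axis after resampling, even though the second one has only 29 slices beforehand.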
Thank you for your patience. I've managed to get it up and running.
Hi, I ran into the same problem as you. Could you tell me how you fixed the "RuntimeError: Unable to find a valid cuDNN algorithm to run convolution" error?
Hi! I have run into a problem when using my own dataset. The first time, I trained for about 33 steps and then the program was killed, but no exceptions were thrown. When I try to restart the training, I get the following error:
/opt/conda/conda-bld/pytorch_1656352465323/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [924,0,0], thread: [31,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Training (109 / 10000 Steps) (loss=1.36912): 5%|▌ | 110/2064 [01:40<29:45, 1.09it/s]
Traceback (most recent call last):
File "main_train.py", line 261, in
global_step, dice_val_best, global_step_best = train(
File "main_train.py", line 208, in train
loss.backward()
File "/root/miniconda3/envs/uxnet3d/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
File "/root/miniconda3/envs/uxnet3d/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution