cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

jt551 commented 8 months ago

Hello,
I'm trying to run the sample notebook on a new laptop with Ubuntu 20.04, an RTX 2000 GPU, and nvidia-driver-535.
When I execute the following section in samples.ipynb:

Networks prediction for the segmentation

I immediately get the following error at the model() call:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-6-d742f71e52a2> in <module>
     11         # We rotate first the image
     12         rot_image = rot(image, 'tensor', forward)
---> 13         pred = model(rot_image)
     14         # We rotate prediction back
     15         pred = rot(pred, 'tensor', back)

~/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

/app/floortrans/models/hg_furukawa_original.py in forward(self, x)
    134 
    135     def forward(self, x):
--> 136         out = self.conv1_(x)
    137         out = self.bn1(out)
    138         out = self.relu1(out)

~/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

~/miniconda/envs/py36/lib/python3.6/site-packages/torch/nn/modules/conv.py in forward(self, input)
    318     def forward(self, input):
    319         return F.conv2d(input, self.weight, self.bias, self.stride,
--> 320                         self.padding, self.dilation, self.groups)
    321 
    322 

RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED

The terminal running Docker shows:

THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1544174967633/work/aten/src/THC/THCGeneral.cpp line=405 error=8 : invalid device function

Could I get help resolving this issue?
Thank you!

tippo00 commented 6 months ago

Hi, I got the same error as you when trying to run samples.ipynb and eval.py. Have you found a solution? Best regards

jt551 commented 6 months ago

No solution yet. It works on Paperspace with an older P4000 GPU, since the Pascal architecture is supported by cuDNN 7.6.5 (CUDA 9): https://docs.nvidia.com/deeplearning/cudnn/archives/cudnn-824/support-matrix/#cudnn-versions-764-765

https://docs.nvidia.com/cuda/ada-compatibility-guide/ suggests running with CUDA_FORCE_PTX_JIT=1, but this produced the same error.
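For anyone hitting the same "invalid device function" message, here is a rough sketch of how one might check whether a PyTorch build ships kernels for a given GPU. The helper names (parse_arch, can_run_on) are made up for illustration, and the architecture list in the usage note is an assumed example, not the actual contents of the image in this repo. The rule it encodes is the general CUDA one: an sm_XY binary runs on devices with the same major version and an equal or higher minor version, while compute_XY PTX can be JIT-compiled for any equal or newer device. torch.cuda.get_arch_list() only exists in newer PyTorch releases, so the live check is guarded.

```python
import re


def parse_arch(tag):
    """Split a tag such as 'sm_70' or 'compute_89' into (kind, major, minor).

    Returns None for tags this sketch does not understand.
    """
    match = re.fullmatch(r"(sm|compute)_(\d+)(\d)", tag)
    if match is None:
        return None
    kind, major, minor = match.groups()
    return kind, int(major), int(minor)


def can_run_on(capability, arch_list):
    """Return True if any entry in arch_list can serve a device with the
    given (major, minor) compute capability."""
    dev_major, dev_minor = capability
    for tag in arch_list:
        parsed = parse_arch(tag)
        if parsed is None:
            continue
        kind, major, minor = parsed
        # Compiled binaries: same major version, minor not newer than the device.
        if kind == "sm" and major == dev_major and minor <= dev_minor:
            return True
        # Embedded PTX: can be JIT-compiled for any equal or newer device.
        if kind == "compute" and (major, minor) <= (dev_major, dev_minor):
            return True
    return False


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None
    # get_arch_list() is missing in old PyTorch releases, hence the hasattr guard.
    if torch is not None and torch.cuda.is_available() and hasattr(torch.cuda, "get_arch_list"):
        capability = torch.cuda.get_device_capability(0)
        archs = torch.cuda.get_arch_list()
        print("device capability:", capability)
        print("build architectures:", archs)
        print("supported:", can_run_on(capability, archs))
```

With an assumed list such as ["sm_35", "sm_50", "sm_60", "sm_70"], can_run_on((6, 1), ...) is True for a Pascal card, while can_run_on((8, 9), ...) is False for an Ada card. If the build also embeds no PTX, forcing the PTX JIT would have nothing to compile, which would be consistent with CUDA_FORCE_PTX_JIT=1 not helping here.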

tippo00 commented 6 months ago

Hi, I got the sample notebook to work by running it on a newer version of CUDA (11.8) on my RTX 4070. I did this by first changing the Dockerfile to:

FROM anibali/pytorch:2.0.1-cuda11.8-ubuntu22.04

# RUN sudo apt-get update
# RUN sudo apt-get upgrade -y
# RUN sudo apt-get install -y \
#         build-essential 

RUN sudo apt-get update \
 && sudo apt-get install -y libgl1-mesa-glx libgtk2.0-0 libsm6 libxext6 \
 && sudo rm -rf /var/lib/apt/lists/*

COPY requirements.txt /app/.

RUN pip install -r requirements.txt

I then changed requirements.txt by removing the pinned versions on all packages, adding opencv-python, and removing mkl-fft and mkl-random.

This led to the error ValueError: A colormap named "rooms_furu" is already registered. in /floortrans/plotting.py, which I fixed by changing line 610 in plotting.py to cmap3 = colors.ListedColormap(cpool, 'rooms_furu2').

I can now run the entirety of samples.ipynb without any errors, but I now get a different error when running eval.py:

$ python eval.py --weights model_best_val_loss_var.pkl
Traceback (most recent call last):                                              
  File "/app/eval.py", line 109, in <module>
    evaluate(args, log_dir, writer, logger)
  File "/app/eval.py", line 67, in evaluate
    things = get_evaluation_tensors(val, model, split, logger, rotate=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/floortrans/metrics.py", line 176, in get_evaluation_tensors
    predicted_classes = polygons_to_tensor(
                        ^^^^^^^^^^^^^^^^^^^
  File "/app/floortrans/metrics.py", line 127, in polygons_to_tensor
    ten[pol_type['class'] + d][jj, ii] = 1
    ~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^
IndexError: index 521 is out of bounds for axis 0 with size 521

foojayyy commented 4 months ago

Have you resolved this issue? Thank you!

tippo00 commented 4 months ago

Yes, I got it to work by changing 4 lines in floortrans/post_prosessing.py. From:

        polygon[:, 0] = np.clip(polygon[:, 0], 0, max_width)
        polygon[:, 1] = np.clip(polygon[:, 1], 0, max_height)

To:

        polygon[:, 0] = np.clip(polygon[:, 0], 0, max_width-1)
        polygon[:, 1] = np.clip(polygon[:, 1], 0, max_height-1)

I made this change in two places: the first around line 925 and the second around line 981. Hope this helps!
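To illustrate why the -1 matters, here is a small standalone sketch (not the code from post_prosessing.py). The clip_polygon helper, the 521x521 size, and the polygon coordinates are assumptions picked to match the IndexError above. An array with 521 rows has valid indices 0 to 520, so clipping to max_width or max_height still lets the value 521 through.

```python
import numpy as np

# Illustrative size taken from the IndexError message (axis 0 with size 521).
height, width = 521, 521
mask = np.zeros((height, width), dtype=np.uint8)

# Hypothetical polygon given as (x, y) vertices, some lying on or past the border.
polygon = np.array([[10, 20], [600, 300], [521, 521]])


def clip_polygon(polygon, max_width, max_height):
    """Clamp vertices to the last valid index (size - 1) on each axis."""
    clipped = polygon.copy()
    clipped[:, 0] = np.clip(clipped[:, 0], 0, max_width - 1)
    clipped[:, 1] = np.clip(clipped[:, 1], 0, max_height - 1)
    return clipped


clipped = clip_polygon(polygon, width, height)

# Indexing as [row, column] = [y, x] is now safe for every vertex.
mask[clipped[:, 1], clipped[:, 0]] = 1
print(clipped)
```

Without the -1, the vertex (521, 521) would pass through the clip unchanged and the final assignment would raise the same kind of IndexError as in the traceback.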

CubiCasa / CubiCasa5k

cuDNN error: CUDNN_STATUS_EXECUTION_FAILED #58
