RuntimeError: CUDA error: an illegal memory access was encountered

amarzullo24 commented 2 years ago

Bug

When running the train_video.py example using the option --cuda I get the following error:

RuntimeError: CUDA error: an illegal memory access was encountered

To Reproduce

Steps to reproduce the behavior:

python train_video.py -d data --batch-size 16 -lr 1e-4 --cuda --save

This produces the following traceback:

Traceback (most recent call last):
  File "/home/user/videocompression/main.py", line 473, in <module>
    main(sys.argv[1:])
  File "/home/user/videocompression/main.py", line 442, in main
    train_one_epoch(
  File "/home/user/videocompression/main.py", line 236, in train_one_epoch
    out_criterion["loss"].backward()
  File "/home/user/.conda/envs/compression/lib/python3.9/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/user/.conda/envs/compression/lib/python3.9/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
  File "/home/user/.conda/envs/compression/lib/python3.9/site-packages/torch/autograd/function.py", line 89, in apply
    return self._forward_cls.backward(self, *args)  # type: ignore
  File "/home/user/videocompression/compressai/compressai/layers/layers.py", line 293, in backward
    grad_input[input < 0] = grad_sub[input < 0]
RuntimeError: CUDA error: an illegal memory access was encountered
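
CUDA kernels are launched asynchronously, so the Python frame shown in a traceback like this one is not necessarily where the illegal access happened. Setting the CUDA_LAUNCH_BLOCKING=1 environment variable makes kernel launches synchronous, which usually gives a more trustworthy stack trace. Below is a minimal sketch of rerunning the command that way. The helper names blocking_env and run_blocking are hypothetical and not part of CompressAI or PyTorch.

```python
import os
import subprocess


def blocking_env(base_env=None):
    """Return a copy of the environment with synchronous CUDA launches enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


def run_blocking(cmd):
    """Run cmd (a list of arguments) with CUDA_LAUNCH_BLOCKING=1 and return its exit code."""
    return subprocess.run(cmd, env=blocking_env()).returncode


# Example call (not executed here), using the command from the reproduction step:
# run_blocking(["python", "train_video.py", "-d", "data", "--batch-size", "16",
#               "-lr", "1e-4", "--cuda", "--save"])
```

Training runs noticeably slower with synchronous launches, so this is only meant for locating the failing operation.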

Expected behavior

The training completes without any issue.

Environment

Output of python3 -m torch.utils.collect_env:

Collecting environment information...
PyTorch version: 1.8.0+cu111
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.3 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.16.3

Python version: 3.9 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA A100-PCIE-40GB
  MIG 3g.20gb     Device  0:
GPU 1: NVIDIA A100-PCIE-40GB
GPU 2: NVIDIA A100-PCIE-40GB
  MIG 3g.20gb     Device  0:

Nvidia driver version: 470.57.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.2.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.2.4
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.21.2
[pip3] pytorch-msssim==0.2.1
[pip3] torch==1.8.0+cu111
[pip3] torchaudio==0.8.0
[pip3] torchvision==0.9.0+cu111
[conda] blas                      1.0                         mkl
[conda] cudatoolkit               10.2.89              hfd86e86_1
[conda] libblas                   3.9.0            12_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            12_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            12_linux64_mkl    conda-forge
[conda] liblapacke                3.9.0            12_linux64_mkl    conda-forge
[conda] mkl                       2021.4.0           h06a4308_640
[conda] mkl-service               2.4.0            py39h7f8727e_0
[conda] mkl_fft                   1.3.1            py39hd3c417c_0
[conda] mkl_random                1.2.2            py39h51133e4_0
[conda] numpy                     1.21.2           py39h20f2e39_0
[conda] numpy-base                1.21.2           py39h79a1101_0
[conda] pytorch-msssim            0.2.1                    pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torch                     1.8.0+cu111              pypi_0    pypi
[conda] torchaudio                0.8.0                    pypi_0    pypi
[conda] torchvision               0.9.0+cu111              pypi_0    pypi

Additional context

While debugging the script, I found that the bug might be caused by the flow variable created at video/google.py#L378. Accessing this variable raises a CUDNN_STATUS_NOT_INITIALIZED error, and replacing it with a torch.rand_like(flow) tensor makes the error go away.

As a workaround, I temporarily replaced that line as follows:

def forward_prediction(self, x_ref, motion_info):
    # FIXME: this causes a cuDNN error (CUDNN_STATUS_NOT_INITIALIZED),
    # possibly a cuDNN version problem.
    # flow, scale_field = motion_info.chunk(2, dim=1)
    # TODO: possible workaround, slicing the channels explicitly:
    flow, scale_field = motion_info[:, :2, :, :], motion_info[:, -1, :, :].unsqueeze(1)

    volume = self.gaussian_volume(x_ref, self.sigma0, self.num_levels)
    x_pred = self.warp_volume(volume, flow, scale_field)  # volume, [gx, gy], gz
    return x_pred

This workaround seems to work (the training starts and completes with no further errors), although I am not 100% sure that the two operations are equivalent.
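
On the equivalence question: torch.chunk splits a dimension into pieces of size ceil(size / chunks), with a possibly smaller last piece. Assuming motion_info has exactly three channels (two for the flow and one for the scale field, which is an assumption here), chunk(2, dim=1) would select channels [0, 2) and [2, 3), the same ranges as the slicing workaround. The sketch below models that splitting rule in plain Python and does not call PyTorch. The function names are hypothetical.

```python
import math


def chunk_ranges(size, chunks):
    """Model of torch.chunk's splitting rule along one dimension.

    Assumes size > 0. Each piece has ceil(size / chunks) elements,
    except possibly the last one.
    """
    step = math.ceil(size / chunks)
    return [(start, min(start + step, size)) for start in range(0, size, step)]


def workaround_ranges(size):
    """Channel ranges selected by motion_info[:, :2] and motion_info[:, -1]."""
    return [(0, min(2, size)), (size - 1, size)]


# With three channels both approaches select the same channels.
print(chunk_ranges(3, 2) == workaround_ranges(3))

# With four channels they would differ: chunk gives [0, 2) and [2, 4),
# while the slicing keeps only [0, 2) and the last channel [3, 4).
print(chunk_ranges(4, 2) == workaround_ranges(4))
```

On a real tensor, the same check could be done by comparing both results with torch.equal.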

I think this problem may be related to the cuDNN installation on the server I am using for the experiments. I wonder if anyone else is experiencing a similar issue.

fracape commented 2 years ago

Hi @emmeduz, thanks for this report. I can't reproduce it from 2 envs:

zzzerow commented 2 years ago

I got the same error. Upgrading PyTorch to 1.11.0 solved the problem.
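
For anyone who wants to flag older installs before training starts, here is a small sketch that compares a PyTorch version string against 1.11. The 1.11 threshold comes only from the report above and is not a confirmed minimum. The helper names are hypothetical.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) from a version string such as '1.8.0+cu111'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_older_than(version, minimum=(1, 11)):
    """Return True if version is below the given (major, minor) threshold."""
    return parse_major_minor(version) < minimum


# The version from the environment report above versus the reported fix.
print(is_older_than("1.8.0+cu111"))
print(is_older_than("1.11.0"))
```

In a training script the string would typically come from torch.__version__.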

InterDigitalInc / CompressAI