munib94 closed this issue 3 years ago.
Can you try the unmodified version first? It already supports single-GPU training. Let me know if you get the same error.
I've pinpointed the error to the input depth. With an input depth of 5 or less, the network trains as expected; if I change it to 8, I get the error. When I trained with an input depth of 8 and multiple GPUs enabled, I got a different error:
>auto::operator()(int)->auto: block: [1367,0,0], thread: [31,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1556653183467/work/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7f1ad3ebcdc5 in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x14792 (0x7f1ad0d14792 in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x50 (0x7f1ad3eac640 in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x3067fb (0x7f1ad14337fb in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libtorch.so.1)
frame #4: <unknown function> + 0x13ff1b (0x7f1af9c9cf1b in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x3bf384 (0x7f1af9f1c384 in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: <unknown function> + 0x3bf3d1 (0x7f1af9f1c3d1 in /home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #23: __libc_start_main + 0xf5 (0x7f1b0eec4445 in /lib64/libc.so.6)
Aborted (core dumped)
When I tried training again with an input depth of 8 and a single GPU, I got a THCudaCheck FAIL error message in addition to the error above:
>auto::operator()(int)->auto: block: [1387,0,0], thread: [94,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/ATen/native/cuda/IndexKernel.cu:53: lambda [](int)->auto::operator()(int)->auto: block: [1387,0,0], thread: [95,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1556653183467/work/aten/src/THC/THCCachingHostAllocator.cpp line=265 error=59 : device-side assert triggered
Traceback (most recent call last):
File "./train.py", line 115, in <module>
trainer.train()
File "../../tasks/semantic/modules/trainer.py", line 238, in train
show_scans=self.ARCH["train"]["show_scans"])
File "../../tasks/semantic/modules/trainer.py", line 319, in train_epoch
output = model(in_vol, proj_mask)
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "../../tasks/semantic/modules/segmentator.py", line 149, in forward
y, skips = self.backbone(x)
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "../..//backbones/darknet.py", line 171, in forward
x, skips, os = self.run_layer(x, self.conv1, skips, os)
File "../..//backbones/darknet.py", line 154, in run_layer
y = layer(x)
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet++/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 338, in forward
self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
Hi, depth 5 is a constant for now. It means using range, x, y, z, and remission as inputs. Leave the trainer code untouched, as it already supports single- and multi-GPU training.
I added three more input channels to account for the extra data I am providing. Does this mean the network does not support input depths greater than 5?
No, it should be fine if you modify the code properly. Where did you modify it? If you share your modifications with me, maybe I can point out where you're missing something or doing something wrong.
Sorry for the delay, I was gathering my edits into a single file. I sent you an email at your uni-bonn.de email address listed on your GitHub profile.
Not sure if you reviewed my code, but do you think it is a CUDA/cuDNN compatibility issue? Which versions did you use?
Hi, I have not looked at the email yet; I will do so now and let you know. If there is anything to be learned for others, we will post it here.
You're on the right track. You need to modify this line
From:
proj = torch.cat([proj_range.unsqueeze(0).clone(),
proj_xyz.clone().permute(2, 0, 1),
proj_remission.unsqueeze(0).clone()])
to:
proj = torch.cat([proj_range.unsqueeze(0).clone(),
proj_xyz.clone().permute(2, 0, 1),
proj_remission.unsqueeze(0).clone(),
proj_extra_param.clone().permute(2,0,1)])
This includes the extra parameters in the volume that goes into the CNN.
You will also need to modify the normalization parameters so that there are 8 of them.
You should add print(proj.shape) after that to check that the shape is [8, H, W].
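For reference, here is a minimal, self-contained sketch of that assembly with dummy tensors (the H, W values and the [H, W, 3] layout of proj_extra_param are assumptions for illustration, not the repo's actual data):

import torch

# Dummy projections standing in for the real ones built in parser.py
H, W = 64, 1024
proj_range = torch.rand(H, W)            # [H, W]
proj_xyz = torch.rand(H, W, 3)           # [H, W, 3]
proj_remission = torch.rand(H, W)        # [H, W]
proj_extra_param = torch.rand(H, W, 3)   # [H, W, 3], the 3 extra channels

proj = torch.cat([proj_range.unsqueeze(0).clone(),             # 1 channel
                  proj_xyz.clone().permute(2, 0, 1),           # 3 channels
                  proj_remission.unsqueeze(0).clone(),         # 1 channel
                  proj_extra_param.clone().permute(2, 0, 1)])  # 3 channels

print(proj.shape)  # torch.Size([8, 64, 1024])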
Thanks! For the normalization parameters, are you referring to the batchnorm.py file? I noticed that the SyncBatchNorm3D class applies batch normalization over a 5d input. Is this what I need to modify to work with an 8d input?
Hi,
No, I mean this line.
It uses the means and stds of the input data to normalize the inputs. The 5 values are at the end of each config yaml file; you will need to append your 3 values to each of them, otherwise that line will fail. Beyond that, your code should run fine, but I haven't tested it myself; it just looks properly modified in the right places.
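To make that concrete, here is a small sketch of the normalization step once the lists hold 8 values instead of 5. The names sensor_img_means/sensor_img_stds are assumed to match the parser's convention, and the values are obvious placeholders; compute the real statistics from your own data and put them in the yaml. Note that with only 5 values, the per-channel subtraction in this sketch cannot broadcast against an 8-channel volume, which is why the 3 extra entries are needed.

import torch

sensor_img_means = torch.zeros(8)  # replace with the 8 real per-channel means
sensor_img_stds = torch.ones(8)    # replace with the 8 real per-channel stds

proj = torch.rand(8, 64, 1024)     # [C, H, W] volume from the previous step
proj = (proj - sensor_img_means[:, None, None]) / sensor_img_stds[:, None, None]
print(proj.shape)                  # still torch.Size([8, 64, 1024])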
It works! Thank you! Just a final question, I noticed that the means and stds only go up to two decimal places. Is there a significant difference in accuracy/performance if more significant figures are used?
Update: I ran into an error after the first epoch:
Best mean iou in training set so far, save model!
********************************************************************************
Traceback (most recent call last):
File "./train.py", line 115, in <module>
trainer.train()
File "../../tasks/semantic/modules/trainer.py", line 261, in train
save_scans=self.ARCH["train"]["save_scans"])
File "../../tasks/semantic/modules/trainer.py", line 408, in validate
for i, (in_vol, proj_mask, proj_labels, _, path_seq, path_name, _, _, _, _, _, _, _, _, _, _, _) in enumerate(val_loader):
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 582, in __next__
return self._process_next_batch(batch)
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 608, in _process_next_batch
raise batch.exc_type(batch.exc_msg)
AttributeError: Traceback (most recent call last):
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in _worker_loop
samples = collate_fn([dataset[i] for i in batch_indices])
File "/home/media-server/.pyenv/versions/anaconda3-5.0.0/envs/rangenet/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 99, in <listcomp>
samples = collate_fn([dataset[i] for i in batch_indices])
File "../..//tasks/semantic/dataset/kitti/parser.py", line 224, in __getitem__
proj_extra_param.clone().permute(2, 0, 1)])
AttributeError: 'list' object has no attribute 'clone'
Not sure why this happened. Any ideas?
The message is pretty clear there: you're trying to run the PyTorch clone method, which only applies to torch tensors, on a list object. Make sure that your data is in the proper format (a torch tensor) and not a Python list.
I'm confused because I already had the following code to convert from a NumPy array to a Torch tensor in the parser.py file:
if self.extra_param:
    proj_extra_param = torch.from_numpy(scan.proj_extra_param).clone()
else:
    proj_extra_param = []
The network has no problem training for the first epoch, but when it tries to train the second epoch, the error appears.
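For anyone hitting the same thing: the traceback above comes from the validation loader, so one likely cause is that the validation dataset ends up in the else branch and hands a plain Python list to torch.cat. Below is a minimal sketch of one way to keep the types consistent; only scan.proj_extra_param and self.extra_param come from the snippet above, everything else (names, shapes, dummy data) is illustrative. It is also worth checking that the validation split is constructed with the same extra_param flag as the training split.

import numpy as np
import torch

def make_extra_channels(proj_extra_param_np, use_extra_param):
    # Return the extra channels as a [3, H, W] tensor, or None when they are
    # disabled, so torch.cat never sees a plain Python list.
    if not use_extra_param:
        return None
    return torch.from_numpy(proj_extra_param_np).clone().permute(2, 0, 1)

# Illustrative usage with dummy data
H, W = 64, 1024
extra = make_extra_channels(np.random.rand(H, W, 3).astype(np.float32), True)

parts = [torch.rand(1, H, W),   # stands in for proj_range.unsqueeze(0)
         torch.rand(3, H, W),   # stands in for proj_xyz.permute(2, 0, 1)
         torch.rand(1, H, W)]   # stands in for proj_remission.unsqueeze(0)
if extra is not None:
    parts.append(extra)
proj = torch.cat(parts)
print(proj.shape)  # torch.Size([8, 64, 1024]) when the extra channels are on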
I am closing this issue since there does not seem to be much activity here, or the problem was resolved. If you still have problems, please re-open the issue.
I'm having this error when training. My system configurations are listed below.
OS: CentOS Linux 7
PyTorch version: 1.1.0 (installed using conda)
TensorFlow-gpu version: 1.9.0
Python version: 3.6.8
CUDA/cuDNN version: 9.0/7.0.5
GPU: Nvidia GeForce GTX 1080
I modified the network so that it has an input depth of 8 instead of 5 and noticed that this issue only appears when the depth is greater than 5. I can't figure out how to resolve the error, though.
Any ideas on how to fix this?
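Since the cuDNN error only shows up after changing the input depth, a quick CPU-only sanity check like the sketch below can confirm that a first convolution really accepts 8 channels before digging into CUDA-side problems. The layer sizes here are illustrative, not taken from the repo's darknet backbone.

import torch
import torch.nn as nn

# The first convolution must use in_channels equal to the chosen input depth.
first_conv = nn.Conv2d(in_channels=8, out_channels=32, kernel_size=3,
                       stride=1, padding=1, bias=False)

dummy = torch.rand(1, 8, 64, 1024)  # [batch, channels, H, W] dummy range image
out = first_conv(dummy)
print(out.shape)                    # torch.Size([1, 32, 64, 1024])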