CNDNN_ERROR ?

edwardcho commented 2 years ago

Hello Sir,

I tried to train your code on my own dataset, but I hit a cuDNN error.

...
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [374,0,0], thread: [62,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [374,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "/data1/TESTBOARD/additional_networks/generation/SelectionGAN_Ha0Tang/semantic_synthesis/train.py", line 40, in <module>
    trainer.run_generator_one_step(data_i)
  File "/data1/TESTBOARD/additional_networks/generation/SelectionGAN_Ha0Tang/semantic_synthesis/trainers/pix2pix_trainer.py", line 35, in run_generator_one_step
    g_losses, generated = self.pix2pix_model(data, mode='generator')
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 166, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data1/TESTBOARD/additional_networks/generation/SelectionGAN_Ha0Tang/semantic_synthesis/models/pix2pix_model.py", line 46, in forward
    input_semantics, real_image)
  File "/data1/TESTBOARD/additional_networks/generation/SelectionGAN_Ha0Tang/semantic_synthesis/models/pix2pix_model.py", line 136, in compute_generator_loss
    input_semantics, real_image, compute_kld_loss=self.opt.use_vae)
  File "/data1/TESTBOARD/additional_networks/generation/SelectionGAN_Ha0Tang/semantic_synthesis/models/pix2pix_model.py", line 198, in generate_fake
    fake_image = self.netG(input_semantics, z=z)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/data1/TESTBOARD/additional_networks/generation/SelectionGAN_Ha0Tang/semantic_synthesis/models/networks/generator.py", line 90, in forward
    x = self.fc(x)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 443, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/home/itsme/anaconda3/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 440, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED
terminate called after throwing an instance of 'c10::CUDAError'
  what():  CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:1055 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f44b3256a22 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x10aa3 (0x7f44b34b7aa3 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1a7 (0x7f44b34b9147 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0x54 (0x7f44b32405a4 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0xa2f382 (0x7f4558065382 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xa2f421 (0x7f4558065421 in /home/itsme/anaconda3/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #21: __libc_start_main + 0xe7 (0x7f455add0b97 in /lib/x86_64-linux-gnu/libc.so.6)
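
The log above notes that CUDA kernel errors may be reported asynchronously and suggests `CUDA_LAUNCH_BLOCKING=1`. Since the index-out-of-bounds assertions appear before the cuDNN error, the cuDNN error may be a downstream symptom rather than the root cause. A minimal sketch of enabling synchronous launches from Python, assuming the lines are placed at the very top of the training script before anything initializes CUDA:

```python
import os

# Set this before the first CUDA call (safest: before importing torch), so
# kernel launches run synchronously and the Python traceback points at the
# operation that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch and the rest of the training code after this line
```

The same effect can be had from the shell by prefixing the command, for example `CUDA_LAUNCH_BLOCKING=1 python train.py`.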

How can I solve it?

Thanks, Edward Cho.

Ha0Tang commented 2 years ago

Please provide more information on how you train.

edwardcho commented 2 years ago

Hello sir. I tried to train an image-to-image translation model using your code.

My dataset is as follows:

Blurred images vs. clear images: a paired image set, grayscale.
Because I could not prepare labeled images, I treated the blurred images as the labeled images.
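
The `ScatterGatherKernel.cu` assertion in the log ("index out of bounds") is the kind of failure a `scatter_`-based one-hot encoding produces when a label map holds values at or above the number of classes. A grayscale image used as a label map can hold intensities up to 255. A minimal sketch of checking this on the CPU, assuming the model one-hot encodes an (N, 1, H, W) label map with `scatter_`; the function names and `num_classes` are hypothetical and not taken from the repository:

```python
import torch


def out_of_range_labels(label_map, num_classes):
    """Return the sorted label values that fall outside [0, num_classes)."""
    values = torch.unique(label_map.long())
    bad = values[(values < 0) | (values >= num_classes)]
    return bad.tolist()


def one_hot(label_map, num_classes):
    """One-hot encode an (N, 1, H, W) label map after validating its values."""
    bad = out_of_range_labels(label_map, num_classes)
    if bad:
        raise ValueError(
            f"label values {bad[:10]} are outside [0, {num_classes})"
        )
    n, _, h, w = label_map.shape
    encoded = torch.zeros(n, num_classes, h, w)
    # Write 1.0 at the channel index given by each pixel's label.
    return encoded.scatter_(1, label_map.long(), 1.0)


# A grayscale image reused as a label map: intensities, not class indices.
gray_as_label = torch.tensor([[[[0, 128], [255, 3]]]])
print(out_of_range_labels(gray_as_label, num_classes=4))  # [128, 255]
```

Running a check like this on a few samples before moving data to the GPU gives a readable Python error instead of a device-side assert.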

edwardcho commented 2 years ago

If I can't prepare "semantic labeled images", can I not use your code?

Ha0Tang commented 2 years ago

Did you successfully run my code with my dataset?

Ha0Tang commented 2 years ago

You can run the code without using "semantic labeled image".

Ha0Tang commented 2 years ago

How many channels does the blurred image have? It should be 3; if not, you need to change the code.
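
A minimal sketch of one way to give a grayscale batch the 3 channels mentioned above, assuming images are PyTorch tensors shaped (N, C, H, W); `to_three_channels` is a hypothetical helper and not part of the repository:

```python
import torch


def to_three_channels(image):
    """Return an (N, 3, H, W) batch, repeating a single gray channel if needed."""
    if image.dim() != 4:
        raise ValueError(f"expected a 4-D (N, C, H, W) tensor, got {image.dim()}-D")
    channels = image.size(1)
    if channels == 3:
        return image
    if channels == 1:
        # Copy the gray channel into each of the three channels.
        return image.repeat(1, 3, 1, 1)
    raise ValueError(f"unsupported channel count: {channels}")


gray_batch = torch.rand(2, 1, 4, 4)
print(to_three_channels(gray_batch).shape)  # torch.Size([2, 3, 4, 4])
```

The alternative is to change the network's input and output channel counts, which is the code change the comment above refers to.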

davidvfx07 commented 1 year ago

I am having the same issue! When I use the ADE dataset images, it trains with no issues, but when I use my own images, with the same bit depth, it gives me this error!

davidvfx07 commented 1 year ago

I think I figured it out. It seems to be some kind of overload error. Reducing the number of images may help with it. I don't know why a lower batch size still produces the error, though; I have now had to reduce my image count to just 25. This is a breaking problem. @Ha0Tang, please fix this or explain what I might be doing wrong.

Ha0Tang / SelectionGAN, issue #17