MIC-DKFZ / nnUNet

function y_onehot.scatter_(1, gt, 1) and RuntimeError: CUDA error: device-side assert triggered #1435

Closed DSRajesh closed 9 months ago

DSRajesh commented 1 year ago

Hello
I was training a segmentation model on this machine:

NVIDIA GeForce RTX 4090, CUDA Version: 12.1, PyTorch version 2.0.0+cu118.

I encountered the following error log while training (after the preprocessing phase).

The error log contained many lines like the following:

```
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [320,0,0], thread: [77,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [320,0,0], thread: [78,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [354,0,0], thread: [69,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [354,0,0], thread: [70,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [354,0,0], thread: [72,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [354,0,0], thread: [73,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [354,0,0], thread: [74,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [354,0,0], thread: [75,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [69,0,0], thread: [68,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [69,0,0], thread: [69,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
```

and then

```
Traceback (most recent call last):
  File "/home/vayu/.local/bin/nnUNet_train", line 8, in <module>
    sys.exit(main())
  File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/run/run_training.py", line 151, in main
    trainer.run_training()
  File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/network_training/nnUNetTrainerV2.py", line 431, in run_training
    ret = super().run_training()
  File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/network_training/nnUNetTrainer.py", line 316, in run_training
    super(nnUNetTrainer, self).run_training()
  File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/network_training/network_trainer.py", line 446, in run_training
    l = self.run_iteration(self.tr_gen, True)
  File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/network_training/nnUNetTrainerV2.py", line 247, in run_iteration
    l = self.loss(output, target)
  File "/home/vayu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/loss_functions/deep_supervision.py", line 39, in forward
    l = weights[0] * self.loss(x[0], y[0])
  File "/home/vayu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/loss_functions/dice_loss.py", line 350, in forward
    dc_loss = self.dc(net_output, target, loss_mask=mask) if self.weight_dice != 0 else 0
  File "/home/vayu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/loss_functions/dice_loss.py", line 182, in forward
    tp, fp, fn, _ = get_tp_fp_fn_tn(x, y, axes, loss_mask, False)
  File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/loss_functions/dice_loss.py", line 132, in get_tp_fp_fn_tn
    y_onehot.scatter_(1, gt, 1)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

```
Exception in thread Thread-5 (results_loop):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/vayu/.local/lib/python3.10/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
```

```
Exception in thread Thread-4 (results_loop):
Traceback (most recent call last):
  File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/home/vayu/.local/lib/python3.10/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

```
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f28590bb4d7 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f285908536b in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f282ed3fb58 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1250cee (0x7f27c2d2dcee in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x4d59f6 (0x7f28283789f6 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x3ee77 (0x7f28590a0e77 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f285909969e in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f28590997b9 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: + 0x75afa8 (0x7f28285fdfa8 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x305 (0x7f28285fe335 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x1388e1 (0x55f33fbde8e1 in /usr/bin/python3)
frame #11: + 0x1386dc (0x55f33fbde6dc in /usr/bin/python3)
frame #12: + 0x153090 (0x55f33fbf9090 in /usr/bin/python3)
frame #13: + 0x166918 (0x55f33fc0c918 in /usr/bin/python3)
frame #14: + 0x166945 (0x55f33fc0c945 in /usr/bin/python3)
frame #15: + 0x166945 (0x55f33fc0c945 in /usr/bin/python3)
frame #16: + 0x166945 (0x55f33fc0c945 in /usr/bin/python3)
frame #17: + 0x166945 (0x55f33fc0c945 in /usr/bin/python3)
frame #18: + 0x166945 (0x55f33fc0c945 in /usr/bin/python3)
frame #19: + 0x12a40f (0x55f33fbd040f in /usr/bin/python3)
frame #20: PyDict_SetItemString + 0xa3 (0x55f33fbd44f3 in /usr/bin/python3)
frame #21: + 0x268ba7 (0x55f33fd0eba7 in /usr/bin/python3)
frame #22: Py_FinalizeEx + 0x176 (0x55f33fd0b4f6 in /usr/bin/python3)
frame #23: Py_RunMain + 0x173 (0x55f33fcfb193 in /usr/bin/python3)
frame #24: Py_BytesMain + 0x2d (0x55f33fcd132d in /usr/bin/python3)
frame #25: + 0x29d90 (0x7f2859ac0d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: __libc_start_main + 0x80 (0x7f2859ac0e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: _start + 0x25 (0x55f33fcd1225 in /usr/bin/python3)
```
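
From the assertion text, it looks like the `scatter_` call in `get_tp_fp_fn_tn` is being given label indices outside the `[0, num_classes - 1]` range that the one-hot tensor was allocated for. As a minimal sketch (not a confirmed cause, just the first check I would run), something like the following over the raw label files shows whether any case contains unexpected label values; the task folder path and `num_classes` are placeholders for your own dataset:

```python
import numpy as np
import SimpleITK as sitk
from pathlib import Path

# Hypothetical paths and class count -- replace with your own task folder / dataset.json.
labels_dir = Path("/path/to/nnUNet_raw_data/Task501_MyTask/labelsTr")
num_classes = 3  # background + foreground classes as defined in dataset.json

for label_file in sorted(labels_dir.glob("*.nii.gz")):
    seg = sitk.GetArrayFromImage(sitk.ReadImage(str(label_file)))
    values = np.unique(seg)
    # scatter_(1, gt, 1) asserts on the GPU when any value is < 0 or >= num_classes
    bad = values[(values < 0) | (values >= num_classes)]
    if bad.size > 0:
        print(f"{label_file.name}: unexpected label values {bad.tolist()}")
```

Running the training with `CUDA_LAUNCH_BLOCKING=1` also makes the device-side assert surface at the offending call instead of at some later, unrelated line.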

I saw a similar issue being discussed in https://github.com/MIC-DKFZ/nnUNet/issues/1419

If anyone could help with this issue, it would be great.

Rajesh

xwjBupt commented 1 year ago

Hi, did you solve your problem?

DSRajesh commented 1 year ago

No. I have temporarily shifted the work to an RTX 3090 machine.

shenbw99 commented 1 year ago

> No. I have temporarily shifted the work to an RTX 3090 machine.

Hi, did shifting the work to the 3090 solve it?

DSRajesh commented 1 year ago

Hi, it is now working on both systems. We started training with small datasets and gradually increased their size; the problem never came back. We also re-ran some old trainings that we had done in the past, and they worked. In addition, we found a geometry mismatch between some training images and their labels.
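
For reference, here is a minimal sketch of the kind of image/label geometry check meant above, assuming NIfTI data in the standard nnU-Net imagesTr/labelsTr layout; the task folder and the `_0000` modality suffix are placeholders to adapt to your own dataset:

```python
import SimpleITK as sitk
from pathlib import Path

# Hypothetical task folder -- adjust to your own raw data layout.
task_dir = Path("/path/to/nnUNet_raw_data/Task501_MyTask")
images_dir = task_dir / "imagesTr"
labels_dir = task_dir / "labelsTr"

for label_file in sorted(labels_dir.glob("*.nii.gz")):
    case_id = label_file.name[: -len(".nii.gz")]
    image_file = images_dir / f"{case_id}_0000.nii.gz"  # first modality
    img = sitk.ReadImage(str(image_file))
    seg = sitk.ReadImage(str(label_file))
    # exact comparison is fine for size; spacing/origin/direction may need a tolerance
    for name in ("GetSize", "GetSpacing", "GetOrigin", "GetDirection"):
        if getattr(img, name)() != getattr(seg, name)():
            print(f"{case_id}: {name[3:]} mismatch: "
                  f"image {getattr(img, name)()} vs label {getattr(seg, name)()}")
```

Any case flagged by a check like this is worth fixing before re-running the nnU-Net preprocessing.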

moeinheidari7829 commented 6 months ago

> Hi, it is now working on both systems. We started training with small datasets and gradually increased their size; the problem never came back. We also re-ran some old trainings that we had done in the past, and they worked. In addition, we found a geometry mismatch between some training images and their labels.

Hi, could you please clarify how you ran the smaller trainings with nnUNet, and how you found the geometry problems in the data? I am facing similar issues with my data and don't know where the problem is!