DSRajesh closed this issue 9 months ago
Hi, did you solve your problem?
No. We temporarily shifted the work to a 3090 machine.
Hi, did shifting the work to the 3090 help?
Hi, it is now working on both systems. We started training with small datasets and gradually increased the size, and the problem never recurred. We also re-ran some old trainings that had worked in the past, and they still worked. In addition, we found a geometry mismatch between some of the training images and their labels.
Hi, could you please clarify how you ran the training on the smaller datasets with nnUNet, and how you found the geometry problems in the data? I am facing similar issues with my data and don't know where the problem is.
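One way to check for the kind of geometry mismatch mentioned above is to compare each label map's size, spacing, origin, and direction against its image. Below is a minimal sketch, not necessarily what was done here, assuming SimpleITK, NIfTI files, and the standard nnU-Net raw-data layout; the task name and paths are placeholders:

```python
# Sketch: compare the geometry of each training image with its label map.
# Assumes SimpleITK, NIfTI files, and the nnU-Net raw-data layout; the task
# name and paths are placeholders, not taken from the original report.
import os
import SimpleITK as sitk

images_dir = "nnUNet_raw_data/Task501_Example/imagesTr"  # placeholder path
labels_dir = "nnUNet_raw_data/Task501_Example/labelsTr"  # placeholder path

for fname in sorted(os.listdir(labels_dir)):
    if not fname.endswith(".nii.gz"):
        continue
    case_id = fname[: -len(".nii.gz")]
    image = sitk.ReadImage(os.path.join(images_dir, case_id + "_0000.nii.gz"))
    label = sitk.ReadImage(os.path.join(labels_dir, fname))
    # Compare size, spacing, origin and direction; any mismatch here will
    # cause trouble once images and labels are cropped/resampled together.
    for getter in ("GetSize", "GetSpacing", "GetOrigin", "GetDirection"):
        img_val = getattr(image, getter)()
        lbl_val = getattr(label, getter)()
        if img_val != lbl_val:
            print(f"{case_id}: {getter[3:]} mismatch image={img_val} label={lbl_val}")
```

Exact float comparison of origin and direction can be overly strict for real data; a small tolerance (e.g. numpy.allclose) may be preferable.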
Hello
I was training a segmentation model on this machine:
NVIDIA GeForce RTX 4090, CUDA Version: 12.1, PyTorch version 2.0.0+cu118.
I encountered the following error while training (after the preprocessing phase).
The error log contained many lines like the following:
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [320,0,0], thread: [77,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [320,0,0], thread: [78,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [354,0,0], thread: [69,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [354,0,0], thread: [70,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [354,0,0], thread: [72,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [354,0,0], thread: [73,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [354,0,0], thread: [74,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [354,0,0], thread: [75,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [69,0,0], thread: [68,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:365: operator(): block: [69,0,0], thread: [69,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.

and then:
Traceback (most recent call last):
File "/home/vayu/.local/bin/nnUNet_train", line 8, in <module>
sys.exit(main())
File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/run/run_training.py", line 151, in main
trainer.run_training()
File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/network_training/nnUNetTrainerV2.py", line 431, in run_training
ret = super().run_training()
File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/network_training/nnUNetTrainer.py", line 316, in run_training
super(nnUNetTrainer, self).run_training()
File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/network_training/network_trainer.py", line 446, in run_training
l = self.run_iteration(self.tr_gen, True)
File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/network_training/nnUNetTrainerV2.py", line 247, in run_iteration
l = self.loss(output, target)
File "/home/vayu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/loss_functions/deep_supervision.py", line 39, in forward
l = weights[0] * self.loss(x[0], y[0])
File "/home/vayu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/loss_functions/dice_loss.py", line 350, in forward
dc_loss = self.dc(net_output, target, loss_mask=mask) if self.weight_dice != 0 else 0
File "/home/vayu/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/loss_functions/diceloss.py", line 182, in forward
tp, fp, fn, _ = get_tp_fp_fn_tn(x, y, axes, loss_mask, False)
File "/home/vayu/.local/lib/python3.10/site-packages/nnunet/training/loss_functions/dice_loss.py", line 132, in get_tp_fp_fn_tn
y_onehot.scatter_(1, gt, 1)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception in thread Thread-5 (results_loop):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/vayu/.local/lib/python3.10/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-4 (results_loop):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/vayu/.local/lib/python3.10/site-packages/batchgenerators/dataloading/multi_threaded_augmenter.py", line 92, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the print"
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f28590bb4d7 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f285908536b in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f282ed3fb58 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: + 0x1250cee (0x7f27c2d2dcee in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0x4d59f6 (0x7f28283789f6 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #5: + 0x3ee77 (0x7f28590a0e77 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x1be (0x7f285909969e in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #7: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f28590997b9 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #8: + 0x75afa8 (0x7f28285fdfa8 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #9: THPVariable_subclass_dealloc(_object*) + 0x305 (0x7f28285fe335 in /home/vayu/.local/lib/python3.10/site-packages/torch/lib/libtorch_python.so)
frame #10: + 0x1388e1 (0x55f33fbde8e1 in /usr/bin/python3)
frame #11: + 0x1386dc (0x55f33fbde6dc in /usr/bin/python3)
frame #12: + 0x153090 (0x55f33fbf9090 in /usr/bin/python3)
frame #13: + 0x166918 (0x55f33fc0c918 in /usr/bin/python3)
frame #14: + 0x166945 (0x55f33fc0c945 in /usr/bin/python3)
frame #15: + 0x166945 (0x55f33fc0c945 in /usr/bin/python3)
frame #16: + 0x166945 (0x55f33fc0c945 in /usr/bin/python3)
frame #17: + 0x166945 (0x55f33fc0c945 in /usr/bin/python3)
frame #18: + 0x166945 (0x55f33fc0c945 in /usr/bin/python3)
frame #19: + 0x12a40f (0x55f33fbd040f in /usr/bin/python3)
frame #20: PyDict_SetItemString + 0xa3 (0x55f33fbd44f3 in /usr/bin/python3)
frame #21: + 0x268ba7 (0x55f33fd0eba7 in /usr/bin/python3)
frame #22: Py_FinalizeEx + 0x176 (0x55f33fd0b4f6 in /usr/bin/python3)
frame #23: Py_RunMain + 0x173 (0x55f33fcfb193 in /usr/bin/python3)
frame #24: Py_BytesMain + 0x2d (0x55f33fcd132d in /usr/bin/python3)
frame #25: + 0x29d90 (0x7f2859ac0d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #26: __libc_start_main + 0x80 (0x7f2859ac0e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #27: _start + 0x25 (0x55f33fcd1225 in /usr/bin/python3)
I saw a similar issue being discussed in https://github.com/MIC-DKFZ/nnUNet/issues/1419.
If anyone could help with this issue, it would be great.
Rajesh
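For reference, the assertion in ScatterGatherKernel.cu above is triggered by y_onehot.scatter_(1, gt, 1) in get_tp_fp_fn_tn, which one-hot encodes the ground truth. It typically fires when a label map contains integer values outside the expected class range (negative values, or values greater than or equal to the number of classes). Below is a minimal sketch for scanning label files for such values, assuming SimpleITK and NIfTI labels; the path and the expected class set are placeholders and should match what dataset.json declares:

```python
# Sketch: scan nnU-Net label maps for values outside the expected class range.
# Assumes SimpleITK and NIfTI labels; the path and expected class set are
# placeholders and should be adjusted to the dataset at hand.
import os
import numpy as np
import SimpleITK as sitk

labels_dir = "nnUNet_raw_data/Task501_Example/labelsTr"  # placeholder path
expected_labels = {0, 1, 2}  # placeholder: background plus two foreground classes

for fname in sorted(os.listdir(labels_dir)):
    if not fname.endswith(".nii.gz"):
        continue
    arr = sitk.GetArrayFromImage(sitk.ReadImage(os.path.join(labels_dir, fname)))
    found = set(int(v) for v in np.unique(arr))
    unexpected = found - expected_labels
    if unexpected:
        print(f"{fname}: unexpected label values {sorted(unexpected)}")
```

If any file reports unexpected values, fixing or re-exporting that label map (and re-running preprocessing) should make the device-side assert go away.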