MIC-DKFZ / nnUNet

Apache License 2.0
5.59k stars 1.71k forks

CUDA error #2145

Closed moeinheidari7829 closed 1 day ago

moeinheidari7829 commented 4 months ago

Hi, I am trying to run nnUNet on a custom dataset with the following command:

nnUNetv2_train 050 3d_lowres 2

However, I get this error and I do not know where it comes from:

Traceback (most recent call last):
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/run/run_training.py", line 247, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/run/run_training.py", line 190, in run_training
    nnunet_trainer.run_training()
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1210, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 856, in train_step
    self.grad_scaler.step(self.optimizer)
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/torch/amp/grad_scaler.py", line 453, in step
    retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/torch/amp/grad_scaler.py", line 350, in _maybe_opt_step
    if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/torch/amp/grad_scaler.py", line 350, in <genexpr>
    if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

mrokuss commented 4 months ago

Hey @moeinheidari7829

It's hard to tell what's going wrong from this error log. Could you rerun with

CUDA_LAUNCH_BLOCKING=1 nnUNetv2_train 050 3d_lowres 2

moeinheidari7829 commented 4 months ago

Hi, thank you for your response. This is the error I get now:

Traceback (most recent call last):
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/run/run_training.py", line 247, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/run/run_training.py", line 190, in run_training
    nnunet_trainer.run_training()
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1210, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 850, in train_step
    l = self.loss(output, target)
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/training/loss/deep_supervision.py", line 30, in forward
    l = weights[0] * self.loss(*[j[0] for j in args])
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/training/loss/compound_losses.py", line 51, in forward
    dc_loss = self.dc(net_output, target_dice, loss_mask=mask) \
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/training/loss/dice.py", line 95, in forward
    y_onehot.scatter_(1, gt, 1)
RuntimeError: CUDA error: device-side assert triggered
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

mrokuss commented 4 months ago

Hey @moeinheidari7829

Something seems to be going wrong in the Dice loss calculation, specifically where the labels are converted to a one-hot encoding. I suspect something is wrong with the label files, for example a label value that is not among the declared classes. You could try running the network on CPU instead of GPU to get a more precise error, and also manually inspect the label files to check whether they match the classes provided in the dataset.json. Are fold 0 and the other folds working fine, with the error appearing only at fold 2?

nnUNetv2_train 050 3d_lowres 2 -device cpu
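
For the manual label check, a minimal sketch could look like the following. It is not part of nnU-Net: the helper names declared_label_values and find_unexpected_labels are hypothetical, the example dataset.json content is made up, and it assumes the "labels" entry maps class names to integers (or to lists of integers for region-based setups). Loading the voxel values from a label file is left out; the comment only hints at one possible way using SimpleITK and numpy.

```python
import json


def declared_label_values(dataset_json):
    """Collect every integer label value declared under "labels" in dataset.json."""
    values = set()
    for value in dataset_json["labels"].values():
        # Assumption: region-based setups may list several integers per entry.
        if isinstance(value, (list, tuple)):
            values.update(int(v) for v in value)
        else:
            values.add(int(value))
    return values


def find_unexpected_labels(observed_values, dataset_json):
    """Return the sorted label values found in a segmentation but not declared."""
    observed = {int(v) for v in observed_values}
    return sorted(observed - declared_label_values(dataset_json))


if __name__ == "__main__":
    # Hypothetical dataset.json content with two classes.
    dataset_json = json.loads('{"labels": {"background": 0, "lesion": 1}}')
    # Stand-in for the unique voxel values of one label file, e.g. obtained via
    # numpy.unique(SimpleITK.GetArrayFromImage(SimpleITK.ReadImage(path))).
    observed = [0, 1, 255]
    print(find_unexpected_labels(observed, dataset_json))
```

Any value this reports for a training case would be a candidate cause of the assert, since the one-hot step indexes the class dimension by label value.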

mrokuss commented 1 day ago

Closing due to inactivity, feel free to reopen.