Closed: moeinheidari7829 closed this 1 day ago

Hi, I am trying to run nnUNet on a custom dataset with the following command:

    nnUNetv2_train 050 3d_lowres 2

However, I get this error and I do not know where it comes from:

    Traceback (most recent call last):
      File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/bin/nnUNetv2_train", line 8, in <module>
        sys.exit(run_training_entry())
      File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/run/run_training.py", line 247, in run_training_entry
        run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
      File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/run/run_training.py", line 190, in run_training
        nnunet_trainer.run_training()
      File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1210, in run_training
        train_outputs.append(self.train_step(next(self.dataloader_train)))
      File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 856, in train_step
        self.grad_scaler.step(self.optimizer)
      File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/torch/amp/grad_scaler.py", line 453, in step
        retval = self._maybe_opt_step(optimizer, optimizer_state, *args, **kwargs)
      File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/torch/amp/grad_scaler.py", line 350, in _maybe_opt_step
        if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
      File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/lib/python3.8/site-packages/torch/amp/grad_scaler.py", line 350, in <genexpr>
        if not sum(v.item() for v in optimizer_state["found_inf_per_device"].values()):
    RuntimeError: CUDA error: device-side assert triggered
    CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
    For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
    Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Hey @moeinheidari7829
It's hard to tell what's going wrong from this error log. Could you rerun with the following?

    CUDA_LAUNCH_BLOCKING=1 nnUNetv2_train 050 3d_lowres 2
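For context: CUDA_LAUNCH_BLOCKING=1 forces kernel launches to run synchronously, so the stack trace points at the kernel that actually failed rather than at a later synchronization point (here, the .item() call inside the grad scaler). If the trainer is invoked from Python rather than the CLI, the same effect can be had by setting the variable before anything touches CUDA; a minimal sketch:

    import os

    # CUDA_LAUNCH_BLOCKING is read when CUDA initializes, so this must run
    # before the first CUDA call; the CLI prefix above achieves the same thing.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"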
Hi, thank you for your response. This is the error I get now:

    Traceback (most recent call last):
      File "/arc/project/st-ilker-1/moein/moein-envs/nn-env/bin/nnUNetv2_train", line 8, in <module>
    [...]
    Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Hey @moeinheidari7829
There seems to be something going wrong in the Dice loss calculation, specifically when the labels are brought into a one-hot encoding. I suspect something is wrong with the label files. You could try running the network on the CPU instead of the GPU to get a more precise error:

    nnUNetv2_train 050 3d_lowres 2 -device cpu

You could also manually inspect the label files and check whether they match the classes provided in dataset.json. Is fold 0 (or the other folds) working fine, and does the error only appear at fold 2?
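An out-of-range label value is the classic way to hit this particular device-side assert: the one-hot scatter tries to index a class dimension that is too small. A minimal sketch (not nnU-Net code; shapes and values are made up for illustration):

    import torch

    # Hypothetical repro: one label value (5) exceeds the number of classes (3).
    num_classes = 3
    labels = torch.tensor([[0], [1], [5]])  # 5 is out of range for 3 classes

    onehot = torch.zeros(labels.shape[0], num_classes)
    # On CPU this raises a clear "index 5 is out of bounds" RuntimeError;
    # on CUDA the same scatter fails with an opaque device-side assert.
    onehot.scatter_(1, labels, 1)

And a small sketch of the label inspection, assuming a standard nnU-Net v2 raw dataset layout with plain integer labels (the dataset folder name is a placeholder; region-based labels, which use lists of ints in dataset.json, would need flattening first):

    import json
    from pathlib import Path

    import numpy as np
    import SimpleITK as sitk

    raw = Path("nnUNet_raw/Dataset050_Example")  # placeholder path
    with open(raw / "dataset.json") as f:
        expected = {int(v) for v in json.load(f)["labels"].values()}

    # Flag any voxel value in labelsTr that dataset.json does not declare.
    for lbl in sorted((raw / "labelsTr").glob("*.nii.gz")):
        arr = sitk.GetArrayFromImage(sitk.ReadImage(str(lbl)))
        extra = set(np.unique(arr).astype(int).tolist()) - expected
        if extra:
            print(f"{lbl.name}: unexpected label values {sorted(extra)}")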
Closing due to inactivity, feel free to reopen.