xixixihean opened 8 months ago
I got the same issue. For me it always happens around epoch 120.
Here is my stacktrace:
"/dkfz/cluster/gpu/data/OE0441/t006d/Code/transunet3d/nn_transunet/trainer/nnUNetTrainerV2_DDP.py", line 1039, in run_training
l = self.run_iteration(self.tr_gen, True)
File "/dkfz/cluster/gpu/data/OE0441/t006d/Code/transunet3d/nn_transunet/trainer/nnUNetTrainerV2_DDP.py", line 552, in run_iteration
l = self.compute_loss(output, target, is_max, is_c2f, self.args.is_sigmoid, is_max_hungarian, is_max_ds, point_rend, num_point_rend, no_object_weight)
File "/dkfz/cluster/gpu/data/OE0441/t006d/Code/transunet3d/nn_transunet/trainer/nnUNetTrainerV2_DDP.py", line 658, in compute_loss
output_act = output_ds[i].sigmoid() if is_sigmoid else softmax_helper(output_ds[i]) # bug occurs here..
(Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:113.)
In nnUNetTrainerV2_DDP.py, inside the with autocast(enabled=False): block, the line output_act = output_ds[i].sigmoid() if is_sigmoid else softmax_helper(output_ds[i]) raises the error: RuntimeError: Function 'SigmoidBackward0' returned nan values in its 0th output. Could you tell me how to solve it, please? Thank you.
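For anyone debugging the same error, one common thing to try at this line is to compute the activation in float32 on clamped logits, and to check first whether the logits are already NaN before the activation, which would mean the divergence happens earlier in training (e.g. the loss blew up) and the activation is not the real culprit. The sketch below is only illustrative, not a confirmed fix for this repository: the name safe_activation and the clamp value are my own choices, and F.softmax(x, dim=1) stands in for nnUNet's softmax_helper.

```python
import torch
import torch.nn.functional as F


def safe_activation(logits: torch.Tensor, is_sigmoid: bool,
                    clamp_val: float = 50.0) -> torch.Tensor:
    """Apply sigmoid/softmax in float32 on clamped logits (illustrative helper).

    Casting to float32 and clamping extreme logits avoids fp16 overflow (inf)
    in the activation, which can otherwise show up as NaN in the backward pass.
    If the logits are already NaN at this point, the network itself diverged
    earlier and clamping here will not help.
    """
    if torch.isnan(logits).any():
        raise RuntimeError(
            "logits already contain NaN before the activation; "
            "the divergence happens earlier in the forward/backward pass"
        )
    logits = logits.float().clamp(min=-clamp_val, max=clamp_val)
    # dim=1 is the channel dimension, matching a per-voxel softmax over classes.
    return logits.sigmoid() if is_sigmoid else F.softmax(logits, dim=1)
```

If the NaN check above fires, the usual suspects are an exploding loss or too-high learning rate rather than the sigmoid/softmax itself.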