Open ramonhollands opened 1 year ago
this error occurs when
for name, param in self.model.named_parameters():
    if param.grad is not None and torch.isnan(param.grad).any():
        print("nan gradient name:", name)
If that still does not solve it and your custom dataset is shareable, you can send it to my email (bitshliu@qq.com) for testing, and I will try to reply before the end of this week.
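For anyone who wants to try the gradient check above outside the training script, here is a minimal standalone sketch. The `find_nan_gradients` helper and the toy `nn.Linear` model are hypothetical stand-ins, not part of this repo; the NaN is injected on purpose just to show what a failing step looks like.

```python
import torch
import torch.nn as nn


def find_nan_gradients(model: nn.Module) -> list:
    """Return the names of parameters whose gradient contains a NaN."""
    bad = []
    for name, param in model.named_parameters():
        if param.grad is not None and torch.isnan(param.grad).any():
            bad.append(name)
    return bad


# Toy stand-in model (hypothetical; not the repository's network).
model = nn.Linear(4, 2)
x = torch.randn(8, 4)

# A normal backward pass: no NaN gradients expected.
loss = model(x).sum()
loss.backward()
healthy = find_nan_gradients(model)

# Poison the loss with a NaN so that every gradient becomes NaN.
model.zero_grad()
bad_loss = model(x).sum() * float("nan")
bad_loss.backward()
broken = find_nan_gradients(model)

print("nan gradient names (healthy step):", healthy)
print("nan gradient names (poisoned step):", broken)
```

In a real run the check would go right after `loss.backward()` and before the optimizer step, so the first layer that reports a NaN gives a starting point for the bug hunt.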
Thanks for your fast reply.
It looks like it's training correctly after changing the fp16 setting back to false. Could that be the cause? The first 30 epochs have succeeded so far. I'll keep you posted.
What type is your GPU? I guess that your GPU might not fully support AMP (automatic mixed precision) training. I tested training with fp16 enabled and it works well on the RTX 30 series and Tesla T4.
It's an RTX 3090, running CUDA 11.4. I am doing some experiments these days and will let you know. If you are interested in this dataset, I can share it with you.
Looking forward to eventually contributing to this repo.
Oh, it might be because of the CUDA version. I'm using CUDA 11.1 with an RTX 3090 on Ubuntu 18.04. I'd also like to know more about your dataset and your experiments.
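To compare setups like the two above more easily, a short script can print the relevant versions. This is only a sketch: `describe_environment` is a hypothetical helper name, and note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the toolkit or driver installed on the machine.

```python
import torch


def describe_environment() -> dict:
    """Collect the PyTorch / CUDA / GPU details worth posting in a bug report."""
    info = {
        "torch": torch.__version__,
        "cuda_build": torch.version.cuda,  # None on CPU-only builds
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        # Device 0 is assumed here; adjust the index on multi-GPU machines.
        info["gpu"] = torch.cuda.get_device_name(0)
        info["compute_capability"] = torch.cuda.get_device_capability(0)
    return info


env = describe_environment()
for key, value in env.items():
    print(f"{key}: {value}")
```

Posting this output next to the fp16 setting makes it easier to see whether the NaN problem follows the CUDA version, the GPU, or the dataset.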
I have the same problem after setting fp16 to true.
Hi Shihan Liu,
Thanks for your work on this repo!
I'm trying to run a custom training myself, using your train script and the YOLO formats. The dataset seems to be fine, judging by demo/load_dataset.py. Any ideas where I can start my bug hunt, given the above error?
Best regards, Ramon