Open DmitriiSavin opened 3 hours ago
It looks pretty similar to the following issue: https://github.com/Peterande/D-FINE/issues/3
You can try resuming from NaN.pth in debugging mode. Random numerical overflows under AMP are often unavoidable and can occur anywhere. You need to locate the place where the overflow occurs and then clamp the value to the representable range.
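A minimal sketch of what "locate and clamp" can look like in practice, assuming a standard PyTorch AMP setup (the helper names `clamp_for_amp` and `add_nan_hooks` are illustrative, not part of D-FINE): float16 can only represent magnitudes up to 65504, so any intermediate that exceeds that overflows to `inf` under autocast. Forward hooks can help find the first module that produces a non-finite output, and a clamp at that spot keeps the value in range:

```python
import torch
import torch.nn as nn

# float16's largest finite value; anything beyond it overflows to +/-inf.
FP16_MAX = 65504.0

def clamp_for_amp(x: torch.Tensor, limit: float = FP16_MAX) -> torch.Tensor:
    """Clamp a tensor so its values stay representable in float16."""
    return torch.clamp(x, min=-limit, max=limit)

def add_nan_hooks(model: nn.Module) -> None:
    """Raise as soon as any module emits a non-finite tensor (to locate the overflow)."""
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            raise RuntimeError(f"non-finite output in {module.__class__.__name__}")
    for m in model.modules():
        m.register_forward_hook(hook)

# Demo: 1e5 > 65504, so casting to half overflows; clamping first keeps it finite.
act = torch.tensor([1e5, -2e5, 3.0])
overflowed = act.half()                      # becomes [inf, -inf, 3.0]
safe = clamp_for_amp(act).half()             # stays finite after the cast
print(torch.isfinite(overflowed).all().item(), torch.isfinite(safe).all().item())
```

Once the hooks point at the offending layer, inserting `clamp_for_amp` on that layer's output (or on the loss term that blows up) is usually enough to get past the random overflow and continue training.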
Hi! Thank you for your work on this project!
I'm training the S model on a custom dataset, and I've run into an issue after several successful epochs. Up until the 12th epoch, training progresses well: the loss decreases steadily and the validation metrics improve. Starting from the 12th epoch, however, the model begins producing NaN predictions and the following code is triggered:
Here is what the CLI output looks like:
Here is the config:
Do you have any insights on why this might be happening?