nonick2k23 opened this issue 3 weeks ago
You have fallback code that retrains an epoch if any batch fails due to NaNs in the PSNR.
But it still doesn't work properly (setting aside that this fallback is not an optimal fix).
This is what I receive during the synthetic training phase:
[train: 2, 550 / 1000] FPS: 23.2 (23.5) , Loss/total: nan , Loss/rgb: nan , Loss/raw/rgb: nan , Stat/psnr: 31.68984
Note that the PSNR is still finite (31.69) while the loss is NaN, so the batch bypasses your PSNR-based fallback and training cannot proceed.
Could you have a look or provide a solution for this issue?
Thank you

Hi, you're right, the fallback is useless in this scenario; you can disable it. I've noticed this issue occasionally in my experiments, and I suspect it's caused by unstable SpyNet predictions in the first epochs. Did you modify any parameters in the training script, or the synthetic generation code? Using a smaller initial learning rate might help resolve the problem.
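For reference, here is a minimal sketch of the kind of change I mean, assuming a standard PyTorch training loop. The names `model`, `train_loader`, `lq`/`gt`, and the `spynet` submodule are placeholders, not the actual code from this repository. It guards on the loss itself (your log shows a finite PSNR next to a NaN loss, so a PSNR check never fires) and uses a smaller learning rate for the flow network:

```python
import torch

# Minimal sketch, NOT the actual training script from this repository.
# Assumes `model` (with a `spynet` flow submodule), `train_loader` yielding
# (lq, gt) pairs -- all placeholder names.
criterion = torch.nn.L1Loss()

# Smaller initial learning rates, with an even smaller one for SpyNet,
# whose flow predictions are unstable in the first epochs.
optimizer = torch.optim.Adam([
    {"params": model.spynet.parameters(), "lr": 2.5e-5},
    {"params": [p for n, p in model.named_parameters()
                if not n.startswith("spynet")], "lr": 1e-4},
])

for lq, gt in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(lq), gt)

    # Guard on the loss, not the PSNR: a NaN/Inf loss means the batch is
    # corrupted even when the reported PSNR is still finite.
    if not torch.isfinite(loss):
        print("non-finite loss, skipping batch")
        continue

    loss.backward()
    # Clip gradients so one bad flow estimate cannot blow up the weights.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
```

Skipping the occasional non-finite batch loses only that one optimizer step, which is a gentler recovery than retraining the whole epoch.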