Hi, we are using fp16 training with a GradScaler. The GradScaler should take care of NaNs/Infs.
So to answer your question, the NaNs should not affect the training. But let me know if you have issues.
Thank you so much for your reply. I guess it won't be a problem if NaN/Inf grads only show up once every few dozen steps?
Exactly, but make sure you're using the GradScaler. Actually, I had a lot of issues getting fp16 training to be stable, so let me know if you run into any other issues.
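In case it helps, here is a minimal sketch of an fp16 training step with `torch.cuda.amp.GradScaler`; the toy model, data, and hyperparameters are placeholders, not this repo's actual training loop:

```python
import torch

# toy stand-ins; the real model/data come from your own training code
model = torch.nn.Linear(16, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # skips the optimizer step when grads contain NaN/Inf

for _ in range(10):
    batch = torch.randn(8, 16, device="cuda")
    target = torch.randn(8, 1, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():               # run the forward pass in fp16 where safe
        loss = torch.nn.functional.mse_loss(model(batch), target)
    scaler.scale(loss).backward()                 # backward on the scaled loss
    scaler.step(optimizer)                        # step is skipped if inf/NaN grads are found
    scaler.update()                                # adjust the loss scale for the next iteration
```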
I ran into some backward issues when training:
Have you ever run into this problem? I add an epipolar error term by estimating the E matrix during training.
Is this the backward of the E solver?
I haven't tried it myself, but if you save the inputs to the solver you may be able to find the issue.
Maybe you picked the same correspondence twice?
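One way to act on this suggestion (just a sketch, not code from this repo): turn on anomaly detection and dump the solver inputs when they look suspicious, so the failing batch can be replayed offline. The helper `solve_with_dump`, the shapes of `A` and `b`, and the conditioning threshold are all hypothetical:

```python
import torch

torch.autograd.set_detect_anomaly(True)  # report which op produced non-finite grads during backward

def solve_with_dump(A, b, dump_path="bad_solver_inputs.pt"):
    # A: (B, 9, 9) system matrix, b: (B, 9) right-hand side (hypothetical shapes)
    if not torch.isfinite(A).all() or torch.linalg.cond(A).max() > 1e12:
        # save the offending inputs so the failing batch can be reproduced offline
        torch.save({"A": A.detach().cpu(), "b": b.detach().cpu()}, dump_path)
    return torch.linalg.solve(A, b)
```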
I found out the problem is caused by the backward of torch.linalg.solve in the E solver. I added a regularization term to A and that solved the problem. Thank you a lot for your kind reply.
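For anyone who hits the same thing, a minimal sketch of that kind of Tikhonov-style regularization (the epsilon value and the helper name `regularized_solve` are assumptions, not taken from this repo):

```python
import torch

def regularized_solve(A, b, eps=1e-6):
    # Add eps * I to A so the system stays well-conditioned and the backward of
    # torch.linalg.solve does not blow up on (near-)singular matrices.
    eye = torch.eye(A.shape[-1], dtype=A.dtype, device=A.device)
    return torch.linalg.solve(A + eps * eye, b)
```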
I found that when performing backward, there are sometimes warnings like:
Do these NaN or Inf grads have bad effects on training?
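For reference, a quick generic way to see which parameters end up with non-finite grads after backward (a sketch, not part of this repo):

```python
import torch

def report_bad_grads(model):
    # Print every parameter whose gradient contains NaN or Inf after backward()
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite grad in {name}")
```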