I found that "param_norm" grows very quickly when training in "ve" mode. It easily overflows when using mixed precision and a small batch size. I wonder if the authors have encountered this issue and how to solve it.
I have not encountered this. But it can be related to time sampling that is too close to the boundaries, e.g. 0 or T, since there are singularities at these points that can cause numerical errors.
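For illustration, one way to keep sampled times away from those singular endpoints is to draw them from [eps, T - eps] instead of [0, T]. This is a minimal PyTorch sketch under that assumption; the function name and the margin value `eps=1e-5` are illustrative choices, not values from the repo:

```python
import torch

def sample_t(batch_size: int, T: float = 1.0, eps: float = 1e-5,
             device: str = "cpu") -> torch.Tensor:
    # Sample continuous times uniformly in [eps, T - eps] rather than [0, T],
    # keeping a margin away from the boundary singularities that can blow up
    # the loss (and hence parameter norms) under mixed precision.
    return eps + (T - 2 * eps) * torch.rand(batch_size, device=device)
```

The margin may need tuning per noise schedule; too large a margin biases training, while too small a margin reintroduces the numerical blow-up near the boundaries.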