KaiyangZhou / CoOp

Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)

FloatingPointError: Loss is infinite or NaN #73

Open LvXiangzhu opened 7 months ago

LvXiangzhu commented 7 months ago

Thank you for your nice work! However, I encountered some issues when I tried to run it. I haven't been able to solve this error for a long time, so I have to ask for your help.

When execution enters the "TextEncoder" of "CustomCLIP" for the second time, it raises this error: "FloatingPointError: Loss is infinite or NaN!"

I debugged this error and found that the problem is inside the TextEncoder's transformer network: before the input enters the first LayerNorm, there are no NaNs, but after the LayerNorm the output contains NaNs.
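For anyone debugging the same thing, a check along these lines can localize where non-finite values first appear (a minimal sketch; `model.text_encoder` is a placeholder for the actual CustomCLIP module name):

```python
import torch
from torch import nn

def report_nonfinite(name):
    # Report any module whose output contains NaN or Inf values.
    def hook(module, inputs, output):
        if not torch.isfinite(output).all():
            print(f"non-finite values appear after: {name}")
    return hook

# `model.text_encoder` is a placeholder for the CustomCLIP text encoder.
for name, module in model.text_encoder.named_modules():
    if isinstance(module, nn.LayerNorm):
        module.register_forward_hook(report_nonfinite(name))
```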

I have searched for solutions to this error. Some say that float16 may not be precise enough, causing overflow, and that the input needs to be converted to float32. But your code already handles this:

```python
class LayerNorm(nn.LayerNorm):
    def forward(self, x: torch.Tensor):
        orig_type = x.dtype
        ret = super().forward(x.type(torch.float32))
        return ret.type(orig_type)
```

So the input is already converted to float32 before the LayerNorm runs.

In addition, the input values are not large. They are all on the order of 1e-2.
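For reference, a quick check like the one below covers both the input statistics and the LayerNorm's own affine parameters, since parameters corrupted by an earlier bad update would also produce NaNs on the next forward pass (a sketch; the module path is an assumption and may differ in the actual code):

```python
# x: the float32 tensor entering the first LayerNorm (placeholder name)
print(x.dtype, x.abs().max().item())  # magnitude of the input
print("input finite:", torch.isfinite(x).all().item())

# Assumed path to the first LayerNorm of CLIP's text transformer;
# adjust to the actual attribute names in your model.
ln = model.text_encoder.transformer.resblocks[0].ln_1
print("weight finite:", torch.isfinite(ln.weight).all().item())
print("bias finite:", torch.isfinite(ln.bias).all().item())
```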

So does anyone know why this error occurs?

Lilzhuzixi commented 4 months ago

Hello friend! Did you work out this error? Can you give me some advice?