Open yuanlonghui opened 2 years ago
This is the simplest way to prevent computing log(0), and it is necessary when the embedding dimension is large. With high-dimensional feature representations, the largest inner product in each row is very likely the inner product of a feature with itself. After subtracting that row maximum, the off-diagonal logits become strongly negative, so after exp() they are very likely to underflow to zero everywhere except the diagonal. Since the diagonal is excluded from the sum inside the log(), the denominator can become exactly zero, log(0) is computed, and the loss turns into NaN.
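For reference, here is a minimal sketch of where the log(0) comes from and one way to guard against it. This assumes a SupCon-style loss in PyTorch; the names `features` and `temperature` and the epsilon value are illustrative, not the repo's exact code:

```python
import torch

def supcon_log_prob(features, temperature=0.07):
    # features: (batch, dim), assumed L2-normalized
    logits = features @ features.T / temperature

    # Numerical stability: subtract the per-row max before exp().
    # In high dimensions the row max is usually the diagonal
    # (self-similarity), so off-diagonal logits become very negative.
    logits_max, _ = logits.max(dim=1, keepdim=True)
    logits = logits - logits_max.detach()

    # Mask out the diagonal so an anchor is not its own negative.
    batch = features.size(0)
    self_mask = torch.eye(batch, dtype=torch.bool, device=features.device)
    exp_logits = torch.exp(logits).masked_fill(self_mask, 0.0)

    # Without a guard, exp_logits.sum(1) can underflow to 0 when all
    # off-diagonal logits are very negative, making log(0) = -inf -> NaN.
    denom = exp_logits.sum(dim=1, keepdim=True)
    log_prob = logits - torch.log(denom + 1e-12)  # epsilon guards log(0)
    return log_prob
```

The epsilon inside the log is the simple fix discussed here; it leaves the result essentially unchanged whenever the denominator is nonzero.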
Thank you very much for solving the NaN loss problem. Does your loss keep increasing as you train? I look forward to your reply!
Did you solve this?