Labaien96 opened this issue 2 years ago
Hi! I'm trying to use these sparse functions as an alternative to the softmax function in the attention mechanism of a transformer. However, the loss becomes NaN in the first iteration. Do you know what the reason might be?
Thanks in advance, Jokin.
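For context, a minimal sketch of the kind of substitution described (this assumes the `entmax` package and a standard scaled dot-product attention; the `sparse_attention` helper, tensor shapes, and masking value are illustrative, not the poster's actual code). A common NaN source in this setup is masking scores with `float('-inf')`, which softmax tolerates but which can leave a sparse mapping with an undefined normalizer on fully masked rows:

```python
import torch
from entmax import entmax15  # pip install entmax

def sparse_attention(q, k, v, mask=None):
    # Scaled dot-product scores, shape (batch, heads, seq, seq).
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    if mask is not None:
        # Using float('-inf') here is a frequent NaN source with sparse
        # mappings; a large finite negative value keeps the support
        # computation well-defined while still zeroing masked positions.
        scores = scores.masked_fill(mask == 0, -1e4)
    attn = entmax15(scores, dim=-1)  # drop-in replacement for softmax
    return attn @ v
```

If the loss is NaN on the very first iteration, it may be worth checking the masked score tensor for `-inf` or `nan` entries before the sparse mapping is applied.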
I had the same issue.