anirbanl / sparsegen

Code for the NeurIPS 2018 paper "On Controllable Sparse Alternatives to Softmax"
22 stars · 5 forks

Using sparsegen for attention probabilities #1

Open Labaien96 opened 2 years ago

Labaien96 commented 2 years ago

Hi! I'm trying to use these sparse functions as an alternative to the softmax function in the attention mechanisms of Transformers. However, the loss becomes NaN in the first iteration... Do you know what the reason could be?

Thanks in advance, Jokin.
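
(Editor's note: the thread does not confirm the cause of the NaN, but one mathematical property worth checking is that sparsemax, the special case of sparsegen-lin with no regularization, produces *exact* zeros in its output. Any downstream computation that takes `log` of those attention weights will yield `-inf`/NaN. Below is a minimal NumPy sketch of sparsemax illustrating the exact zeros; the function name and usage are illustrative, not from this repository.)

```python
import numpy as np

def sparsemax(z):
    """Project logits z onto the probability simplex (Martins & Astudillo, 2016).

    Unlike softmax, the result can contain exact zeros, which makes
    log(attention_weight) undefined for the zeroed entries.
    """
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                 # sort descending
    k = np.arange(1, z.size + 1)
    cumsum = np.cumsum(z_sorted)
    support = 1 + k * z_sorted > cumsum         # which entries stay nonzero
    k_max = k[support][-1]                      # size of the support set
    tau = (cumsum[k_max - 1] - 1.0) / k_max     # threshold
    return np.maximum(z - tau, 0.0)

# Example: only the largest logit survives, the rest are exactly 0.
weights = sparsemax([2.0, 1.0, 0.1])            # -> [1.0, 0.0, 0.0]
```

A quick sanity check when debugging NaNs is to log the minimum attention weight per head: if it is exactly 0 and the loss or an intermediate layer applies `log` (or divides by it), that is a likely source.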

maxxu05 commented 3 months ago

I had the same issue.