Labaien96 opened this issue 2 years ago
Hi! I'm trying to use these sparse functions as an alternative to the softmax function in the attention mechanism of a transformer. However, the loss becomes NaN in the first iteration. Do you know what the reason might be?
Thanks in advance, Jokin.
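For context, a minimal sketch of the kind of substitution described (this assumes the `entmax` package and a standard scaled dot-product attention; the `sparse_attention` helper, tensor shapes, and masking value are illustrative, not the poster's actual code). A common NaN source in this setup is masking scores with `float('-inf')`, which softmax tolerates but which can leave a sparse mapping with an undefined normalizer on fully masked rows:

```python
import torch
from entmax import entmax15  # pip install entmax

def sparse_attention(q, k, v, mask=None):
    # Scaled dot-product scores, shape (batch, heads, seq, seq).
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    if mask is not None:
        # Using float('-inf') here is a frequent NaN source with sparse
        # mappings; a large finite negative value keeps the support
        # computation well-defined while still zeroing masked positions.
        scores = scores.masked_fill(mask == 0, -1e4)
    attn = entmax15(scores, dim=-1)  # drop-in replacement for softmax
    return attn @ v
```

If the loss is NaN on the very first iteration, it may be worth checking the masked score tensor for `-inf` or `nan` entries before the sparse mapping is applied.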
I had the same issue.