Open kroggen opened 2 weeks ago
Thanks for your careful checking. Actually, after exp() all values are positive, so softmax is equal to exp_l1_norm. I'm only multiplying by `inputs.shape[dim]` here to balance the variance, so that we can achieve relatively good performance. If you remove this, the performance will be really bad.
If we directly use Softmax without scaling by the token dimension, the std of the outputs will be very low and the performance will be very poor.
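For intuition, a toy check in plain PyTorch (not the actual model code): with token dimension d, the entries of a standard softmax are around 1/d, so their std is tiny, and multiplying by d brings them back to the order of 1.

```python
import torch

d = 1024
x = torch.randn(8, d)              # toy attention scores, 8 rows

plain = torch.softmax(x, dim=-1)   # each row sums to 1, entries are ~1/d
scaled = plain * d                 # the extra factor by the token dimension

print(plain.std().item())          # very small, shrinks as d grows
print(scaled.std().item())         # on the order of 1
```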
In your code there is this:
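(Sketched below in PyTorch; the function name `modified_softmax` is hypothetical and the exact code in the repository may differ, but the pattern follows the discussion: exp, divide by the L1 norm, multiply by `inputs.shape[dim]`.)

```python
import torch

def modified_softmax(inputs: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # softmax   <- the comment being discussed
    exp_inputs = torch.exp(inputs)
    # divide by the L1 norm of the exponentials, then scale by the token dimension
    return exp_inputs / exp_inputs.abs().sum(dim=dim, keepdim=True) * inputs.shape[dim]
```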
It turns out that softmax uses a division by the sum of exponentials:

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$$

But your code is using the sum of the absolute values.
The sum considers the sign of negative values, while the L1 norm does not:

$$\sum_j x_j \quad \text{vs.} \quad \lVert x \rVert_1 = \sum_j \lvert x_j \rvert$$
The comment should be: `modified softmax = exp_l1_norm`
It is also multiplying by the token dimension, so in this case the sum of the attention scores is not 1.
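A quick toy check of that point (assuming the sketch above): with the extra factor, each row of the output sums to the token dimension d rather than to 1.

```python
import torch

x = torch.randn(2, 8)
e = torch.exp(x)
out = e / e.abs().sum(dim=-1, keepdim=True) * x.shape[-1]
print(out.sum(dim=-1))   # roughly tensor([8., 8.]): each row sums to d = 8, not to 1
```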