Haiyang-W / TokenFormer

Official Implementation of TokenFormer: Rethinking Transformer Scaling with Tokenized Model Parameters
https://haiyang-w.github.io/tokenformer.github.io/
Apache License 2.0

Softmax #9

Open kroggen opened 2 weeks ago

kroggen commented 2 weeks ago

In your code there is this:

        if normalize_type == 'softmax': 
            # NOTE: softmax = exp_l1_norm
            # outputs = F.softmax(inputs, dim=dim) * inputs.shape[dim]
            nonlinear_outputs = torch.exp(inputs)
            norm_outputs = nonlinear_outputs / torch.norm(nonlinear_outputs, p=1, dim=dim, keepdim=True) * inputs.shape[dim]
            outputs = norm_outputs

It turns out that softmax uses a division by the sum of exponentials:

softmax(x)_i = exp(x_i) / sum(exp(x_j))

But your code is dividing by the L1 norm, i.e. the sum of the absolute values.

The sum takes the sign of negative values into account:

 sum([-2, 1, 3]) = -2 + 1 + 3 = 2

While the L1 norm does not:

  L1([-2, 1, 3]) = |-2| + |1| + |3| = 2 + 1 + 3 = 6

So the comment softmax = exp_l1_norm should be modified.

The code is also multiplying by the token dimension, so in that case the attention scores do not sum to 1 (see the sketch below).
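
A minimal sketch (not the repository code) illustrating both points:

    import torch

    # Sum vs. L1 norm on signed values
    x = torch.tensor([-2.0, 1.0, 3.0])
    print(x.sum().item())             # 2.0
    print(torch.norm(x, p=1).item())  # 6.0

    # The extra * inputs.shape[dim] factor means the outputs no longer sum to 1
    inputs = torch.randn(4, 8)
    dim = -1
    nonlinear_outputs = torch.exp(inputs)
    norm_outputs = nonlinear_outputs / torch.norm(nonlinear_outputs, p=1, dim=dim, keepdim=True) * inputs.shape[dim]
    print(norm_outputs.sum(dim=dim))  # each row sums to 8.0, not 1.0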

Haiyang-W commented 2 weeks ago

Thanks for your careful checking. Actually, after exp() the values are positive, so softmax is equal to exp_l1_norm. I'm only multiplying by 'inputs.shape[dim]' here to balance the variance, so that we can achieve relatively good performance. If you remove this, the performance will be really bad.
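
A quick numerical check of the equivalence (a minimal sketch, not the code from the repository):

    import torch
    import torch.nn.functional as F

    inputs = torch.randn(4, 8)
    dim = -1

    # exp() makes every entry positive, so the L1 norm equals the plain sum
    # of exponentials and dividing by it reproduces softmax exactly.
    nonlinear_outputs = torch.exp(inputs)
    exp_l1_norm = nonlinear_outputs / torch.norm(nonlinear_outputs, p=1, dim=dim, keepdim=True)

    print(torch.allclose(exp_l1_norm, F.softmax(inputs, dim=dim)))  # True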

Haiyang-W commented 2 weeks ago

Directly using softmax without scaling by the token dimension, the std will be very low and the performance will be very poor.
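
A rough illustration of the variance argument (assumed random scores, not a measurement from the model):

    import torch
    import torch.nn.functional as F

    n_tokens = 1024
    scores = torch.randn(16, n_tokens)

    plain = F.softmax(scores, dim=-1)  # entries are O(1/n_tokens), so the std is tiny
    scaled = plain * n_tokens          # scaling by the token dimension restores the magnitude

    print(plain.std().item())          # very small
    print(scaled.std().item())         # about n_tokens times larger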