facebookresearch / MemoryMosaics

Memory Mosaics are networks of associative memories working in concert to achieve a prediction task.
Apache License 2.0

What is the purpose of exp_scaling? #3

Closed: rajibrhasan closed this issue 2 months ago

rajibrhasan commented 2 months ago

There are exp_scaling attributes in LeakyAvg, KeyFeatureExtractor, and ValueFeatureExtractor in mosaic_model.py under the 'nanoMosaic' directory. What purpose do these exp_scaling values serve?

niklasnolte commented 2 months ago

They amplify the gradients of the scaling parameters in the exponential, something we have found to help in practice.

rajibrhasan commented 2 months ago
import torch
import torch.nn as nn

class LeakyAvg(nn.Module):
    def __init__(self, block_size, n_head):
        super().__init__()
        # coef[t, s] = -(t - s) for s <= t: the (negated) time lag between positions
        coef = torch.zeros(block_size, block_size)
        for i in range(block_size):
            coef = torch.diagonal_scatter(coef, -torch.ones(block_size - i) * i, -i)
        self.register_buffer('coef', coef)
        # Reparameterization: the parameter is stored divided by exp_scaling
        # and multiplied back in forward(), which amplifies its gradients.
        self.exp_scaling = 10
        self.leaky_key_beta = nn.Parameter(torch.linspace(0.5, 5, n_head).view(1, n_head, 1, 1) / self.exp_scaling)

    def forward(self, k):
        B, nh, T, hs = k.size()
        # Recover the effective per-head beta (kept positive via abs())
        leaky_key_beta = self.leaky_key_beta.abs() * self.exp_scaling
        coef = self.coef[:T, :T].view(1, 1, T, T)
        # exp(-(t - s) * beta): exponential decay with the time lag
        coef = torch.exp(coef * leaky_key_beta)
        # Causal (lower-triangular) weighted sum over past keys
        return coef.tril() @ k

Thank you for the response. In the above code, self.leaky_key_beta is divided by self.exp_scaling at initialization and then multiplied by self.exp_scaling in the forward call. At first glance this looks like a no-op, since it just recovers the original self.leaky_key_beta, yet you say it is purely for amplifying the gradients of the scaling parameters in the exponential. I have another question: is there any particular reason for initializing self.leaky_key_beta as a torch.linspace between 0.5 and 5?

TjuJianyu commented 2 months ago

It amplifies gradients. You can compare the gradients of the following two expressions: $y_1 = (\alpha x_1)^2$ and $y_2 = x_2^2$, where $x_1$ is initialized to $x_2 / \alpha$ (small) and $\alpha > 1$. Then $y_1' = 2\alpha^2 x_1 = 2\alpha x_2$ and $y_2' = 2 x_2$. Comparing the two, you will find that $y_1' = \alpha y_2' > y_2'$. Thus, this reparameterization trick amplifies gradients.
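For illustration, here is a minimal standalone sketch (not code from the repo) that reproduces this comparison numerically with autograd:

import torch

# Reparameterization trick: store x1 = x2 / alpha and multiply back by alpha
# inside the function. The forward value is unchanged, but the gradient with
# respect to the stored parameter is alpha times larger.
alpha = 10.0

x2 = torch.tensor(0.3, requires_grad=True)          # plain parameterization
x1 = torch.tensor(0.3 / alpha, requires_grad=True)  # stored pre-divided by alpha

y2 = x2 ** 2            # y2 = x2^2
y1 = (alpha * x1) ** 2  # y1 = (alpha * x1)^2, same forward value as y2

y1.backward()
y2.backward()

print(x2.grad)  # 2 * x2 = 0.6
print(x1.grad)  # 2 * alpha^2 * x1 = alpha * (2 * x2) = 6.0

In LeakyAvg this is the role of exp_scaling: leaky_key_beta is stored pre-divided by exp_scaling and multiplied back in forward(), so its effective value is unchanged while its gradient is scaled up by exp_scaling.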

$e^{-\beta}$, with $\beta$ being leaky_key_beta, is the decay factor in the leaky average (hence it falls in $(0, 1)$). Different heads in Memory Mosaics require different decay factors to summarize information at different resolutions. This initialization provides a range of decay factors $e^{-\beta}$ (check coef) from roughly 0 to roughly 0.6.
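For reference, the implied decay range can be checked directly with a small sketch; n_head = 8 is just an example value here, the real one comes from the model configuration:

import torch

n_head = 8  # example head count, chosen arbitrarily for illustration
leaky_key_beta = torch.linspace(0.5, 5, n_head)

# Per-head decay factor e^{-beta}: how quickly each head forgets past
# timesteps in the leaky average.
decay = torch.exp(-leaky_key_beta)
print(decay)  # from ~0.61 (slow decay, long memory) down to ~0.0067 (fast decay)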

rajibrhasan commented 2 months ago

Thank you once again for the clarification.