Open · arunmallya opened this issue 4 months ago
I agree with you.
My version:
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, layer_shape, eps=1e-8, bias=False):
        super().__init__()
        # learnable gain; expects layer_shape = (max_seq_len, d_model) so it can be sliced to the current seq_len
        self.register_parameter('scale', nn.Parameter(torch.ones(layer_shape)))
        self.eps = eps

    def forward(self, x):
        """
        assumes shape is (batch, seq_len, d_model)
        """
        # RMS is taken over the feature dimension only, so each token is normalized independently
        f = torch.rsqrt(torch.mean(x.pow(2), dim=-1, keepdim=True) + self.eps)
        return x * f * self.scale[:x.shape[1], :].unsqueeze(0)
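A quick sanity check (just a sketch with made-up sizes) that this version really normalizes per token: with the scale still at its initial value of ones, every token's feature vector should come out with RMS close to 1.

norm = RMSNorm(layer_shape=(5, 16))        # (max_seq_len, d_model), arbitrary sizes
x = torch.randn(2, 5, 16) * 3.0            # (B, T, D)
y = norm(x)
print(y.pow(2).mean(dim=-1).sqrt())        # per-token RMS, all entries ~1.0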
hi! open a PR?
The RMSNorm implementation in this codebase is wrong: it computes the RMS over the (T, D) dimensions instead of over the (D) dimension alone. Assume the input x is of shape (B, T, D). The current code does this:
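Schematically (a sketch of the per-sequence pattern being described, with made-up sizes, not the codebase's exact snippet), it normalizes by a single Frobenius norm taken over both the sequence and feature dimensions:

import torch

x = torch.randn(2, 5, 16)                  # (B, T, D), made-up sizes
# one Frobenius norm over the (T, D) dims -> a single RMS value per batch element,
# so every token in a sequence gets divided by the same factor (per-sequence norm)
ff_rms = torch.linalg.norm(x, dim=(1, 2), keepdim=True) * (x.shape[1] * x.shape[2]) ** -0.5
x_normed = x / (ff_rms + 1e-8)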
The original RMSNorm is here - https://github.com/meta-llama/llama/blob/main/llama/model.py#L34-L77
The correct version using the Frobenius norm would be:
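A minimal sketch of such a per-token version (the norm is taken over the feature dimension only, so each of the T tokens gets its own normalizer; sizes and names are illustrative):

import torch

x = torch.randn(2, 5, 16)                  # (B, T, D), made-up sizes
# 2-norm over the feature dim only -> one RMS value per token, shape (B, T, 1)
rms = torch.linalg.norm(x, dim=-1, keepdim=True) * x.shape[-1] ** -0.5
x_normed = x / (rms + 1e-8)                # per-token normalization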
Normalization should be per-token, not per-sequence.