facebookresearch / ConvNeXt

Code release for ConvNeXt model
MIT License

there is no need to rewrite the 'class LayerNorm(nn.Module)' #112

Open REN-Yuke opened 2 years ago

REN-Yuke commented 2 years ago

The reason for rewriting 'class LayerNorm(nn.Module)' is that you assume the layer norm provided by PyTorch only supports the 'channels_last' format (batch_size, height, width, channels), so you wrote a new version to support the 'channels_first' format (batch_size, channels, height, width). However, I found that F.layer_norm and nn.LayerNorm do not require a particular order of channels, height and width, because F.layer_norm derives the dimensions to normalize over from the trailing dimensions given by 'normalized_shape' when computing the mean and variance.

Specifically, the PyTorch implementation uses every value in an image to calculate one pair of mean and variance, and every value in the image is normalized with those two numbers. But your implementation uses the values across channels at every spatial point, producing a separate pair of mean and variance per spatial point.

When I changed the following code in convnext.py, I found it does the same thing as PyTorch's 'F.layer_norm' or 'nn.LayerNorm'. https://github.com/facebookresearch/ConvNeXt/blob/d1fa8f6fef0a165b27399986cc2bdacc92777e40/models/convnext.py#L119

u = x.mean([1, 2, 3], keepdim=True)
# u = x.mean(1, keepdim=True)  # original code
s = (x - u).pow(2).mean([1, 2, 3], keepdim=True)
# s = (x - u).pow(2).mean(1, keepdim=True)  # original code
x = self.weight[:, None, None] * x + self.bias[:, None, None]  # per-channel affine, kept as in the original code

There is no need to rewrite 'class LayerNorm(nn.Module)'; it's just a misunderstanding about the LayerNorm implementation.
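
For reference, here is a quick numerical check I put together (a sketch, not code from the repo; the shapes are arbitrary) showing that F.layer_norm with normalized_shape=(C, H, W) gives the same result as the manual normalization over dims [1, 2, 3]:

import torch
import torch.nn.functional as F

N, C, H, W = 2, 4, 8, 8
x = torch.randn(N, C, H, W)

# manual per-sample statistics over all of (C, H, W)
u = x.mean([1, 2, 3], keepdim=True)
s = (x - u).pow(2).mean([1, 2, 3], keepdim=True)
manual = (x - u) / torch.sqrt(s + 1e-6)

# built-in LayerNorm over the last three dims (no affine, for a clean comparison)
builtin = F.layer_norm(x, (C, H, W), eps=1e-6)

print(torch.allclose(manual, builtin, atol=1e-5))  # expected: True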

REN-Yuke commented 2 years ago

Now I realize the 'layer normalization' in your code is different from the original layer normalization.

The original layer normalization should look like this: [image: normalization computed over all of C, H, W]

But all the 'layer normalization' in ConvNeXt looks like this (it looks more like a kind of 'depth-wise normalization', which I think would be a more appropriate name): [image: normalization computed over the channel dimension only]

So did you choose this kind of 'LayerNorm' (or 'depth-wise normalization') for convenience of implementation in PyTorch?
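
To make the difference concrete, here is a small sketch (my own toy shapes) of where the statistics live in the two cases: one (mean, variance) pair per image versus one pair per spatial position.

import torch

x = torch.randn(2, 4, 8, 8)  # (N, C, H, W)

# 'original' LayerNorm: one mean per sample, statistics of shape (N, 1, 1, 1)
u_layer = x.mean([1, 2, 3], keepdim=True)

# ConvNeXt channels_first LayerNorm: one mean per spatial point, shape (N, 1, H, W)
u_channel = x.mean(1, keepdim=True)

print(u_layer.shape, u_channel.shape)  # torch.Size([2, 1, 1, 1]) torch.Size([2, 1, 8, 8])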

liuzhuang13 commented 2 years ago

It's because the LayerNorm in Transformers generally normalizes over only the channel dimension, without normalizing the token/spatial dimensions, so we followed them.

I don't think the LN figure in the GroupNorm paper represents the "original" LN. The original LN was developed for RNNs, where each layer only has a channel dimension and no token/spatial dimensions.
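
For illustration (my sketch, not from the codebase), this is the typical Transformer usage: tokens of shape (batch, sequence, channels), with nn.LayerNorm normalizing each token over the channel dimension only.

import torch
import torch.nn as nn

B, L, C = 2, 16, 96
tokens = torch.randn(B, L, C)

ln = nn.LayerNorm(C)  # normalized_shape = C, the last dimension
out = ln(tokens)      # each of the B*L tokens is normalized independently

print(out.shape)  # torch.Size([2, 16, 96])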

REN-Yuke commented 2 years ago

Thanks for your explanation. !(^_^)! Sorry, I'm not familiar with NLP. Your explanation makes sense. Now I understand why you rewrote the LayerNorm; it seems to come down to the difference between image and text data formats, and I agree.

But I think the normalization approach illustrated in the GroupNorm paper may be more consistent with the name 'layer norm' in image processing. The original LayerNorm paper used the name "Layer" because it uses all the inputs in a layer to compute the two statistics.

[images: excerpts from the original LayerNorm paper defining the per-layer mean and variance]

I wonder whether this change would make the results a little better or not. Anyway, it doesn't detract from your wonderful work! (>_o)

liuzhuang13 commented 2 years ago

If I understand correctly, normalizing over all C, H, W dimensions is equivalent to GroupNorm with #groups=1. We haven't had a chance to try this, though. The PoolFormer paper uses this as its default.
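
A quick check of that equivalence (my own sketch; the affine parameters are disabled so only the normalization itself is compared):

import torch
import torch.nn as nn
import torch.nn.functional as F

N, C, H, W = 2, 4, 8, 8
x = torch.randn(N, C, H, W)

gn = nn.GroupNorm(num_groups=1, num_channels=C, eps=1e-6, affine=False)
ln_like = F.layer_norm(x, (C, H, W), eps=1e-6)  # normalize over all of (C, H, W)

print(torch.allclose(gn(x), ln_like, atol=1e-5))  # expected: True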

ppwwyyxx commented 2 years ago

FYI, section 6.7 of the LayerNorm paper talks about CNNs. Although it does not clearly say how LN is applied to an (N, C, H, W) tensor, the wording does give some hints:

With fully connected layers, all the hidden units in a layer tend to make similar contributions to the final prediction and re-centering and rescaling the summed inputs to a layer works well. However, the assumption of similar contributions is no longer true for convolutional neural networks. The large number of the hidden units whose receptive fields lie near the boundary of the image are rarely turned on and thus have very different statistics from the rest of the hidden units within the same layer.

My reading of it is that the "original" LayerNorm does normalize over (C, H, W) (and they think this might not be a good idea).

Although from today's Transformer point of view, H and W become the "sequence" dimension, and it then becomes natural to normalize only over the C dimension. And btw, "positional normalization" https://arxiv.org/pdf/1907.04312.pdf seems to be the first to formally name such an operation for CNNs.
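
As a sketch of that operation (my own minimal version, written directly rather than taken from the paper), positional normalization computes statistics over the channel dimension at every spatial position of an (N, C, H, W) tensor, which is the same computation as ConvNeXt's channels_first LayerNorm up to the affine parameters:

import torch

x = torch.randn(2, 4, 8, 8)  # (N, C, H, W)
eps = 1e-6

u = x.mean(1, keepdim=True)               # (N, 1, H, W)
s = (x - u).pow(2).mean(1, keepdim=True)  # (N, 1, H, W)
x_norm = (x - u) / torch.sqrt(s + eps)    # one normalization per spatial position

print(x_norm.shape)  # torch.Size([2, 4, 8, 8])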