Why LayerNorm before conv in downsampling layers?

Thanks for your awesome work!

The stem is consistent with the blocks, which both use the ordering conv -> norm, but in the downsampling layers you put LayerNorm before the convolution.

The full path is:

This means that if the residual blocks of stage 1 converge to the identity, one LayerNorm feeds directly into another LayerNorm, which seems odd to me:

Can you explain this design choice?
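
To make the concern concrete, here is a minimal pure-Python sketch of what I mean. It is not the repository's code. The function names and input values are made up, LayerNorm is written without its learnable scale and shift, and stage 1 is modeled as an exact identity, which is only the limiting case I am asking about.

```python
import math


def layer_norm(x, eps=1e-6):
    """LayerNorm over one feature vector, without learnable scale/shift (assumption)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def identity_stage(x):
    """Hypothetical stand-in for stage 1 when every residual branch outputs zero."""
    return list(x)


# Made-up channel values standing in for the output of the stem convolution.
stem_conv_out = [0.5, -1.2, 3.0, 0.7]

stem_out = layer_norm(stem_conv_out)           # stem: conv -> norm
stage1_out = identity_stage(stem_out)          # stage 1 collapsed to identity
downsample_norm_out = layer_norm(stage1_out)   # downsampling: norm -> conv

# Largest element-wise change introduced by the second normalization.
max_diff = max(abs(a - b) for a, b in zip(stem_out, downsample_norm_out))
print(max_diff)
```

Under these assumptions the second normalization changes its input only by an amount on the order of `eps`, so it is close to a no-op. With learnable scale and shift in the first LayerNorm this would no longer hold exactly.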

facebookresearch / ConvNeXt