Open F-Barto opened 2 years ago
Thanks for your awesome work!
While the stem is consistent with the Blocks, which use the ordering conv->norm, the downsampling layers put LayerNorm before the convolution.
The full path is:
conv2d 4x4, stride 4
layernorm
residual stage 1
layernorm
conv2d 2x2, stride 2
residual stage 2
layernorm
conv2d 2x2, stride 2
residual stage 3
layernorm
conv2d 2x2, stride 2
residual stage 4
This means that if residual stage 1 converges to the identity, we have a LayerNorm feeding directly into another LayerNorm, which seems weird to me.
Can you explain this design choice?
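To make the concern concrete, here is a minimal sketch in plain Python of what happens when one normalization feeds directly into another. The `layer_norm` helper is hypothetical and only for illustration: it normalizes over a single list of channel values and leaves out the learnable weight and bias, so it is not the repository's LayerNorm.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a list of values to zero mean and (almost) unit variance.

    Assumption: no learnable weight or bias, unlike a real LayerNorm module.
    """
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


# Hypothetical channel values at one spatial position.
x = [0.5, -1.2, 3.0, 0.7]

# If residual stage 1 behaves like the identity, the stem's LayerNorm output
# goes straight into the LayerNorm of the first downsampling layer.
once = layer_norm(x)
twice = layer_norm(once)

# Largest element-wise difference between one and two normalizations.
max_diff = max(abs(a - b) for a, b in zip(once, twice))
print(max_diff)
```

Under these assumptions the second normalization changes the values only by a tiny amount that comes from `eps`, which is why the back-to-back LayerNorms look redundant. With a non-uniform learnable weight and bias between the two, the second LayerNorm would generally not be a no-op.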