Closed · rwightman closed this 1 year ago
I was looking at the mvitv2 model for inclusion in timm and discovered an inconsistency: https://github.com/facebookresearch/mvit/blob/f05fc78265222c5c7a047bb00de137fcc93bf764/mvit/models/attention.py#L377-L388

When the shortcut projection is used in the attention path, the shortcut is computed from the normalized input; when the projection is not used, the shortcut is the non-normalized input. It's likely not a significant issue either way, but was this the intended behaviour, i.e. to use the normalized x in the attention shortcut only when the shortcut projection is used?
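For context, here is a minimal sketch of the pattern in question. It is a hypothetical, simplified block, not the actual mvit code: in the real model the dimension change happens inside the pooled attention, which is approximated here with an extra linear on the attention output so the shapes line up.

```python
import torch
import torch.nn as nn


class BlockSketch(nn.Module):
    """Simplified pre-norm attention block with a conditional shortcut.

    Illustrative stand-in for the mvit attention block; module names
    (norm1, attn, shortcut_proj) are hypothetical.
    """

    def __init__(self, dim: int, dim_out: int, num_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Stand-in for the dim change that really happens inside attention.
        self.attn_out_proj = nn.Linear(dim, dim_out) if dim != dim_out else nn.Identity()
        # The shortcut projection exists only when the channel dim changes
        # (between two stages in MViT).
        self.shortcut_proj = nn.Linear(dim, dim_out) if dim != dim_out else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_norm = self.norm1(x)
        attn_out, _ = self.attn(x_norm, x_norm, x_norm)
        attn_out = self.attn_out_proj(attn_out)
        if self.shortcut_proj is not None:
            # Dimension change: shortcut comes from the *normalized* input.
            shortcut = self.shortcut_proj(x_norm)
        else:
            # No dimension change: shortcut is the raw, non-normalized x.
            shortcut = x
        return shortcut + attn_out


x = torch.randn(2, 16, 64)
print(BlockSketch(64, 64)(x).shape)   # torch.Size([2, 16, 64])  shortcut is raw x
print(BlockSketch(64, 128)(x).shape)  # torch.Size([2, 16, 128]) shortcut is proj(norm(x))
```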
Spending more time looking at the design, it likely was intentional...

Hi @rwightman, thanks for your interest in our MViT model. Yes, this is an intentional design. As you mentioned, the shortcut projection is only used when there is a dimension change (this happens between two stages in the MViT model). As for the second question, we think using the normalized input for the projection layer can help training. Our ablation with MViTv2-S on K400 also shows a +0.5 gain from this design.