facebookresearch / mvit

Code Release for MViTv2 on Image Recognition.

Possible bug in MultiScaleBlock? #15

Closed · rwightman closed this 1 year ago

rwightman commented 1 year ago

I was looking at the mvitv2 model for inclusion in timm and discovered an inconsistency...

https://github.com/facebookresearch/mvit/blob/f05fc78265222c5c7a047bb00de137fcc93bf764/mvit/models/attention.py#L377-L388

If the shortcut projection is used in the attention path, it consumes the normalized input; if the shortcut projection is not used, the shortcut uses the non-normalized input. It's likely not a significant issue either way, but was this the intended behaviour: to use the normalized x in the attention shortcut only when the shortcut projection is used?
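
For reference, the pattern in question boils down to something like the following. This is a minimal, runnable sketch, not the repo's code: an `nn.Linear` stands in for MViT's pooled attention (which maps `dim` to `dim_out` when the stage width changes), and all class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class BlockSketch(nn.Module):
    """Minimal sketch of the shortcut logic under discussion.

    A Linear stands in for MViT's pooled attention, which maps
    dim -> dim_out when the channel width changes between stages.
    """

    def __init__(self, dim, dim_out):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.Linear(dim, dim_out)  # stand-in for pooled attention
        # The shortcut projection exists only when the width changes.
        self.proj = nn.Linear(dim, dim_out) if dim != dim_out else None

    def forward(self, x):
        x_norm = self.norm1(x)  # pre-norm, feeds the attention path
        x_attn = self.attn(x_norm)
        if self.proj is not None:
            # Width change: the shortcut is the *normalized* input, projected.
            shortcut = self.proj(x_norm)
        else:
            # No width change: the shortcut is the raw, non-normalized input,
            # as in a standard pre-norm ViT block.
            shortcut = x
        return shortcut + x_attn

tokens = torch.randn(2, 16, 96)
same = BlockSketch(96, 96)(tokens)    # shortcut = x
wider = BlockSketch(96, 192)(tokens)  # shortcut = proj(norm(x))
```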

rwightman commented 1 year ago

After spending more time looking at the design, this likely was intentional...

lyttonhao commented 1 year ago

Hi @rwightman, thanks for your interest in our MViT model. Yes, this is an intentional design.

The shortcut projection is only used when there is a dimension change (this happens between two stages in the MViT model). As you mentioned:

  1. When the shortcut projection is not used, we use the non-normalized input, as in standard ViT models.
  2. When the shortcut projection is used, we use the normalized input for the projection layer.

For case 2, we think using the normalized input for the projection layer could help training. Our ablation with MViTv2-S on K400 also shows a +0.5 gain from this design.
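
For concreteness, the ablated baseline in that comparison (projecting the raw, non-normalized input) would differ from the sketch above by a single line. Again, this is an illustrative reconstruction reusing the hypothetical `BlockSketch`, not code from the repo:

```python
class BlockSketchRawShortcut(BlockSketch):
    """Ablated variant: the shortcut projection consumes the raw input.

    Per the authors' K400 ablation with MViTv2-S, projecting the
    normalized input (BlockSketch above) was +0.5 better than this.
    """

    def forward(self, x):
        x_attn = self.attn(self.norm1(x))
        # Project the non-normalized input instead of x_norm.
        shortcut = self.proj(x) if self.proj is not None else x
        return shortcut + x_attn
```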