facebookresearch / mvit

Code Release for MViTv2 on Image Recognition.
Apache License 2.0

MultiScaleBlock potentially adds two tensors with mismatched shapes #22

Open chopwoodwater opened 3 months ago

chopwoodwater commented 3 months ago

Hi Team,

In mvit/models/attention.py, MultiScaleBlock adds x_res and x_block. However, there is no guarantee that these two tensors have matching shapes, so the addition can fail.

For example, suppose x.shape is [8, 16, 64] and hw_shape is [4, 4], so that with time T = 1 the sequence length is L = 16 = 4 * 4. All kernel and stride sizes for q, k, and v are (2, 2).

x_res will then have shape [8, 4, 64], while x_block will have shape [8, 9, 64]. Adding these two tensors raises a runtime error stating that the tensor sizes must match at non-singleton dimension 1.

    def forward(self, x, hw_shape):
        x_norm = self.norm1(x)
        x_block, hw_shape_new = self.attn(x_norm, hw_shape)

        if self.dim_mul_in_att and self.dim != self.dim_out:
            x = self.proj(x_norm)
        x_res, _ = attention_pool(
            x, self.pool_skip, hw_shape, has_cls_embed=self.has_cls_embed
        )
        x = x_res + self.drop_path(x_block)
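
The token counts in the example above can be checked with plain arithmetic, without running the model. The sketch below uses the standard output-size formula for convolution and pooling windows (dilation 1, floor rounding). The helper names `pooled_len` and `token_count` are made up for illustration. The padding and skip-kernel choices are my assumptions about how the block configures its pooling (q pooled with padding kernel // 2, skip path pooled with kernel stride + 1), so please correct me if the code does something different.

```python
def pooled_len(size, kernel, stride, padding):
    """Output length of a conv/pool window along one axis (dilation 1, floor rounding)."""
    return (size + 2 * padding - kernel) // stride + 1


def token_count(hw_shape, kernel, stride, padding):
    """Number of tokens after pooling an H x W grid with a square window."""
    h, w = hw_shape
    return pooled_len(h, kernel, stride, padding) * pooled_len(w, kernel, stride, padding)


hw_shape = (4, 4)
stride = 2

# Assumption: the attention path pools q with kernel 2 and padding kernel // 2 = 1.
kernel_q = 2
block_tokens = token_count(hw_shape, kernel_q, stride, kernel_q // 2)

# Assumption: the skip path pools with kernel stride + 1 = 3 and padding 3 // 2 = 1.
kernel_skip = stride + 1
res_tokens = token_count(hw_shape, kernel_skip, stride, kernel_skip // 2)

print(block_tokens, res_tokens)  # 9 4
```

Under the same assumptions, an odd q kernel of 3 with stride 2 gives `token_count((4, 4), 3, 2, 1) == 4`, which matches the skip path. That suggests the mismatch is specific to even kernel sizes.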