Hi Team,

In mvit/models/attention.py, MultiScaleBlock adds the input x_res and x_block. However, there is no guarantee that these two tensors have compatible shapes for the addition.

For example, suppose x.shape is [8, 16, 64] and hw_shape is [4, 4] (with time T = 1, so L = 16 = 4 * 4), and all kernel and stride sizes of q, k, v are (2, 2). Then x_res will have shape [8, 4, 64] while x_block will have shape [8, 9, 64]. Adding these two tensors raises a runtime error saying the shape of both tensors must match at non-singleton dimension 1.

The addition happens in MultiScaleBlock.forward:
def forward(self, x, hw_shape):
    x_norm = self.norm1(x)
    # x_block is pooled by the q pooling inside self.attn
    x_block, hw_shape_new = self.attn(x_norm, hw_shape)
    if self.dim_mul_in_att and self.dim != self.dim_out:
        x = self.proj(x_norm)
    # x_res is pooled by self.pool_skip, which can yield a different token count
    x_res, _ = attention_pool(
        x, self.pool_skip, hw_shape, has_cls_embed=self.has_cls_embed
    )
    # in the example above: x_res is [8, 4, 64] but x_block is [8, 9, 64]
    x = x_res + self.drop_path(x_block)
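For context, the token counts follow the usual pooling formula floor((H + 2p - k) / s) + 1. Below is a minimal sketch of the mismatch, not the repo's actual pooling configuration: the padding used by the q pooling and by pool_skip depends on the model config, so the two MaxPool2d layers are hypothetical stand-ins chosen only to reproduce the 9-vs-4 token counts from the example above.

import torch
import torch.nn as nn

B, H, W, C = 8, 4, 4, 64
x = torch.randn(B, C, H, W)

# assumed q-path pooling: kernel 2, stride 2, padding 1 -> 4x4 becomes 3x3 = 9 tokens
pool_q = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=(1, 1))
# assumed skip-path pooling: kernel 2, stride 2, no padding -> 4x4 becomes 2x2 = 4 tokens
pool_skip = nn.MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=(0, 0))

x_block = pool_q(x).flatten(2).transpose(1, 2)   # [8, 9, 64]
x_res = pool_skip(x).flatten(2).transpose(1, 2)  # [8, 4, 64]

print(x_block.shape, x_res.shape)
x_res + x_block  # RuntimeError: sizes must match at non-singleton dimension 1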