MCG-NJU / VideoMAE

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
https://arxiv.org/abs/2203.12602

Bias in "Attention" layer. #41

Closed · PeisenZhao closed this issue 2 years ago

PeisenZhao commented 2 years ago

In modeling_finetune.py, line 82, why do you set the bias of k to 0 (with requires_grad=False)? This means the model only learns the bias parameters of q and v, whereas the official timm code for images learns all three biases of q, k, and v. Is there something special about videos?

BTW, why is qkv computed in lines 84-85 instead of line 83, which is commented out? Thanks!

def forward(self, x):
    B, N, C = x.shape
    qkv_bias = None
    if self.q_bias is not None:
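        # k's bias slot is frozen zeros, so only the q and v biases are learned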
        qkv_bias = torch.cat((self.q_bias, torch.zeros_like(self.v_bias, requires_grad=False), self.v_bias))
    # qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
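    # self.qkv carries no bias of its own, so the assembled bias is applied via F.linear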
    qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
    qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
    q, k, v = qkv[0], qkv[1], qkv[2]   # make torchscript happy (cannot use tensor as tuple)

    q = q * self.scale
    attn = (q @ k.transpose(-2, -1))

    attn = attn.softmax(dim=-1)
    attn = self.attn_drop(attn)

    x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
    x = self.proj(x)
    x = self.proj_drop(x)
    return x
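As context for the second question: the forward pass above hands the bias to F.linear explicitly, which indicates the qkv Linear is constructed with bias=False and only q_bias and v_bias exist as learnable parameters; the commented-out line 83 would therefore apply no bias at all. A minimal sketch of what the matching __init__ presumably looks like (simplified, with names following the snippet above and the BEiT implementation this code is based on):

    import torch
    import torch.nn as nn

    class Attention(nn.Module):
        # simplified sketch; the real module has more options (qk_scale, etc.)
        def __init__(self, dim, num_heads=8, qkv_bias=True, attn_drop=0., proj_drop=0.):
            super().__init__()
            self.num_heads = num_heads
            head_dim = dim // num_heads
            self.scale = head_dim ** -0.5

            # joint qkv projection with no built-in bias ...
            self.qkv = nn.Linear(dim, dim * 3, bias=False)
            if qkv_bias:
                # ... only q and v get learnable biases; the k slot is
                # filled with frozen zeros inside forward()
                self.q_bias = nn.Parameter(torch.zeros(dim))
                self.v_bias = nn.Parameter(torch.zeros(dim))
            else:
                self.q_bias = None
                self.v_bias = None

            self.attn_drop = nn.Dropout(attn_drop)
            self.proj = nn.Linear(dim, dim)
            self.proj_drop = nn.Dropout(proj_drop)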
yztongzhan commented 2 years ago

We follow BEIT and fix the bias of k (it is set to zeros and not learned). Please refer to their code and the related Zhihu post for details.
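The reasoning behind this, briefly: the bias of the k projection adds the same vector b_k to every key, so each attention logit q_i · k_j only gains the term q_i · b_k, which is constant over j for a fixed query i. Since softmax is invariant to adding a per-row constant, a learned k bias can never change the attention weights, so it is dropped as a redundant parameter. A minimal standalone check (not VideoMAE code) that demonstrates this:

    import torch

    torch.manual_seed(0)
    B, H, N, D = 2, 4, 8, 16            # batch, heads, tokens, head dim
    q = torch.randn(B, H, N, D)
    k = torch.randn(B, H, N, D)
    k_bias = torch.randn(D)             # hypothetical learned bias on the k projection

    scale = D ** -0.5
    attn = ((q * scale) @ k.transpose(-2, -1)).softmax(dim=-1)
    attn_b = ((q * scale) @ (k + k_bias).transpose(-2, -1)).softmax(dim=-1)

    # the per-row shift (q_i . k_bias) cancels inside the softmax
    print(torch.allclose(attn, attn_b, atol=1e-5))  # True

Note the same argument does not apply to q or v: a q bias contributes b_q · k_j, which varies across keys, and a v bias shifts the output directly, so both remain learnable.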