MCG-NJU / VideoMAE

[NeurIPS 2022 Spotlight] VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
https://arxiv.org/abs/2203.12602

Bias in "Attention" layer. #41

Closed · PeisenZhao closed this issue 2 years ago

PeisenZhao commented 2 years ago

In modeling_finetune.py, line 82, why do you set the bias of k to 0 (with requires_grad=False)? This means the model only learns the bias parameters of q and v, whereas the official timm code for images learns all three biases of q, k, and v. Is there something special about videos?

BTW, why is qkv computed in lines 84-85 instead of line 83, which is commented out? Thanks!

def forward(self, x):
    B, N, C = x.shape
    qkv_bias = None
    if self.q_bias is not None:
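        # k's bias slot is frozen zeros, so only the q and v biases are learned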
        qkv_bias = torch.cat((self.q_bias, torch.zeros_like(self.v_bias, requires_grad=False), self.v_bias))
    # qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
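    # self.qkv carries no bias of its own, so the assembled bias is applied via F.linear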
    qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
    qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
    q, k, v = qkv[0], qkv[1], qkv[2]   # make torchscript happy (cannot use tensor as tuple)

    q = q * self.scale
    attn = (q @ k.transpose(-2, -1))

    attn = attn.softmax(dim=-1)
    attn = self.attn_drop(attn)

    x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
    x = self.proj(x)
    x = self.proj_drop(x)
    return x
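As context for the second question: the forward pass above hands the bias to F.linear explicitly, which indicates the qkv Linear is constructed with bias=False and only q_bias and v_bias exist as learnable parameters; the commented-out line 83 would therefore apply no bias at all. A minimal sketch of what the matching __init__ presumably looks like (simplified, with names following the snippet above and the BEiT implementation this code is based on):

    import torch
    import torch.nn as nn

    class Attention(nn.Module):
        # simplified sketch; the real module has more options (qk_scale, etc.)
        def __init__(self, dim, num_heads=8, qkv_bias=True, attn_drop=0., proj_drop=0.):
            super().__init__()
            self.num_heads = num_heads
            head_dim = dim // num_heads
            self.scale = head_dim ** -0.5

            # joint qkv projection with no built-in bias ...
            self.qkv = nn.Linear(dim, dim * 3, bias=False)
            if qkv_bias:
                # ... only q and v get learnable biases; the k slot is
                # filled with frozen zeros inside forward()
                self.q_bias = nn.Parameter(torch.zeros(dim))
                self.v_bias = nn.Parameter(torch.zeros(dim))
            else:
                self.q_bias = None
                self.v_bias = None

            self.attn_drop = nn.Dropout(attn_drop)
            self.proj = nn.Linear(dim, dim)
            self.proj_drop = nn.Dropout(proj_drop)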
yztongzhan commented 2 years ago

We follow BEIT and fix the bias of k (it is set to zeros and not learned). Please refer to their code and the related Zhihu post for details.
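The reasoning behind this, briefly: the bias of the k projection adds the same vector b_k to every key, so each attention logit q_i · k_j only gains the term q_i · b_k, which is constant over j for a fixed query i. Since softmax is invariant to adding a per-row constant, a learned k bias can never change the attention weights, so it is dropped as a redundant parameter. A minimal standalone check (not VideoMAE code) that demonstrates this:

    import torch

    torch.manual_seed(0)
    B, H, N, D = 2, 4, 8, 16            # batch, heads, tokens, head dim
    q = torch.randn(B, H, N, D)
    k = torch.randn(B, H, N, D)
    k_bias = torch.randn(D)             # hypothetical learned bias on the k projection

    scale = D ** -0.5
    attn = ((q * scale) @ k.transpose(-2, -1)).softmax(dim=-1)
    attn_b = ((q * scale) @ (k + k_bias).transpose(-2, -1)).softmax(dim=-1)

    # the per-row shift (q_i . k_bias) cancels inside the softmax
    print(torch.allclose(attn, attn_b, atol=1e-5))  # True

Note the same argument does not apply to q or v: a q bias contributes b_q · k_j, which varies across keys, and a v bias shifts the output directly, so both remain learnable.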