In modeling_finetune.py line 82, why do you set the bias of k to 0 (with `requires_grad=False`), so that the model only learns the bias parameters of q and v? The official timm code for images learns all three biases of q, k, and v. Is there something special about videos?

BTW, why is the qkv projection written as lines 84-85 instead of line 83, which is commented out? Thanks!
```python
def forward(self, x):
    B, N, C = x.shape
    qkv_bias = None
    if self.q_bias is not None:
        # k gets a frozen all-zero bias; only the q and v biases are learned
        qkv_bias = torch.cat((self.q_bias, torch.zeros_like(self.v_bias, requires_grad=False), self.v_bias))
    # qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
    qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
    qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
    q, k, v = qkv[0], qkv[1], qkv[2]  # make torchscript happy (cannot use tensor as tuple)

    q = q * self.scale
    attn = (q @ k.transpose(-2, -1))

    attn = attn.softmax(dim=-1)
    attn = self.attn_drop(attn)

    x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
    x = self.proj(x)
    x = self.proj_drop(x)
    return x
```
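For context on where `self.q_bias`, `self.v_bias`, and `self.qkv` come from, here is a minimal sketch of the BEiT-style constructor this forward pass presumably pairs with. This is an assumption based on the snippet above, not a quote of the repo; the argument names and dropout defaults are illustrative:

```python
import torch
import torch.nn as nn

class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, qkv_bias=True):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5
        # Single projection for q, k, v with no built-in bias; the bias
        # vector is assembled manually in forward() instead.
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        if qkv_bias:
            # Only q and v get learnable biases; k's slot is filled
            # with zeros at forward time.
            self.q_bias = nn.Parameter(torch.zeros(dim))
            self.v_bias = nn.Parameter(torch.zeros(dim))
        else:
            self.q_bias = None
            self.v_bias = None
        self.attn_drop = nn.Dropout(0.0)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(0.0)
```

To make the line 83 vs. lines 84-85 comparison concrete: when the full bias is present, the two formulations produce the same tensor, and the `F.linear` form just lets the bias be assembled on the fly with k's slot pinned to zero. A standalone check (again, not code from the repo):

```python
import torch
import torch.nn.functional as F

dim = 8
x = torch.randn(2, 4, dim)
qkv = torch.nn.Linear(dim, dim * 3, bias=False)

q_bias = torch.randn(dim)
v_bias = torch.randn(dim)
# k's bias slot is a frozen zero vector between the q and v biases.
qkv_bias = torch.cat((q_bias, torch.zeros_like(v_bias), v_bias))

# Lines 84-85 style: functional call with the manually assembled bias.
out_functional = F.linear(x, qkv.weight, qkv_bias)
# Line 83 style would use nn.Linear's own bias, which is equivalent to:
out_module = qkv(x) + qkv_bias

assert torch.allclose(out_functional, out_module)
```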