Hi, Dr. Zhou, I am curious about the design of the attention layer, since these four lines look different from the usual multi-head attention: https://github.com/ZikangZhou/QCNet/blob/55cacb418cbbce3753119c1f157360e66993d0d0/layers/attention_layer.py#L96C1-L99C40 I would also like to ask why you use an element-wise dot product here rather than the usual matrix multiplication, like:
```python
sim = torch.matmul(q_i, k_j.transpose(1, 2)) * self.scale
attn = softmax(sim, index, ptr)
attn = self.attn_drop(attn)
return torch.matmul(attn, v_j)
```
It's scaled dot-product attention.
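For context, here is a minimal sketch of why the two formulations agree. In a PyG-style message-passing layer the scores are computed per edge, so `q_i` and `k_j` are already the (query, key) pair for one edge; an element-wise product summed over the feature dimension is exactly a scaled dot product for that pair. The dense `matmul` form computes the same scores for all pairs at once. The shapes, names, and fully connected `edge_index` below are illustrative assumptions, not the repository's actual API:

```python
import torch

torch.manual_seed(0)
num_nodes, head_dim = 4, 8
q = torch.randn(num_nodes, head_dim)
k = torch.randn(num_nodes, head_dim)
scale = head_dim ** -0.5

# Dense scores: one matmul over all (query, key) pairs.
sim_dense = (q @ k.transpose(0, 1)) * scale            # [num_nodes, num_nodes]

# Edge-wise scores: gather the (query, key) pair for every edge, then take an
# element-wise product and sum over the feature dim -- a dot product per edge.
src, dst = torch.meshgrid(torch.arange(num_nodes), torch.arange(num_nodes), indexing="ij")
edge_index = torch.stack([src.reshape(-1), dst.reshape(-1)])  # fully connected graph
q_i = q[edge_index[0]]                                  # [num_edges, head_dim]
k_j = k[edge_index[1]]                                  # [num_edges, head_dim]
sim_edge = (q_i * k_j).sum(dim=-1) * scale              # [num_edges]

# On a fully connected graph the two formulations give identical scores.
print(torch.allclose(sim_dense.reshape(-1), sim_edge))  # True
```

The practical difference is that the edge-wise form never materializes the full attention matrix: when the graph is sparse it only scores the edges that exist, and the `softmax(sim, index, ptr)` call normalizes scores within each destination node's group of incoming edges rather than over a dense row.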