Hi, Dr. Zhou, I am curious about the design of the attention layer, since these four lines look different from the usual multi-head attention: https://github.com/ZikangZhou/QCNet/blob/55cacb418cbbce3753119c1f157360e66993d0d0/layers/attention_layer.py#L96C1-L99C40 I would also like to ask why you use an element-wise dot product here rather than the usual matrix multiplication, like:
```python
sim = torch.matmul(q_i, k_j.transpose(1, 2)) * self.scale
attn = softmax(sim, index, ptr)
attn = self.attn_drop(attn)
return torch.matmul(attn, v_j)
```
It's scaled dot-product attention.
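For context, here is a minimal sketch of why the two formulations agree. In a PyG-style message-passing layer the scores are computed per edge, so `q_i` and `k_j` are already the (query, key) pair for one edge; an element-wise product summed over the feature dimension is exactly a scaled dot product for that pair. The dense `matmul` form computes the same scores for all pairs at once. The shapes, names, and fully connected `edge_index` below are illustrative assumptions, not the repository's actual API:

```python
import torch

torch.manual_seed(0)
num_nodes, head_dim = 4, 8
q = torch.randn(num_nodes, head_dim)
k = torch.randn(num_nodes, head_dim)
scale = head_dim ** -0.5

# Dense scores: one matmul over all (query, key) pairs.
sim_dense = (q @ k.transpose(0, 1)) * scale            # [num_nodes, num_nodes]

# Edge-wise scores: gather the (query, key) pair for every edge, then take an
# element-wise product and sum over the feature dim -- a dot product per edge.
src, dst = torch.meshgrid(torch.arange(num_nodes), torch.arange(num_nodes), indexing="ij")
edge_index = torch.stack([src.reshape(-1), dst.reshape(-1)])  # fully connected graph
q_i = q[edge_index[0]]                                  # [num_edges, head_dim]
k_j = k[edge_index[1]]                                  # [num_edges, head_dim]
sim_edge = (q_i * k_j).sum(dim=-1) * scale              # [num_edges]

# On a fully connected graph the two formulations give identical scores.
print(torch.allclose(sim_dense.reshape(-1), sim_edge))  # True
```

The practical difference is that the edge-wise form never materializes the full attention matrix: when the graph is sparse it only scores the edges that exist, and the `softmax(sim, index, ptr)` call normalizes scores within each destination node's group of incoming edges rather than over a dense row.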