dongbo811 / AFFormer


Similarity kernel in the FSK module #10

Closed martin-liao closed 7 months ago

martin-liao commented 1 year ago

In the manuscript, you use the key k and value v to compute the self-attention weight matrix, whereas most transformers compute the weight matrix from the query q and key k. There is no difference in the code, but the expression (k and v for computing the self-attention weights) is confusing. Could you explain this in more depth?

LaBiXiaoChai commented 1 year ago

The model does not use standard self-attention, but a modified self-attention with linear complexity. That is why the formula multiplies k and v instead of q and k.
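
For context, here is a minimal sketch of the general idea, assuming a generic linear-attention formulation (in the style of Efficient Attention), not the exact FSK implementation in this repo: instead of forming an n×n score matrix from q and kᵀ, the k and v terms are combined first into a small d×d context matrix, which makes the cost linear in sequence length.

```python
import torch
import torch.nn.functional as F

def standard_attention(q, k, v):
    # Standard formulation: an (n, n) score matrix from q and k, O(n^2).
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

def linear_attention(q, k, v):
    # Linear-complexity variant: k and v are multiplied first,
    # producing a (d, d) context matrix independent of sequence length.
    q = F.softmax(q, dim=-1)   # normalize queries over the feature dim
    k = F.softmax(k, dim=-2)   # normalize keys over the sequence dim
    context = k.transpose(-2, -1) @ v   # (d, d)
    return q @ context

n, d = 1024, 64
q, k, v = (torch.randn(1, n, d) for _ in range(3))
print(linear_attention(q, k, v).shape)  # torch.Size([1, 1024, 64])
```

This is why the paper's formula shows k and v being multiplied: the attention is applied in a reordered, linearized form rather than the standard softmax(qkᵀ)v.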