In the manuscript, you use the key k and value v to calculate the self-attention weight matrix, whereas most transformers use the query and key for this. There is no difference in the code, but the wording (k and v used to compute the self-attention weights) is confusing. Could you explain this in more depth?
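For reference, the conventional formulation I have in mind can be sketched as below. This is only a minimal plain-Python illustration of standard scaled dot-product attention, not the manuscript's code; the function names `softmax`, `attention_weights`, and `attention_output` are hypothetical and chosen just for this example. The point is that the weight matrix depends only on the queries and keys, and the values enter only when the weights are applied.

```python
import math


def softmax(row):
    # Subtract the row maximum before exponentiating for numerical stability.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(Q, K):
    """Weight matrix A = softmax(Q K^T / sqrt(d_k)), one row per query.

    Q and K are lists of equal-length vectors (hypothetical toy inputs).
    Note that V does not appear here at all.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        for q in Q
    ]
    return [softmax(row) for row in scores]


def attention_output(Q, K, V):
    """Apply the weights to the values: each output row is a weighted average of V's rows."""
    A = attention_weights(Q, K)
    d_v = len(V[0])
    return [
        [sum(a * v[j] for a, v in zip(row, V)) for j in range(d_v)]
        for row in A
    ]
```

Under this convention, swapping in a different V changes the output but leaves the weight matrix untouched, which is why describing the weights as computed from k and v reads as a different operation.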