Fantasy-Studio / Paint-by-Example

Paint by Example: Exemplar-based Image Editing with Diffusion Models
https://arxiv.org/abs/2211.13227

Problem with cross-attention during training #43

Open dorianzhang7 opened 1 year ago

dorianzhang7 commented 1 year ago

hello @Fantasy-Studio, I noticed something odd when trying to train the network with the uploaded code. After training for some iterations, I inspected the parameters of the cross-attention modules in the saved model and found that only the to_v weights had changed; the to_k and to_q weights never change, no matter how long the training runs. I therefore recorded the backpropagated gradients of the cross-attention parameters (gradient plots for to_k and to_v, images omitted): the to_k gradients stay at zero while to_v receives non-zero gradients, which is consistent with what I observed.

After debugging the code, I found that the CLIP encoder used in the paper only extracts outputs.pooler_output as the condition, which has shape 1x1024. After the cross-attention projections, the q vector is 4096x40 while the k and v vectors are 1x40. According to the cross-attention formula Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, the product of q and k^T is a 4096x1 vector. Because the softmax is then taken over a dimension of size 1, every entry becomes 1, so the attention mechanism is effectively disabled: the output is simply the v vector and no longer depends on k or q, and to_k and to_q receive no gradient.

This analysis matches my observations. However, when I compared sd-v1-4.ckpt with the pre-trained Paint-by-Example checkpoint uploaded by the authors, I found that the to_k, to_q, and to_v weights of their cross-attention modules are different, which confuses me. Have you encountered the same problem? Thank you very much for your reply!
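
To illustrate the effect, here is a minimal, self-contained sketch (not taken from the repo; the dimensions mirror the numbers above and all names are my own) showing that with a single conditioning token the softmax collapses to 1, the attention output equals the value projection, and the gradients on to_q and to_k are exactly zero:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_model, d_head = 1024, 40
n_query, n_ctx = 4096, 1                 # 4096 spatial tokens, one CLIP pooled token

to_q = nn.Linear(d_model, d_head, bias=False)
to_k = nn.Linear(d_model, d_head, bias=False)
to_v = nn.Linear(d_model, d_head, bias=False)

x = torch.randn(n_query, d_model)        # U-Net feature tokens
cond = torch.randn(n_ctx, d_model)       # stand-in for CLIP outputs.pooler_output

q, k, v = to_q(x), to_k(cond), to_v(cond)
attn = F.softmax(q @ k.t() / d_head ** 0.5, dim=-1)   # shape 4096 x 1
print(attn.min().item(), attn.max().item())           # both 1.0: softmax over a single element

out = attn @ v                                         # identical to broadcasting v to every query
out.sum().backward()

print(to_q.weight.grad.abs().max().item())             # 0.0 -> to_q is never updated
print(to_k.weight.grad.abs().max().item())             # 0.0 -> to_k is never updated
print(to_v.weight.grad.abs().max().item())             # > 0  -> only to_v learns
```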

Does this also happen for other researchers on related topics when they train this part of the code? Thank you for your answers.

dorianzhang7 commented 1 year ago

Since the conditioning vector is a single pooled vector from CLIP, the values of the attention map in the cross-attention network are all equal to 1, so I doubt whether the parameters that produce the attention map (to_q and to_k) can be trained at all. How does the pre-trained model manage to do it?
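
For what it is worth, this is roughly how the checkpoint comparison can be reproduced; the checkpoint filenames and the "attn2" key pattern below are assumptions based on the standard LDM / Stable Diffusion naming, not something verified against this repo:

```python
import torch

# Placeholder paths; point these at the actual checkpoint files.
sd = torch.load("sd-v1-4.ckpt", map_location="cpu")["state_dict"]
pbe = torch.load("paint-by-example.ckpt", map_location="cpu")["state_dict"]

# In the standard LDM U-Net, cross-attention projections are named "...attn2.to_q/to_k/to_v".
for name, w in sd.items():
    if "attn2" in name and any(p in name for p in ("to_q", "to_k", "to_v")):
        if name not in pbe:
            print(f"{name}: missing in Paint-by-Example checkpoint")
        elif w.shape != pbe[name].shape:
            print(f"{name}: shape mismatch {tuple(w.shape)} vs {tuple(pbe[name].shape)}")
        else:
            print(f"{name}: max abs diff = {(w - pbe[name]).abs().max().item():.6f}")
```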