Fantasy-Studio / Paint-by-Example

Paint by Example: Exemplar-based Image Editing with Diffusion Models
https://arxiv.org/abs/2211.13227

Problem with cross-attention during training #43

Open dorianzhang7 opened 1 year ago

dorianzhang7 commented 1 year ago

hello @Fantasy-Studio, I noticed something odd when trying to train the network with the uploaded code. After training for some iterations, I inspected the parameters of the cross-attention modules in the saved model and found that only the to_v weights had changed; the to_k and to_q weights never change, no matter how long the training runs. I therefore recorded the backpropagated gradients of the cross-attention parameters (gradient plots for to_k and to_v, images omitted): the to_k gradients stay at zero while to_v receives non-zero gradients, which is consistent with what I observed.

After debugging the code, I found that the CLIP encoder used in the paper only extracts outputs.pooler_output as the condition, which has shape 1x1024. After the cross-attention projections, the q vector is 4096x40 while the k and v vectors are 1x40. According to the cross-attention formula Attention(Q, K, V) = softmax(Q K^T / sqrt(d)) V, the product of q and k^T is a 4096x1 vector. Because the softmax is then taken over a dimension of size 1, every entry becomes 1, so the attention mechanism is effectively disabled: the output is simply the v vector and no longer depends on k or q, and to_k and to_q receive no gradient.

This analysis matches my observations. However, when I compared sd-v1-4.ckpt with the pre-trained Paint-by-Example checkpoint uploaded by the authors, I found that the to_k, to_q, and to_v weights of their cross-attention modules are different, which confuses me. Have you encountered the same problem? Thank you very much for your reply!
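
To illustrate the effect, here is a minimal, self-contained sketch (not taken from the repo; the dimensions mirror the numbers above and all names are my own) showing that with a single conditioning token the softmax collapses to 1, the attention output equals the value projection, and the gradients on to_q and to_k are exactly zero:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_model, d_head = 1024, 40
n_query, n_ctx = 4096, 1                 # 4096 spatial tokens, one CLIP pooled token

to_q = nn.Linear(d_model, d_head, bias=False)
to_k = nn.Linear(d_model, d_head, bias=False)
to_v = nn.Linear(d_model, d_head, bias=False)

x = torch.randn(n_query, d_model)        # U-Net feature tokens
cond = torch.randn(n_ctx, d_model)       # stand-in for CLIP outputs.pooler_output

q, k, v = to_q(x), to_k(cond), to_v(cond)
attn = F.softmax(q @ k.t() / d_head ** 0.5, dim=-1)   # shape 4096 x 1
print(attn.min().item(), attn.max().item())           # both 1.0: softmax over a single element

out = attn @ v                                         # identical to broadcasting v to every query
out.sum().backward()

print(to_q.weight.grad.abs().max().item())             # 0.0 -> to_q is never updated
print(to_k.weight.grad.abs().max().item())             # 0.0 -> to_k is never updated
print(to_v.weight.grad.abs().max().item())             # > 0  -> only to_v learns
```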

Does this also happen for other researchers on related topics when they train this part of the code? Thank you for your answers.

dorianzhang7 commented 1 year ago

Since the conditioning vector is a single pooled vector from CLIP, the values of the attention map in the cross-attention network are all equal to 1, so I doubt whether the parameters that produce the attention map (to_q and to_k) can be trained at all. How does the pre-trained model manage to do it?
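
For what it is worth, this is roughly how the checkpoint comparison can be reproduced; the checkpoint filenames and the "attn2" key pattern below are assumptions based on the standard LDM / Stable Diffusion naming, not something verified against this repo:

```python
import torch

# Placeholder paths; point these at the actual checkpoint files.
sd = torch.load("sd-v1-4.ckpt", map_location="cpu")["state_dict"]
pbe = torch.load("paint-by-example.ckpt", map_location="cpu")["state_dict"]

# In the standard LDM U-Net, cross-attention projections are named "...attn2.to_q/to_k/to_v".
for name, w in sd.items():
    if "attn2" in name and any(p in name for p in ("to_q", "to_k", "to_v")):
        if name not in pbe:
            print(f"{name}: missing in Paint-by-Example checkpoint")
        elif w.shape != pbe[name].shape:
            print(f"{name}: shape mismatch {tuple(w.shape)} vs {tuple(pbe[name].shape)}")
        else:
            print(f"{name}: max abs diff = {(w - pbe[name]).abs().max().item():.6f}")
```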