Closed: garychan22 closed this issue 1 month ago
Hi, thanks for your excellent work here! I am reading the paper and have a question about the character masks from cross-attention in Fig. 2 of the paper. How can we derive the two masks for the different characters with the prompt "a couple sitting on the grass"? Thanks!
Thanks for your attention. The attention mask is calculated from the image prompt, not the text prompt. Please refer to Section 4.7 of the paper: as shown in Equation 7, the first L tokens of the image prompt c_i represent the background, and each subsequent set of L tokens represents one character. In each image cross-attention layer, we obtain the cross-attention map A of size h × w for each character by summing over all of its L tokens.
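For anyone else reading this thread, here is a minimal sketch (not the repository's actual code, and the function name and tensor layout are assumptions) of what that per-character summation could look like, assuming the image-prompt tokens are ordered as [background, character 1, character 2, ...] with L tokens each:

```python
# Minimal sketch: derive per-character spatial masks from an image
# cross-attention map. Layout assumed: first L tokens = background,
# then L tokens per character. Not the authors' implementation.
import torch

def character_masks_from_attention(attn_probs: torch.Tensor,
                                    h: int, w: int,
                                    num_chars: int, L: int) -> torch.Tensor:
    """
    attn_probs: softmaxed cross-attention of shape (heads, h*w, (1 + num_chars) * L),
                where queries are the h*w image latents and keys are the image-prompt tokens.
    Returns masks of shape (num_chars, h, w), one spatial map per character.
    """
    # Average over attention heads -> (h*w, total_tokens)
    attn = attn_probs.mean(dim=0)

    masks = []
    for c in range(num_chars):
        # Skip the first L background tokens, then take this character's L tokens.
        start = (c + 1) * L
        end = start + L
        # Sum attention over the character's tokens -> one value per spatial location.
        char_map = attn[:, start:end].sum(dim=-1)   # (h*w,)
        char_map = char_map.reshape(h, w)
        # Normalize to [0, 1] so the map can be thresholded into a binary mask if needed.
        char_map = (char_map - char_map.min()) / (char_map.max() - char_map.min() + 1e-6)
        masks.append(char_map)
    return torch.stack(masks, dim=0)                # (num_chars, h, w)
```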
@RedAIGC Oh! Sorry for missing those details, I understand now. Thanks for the reply~