RedAIGC / StoryMaker

StoryMaker: Towards consistent characters in text-to-image generation

about the character masks from cross-attention #11

Closed by garychan22 1 month ago

garychan22 commented 1 month ago

Hi, thanks for your excellent work here! I am reading the paper and have a question about the character masks from cross-attention in Fig. 2. How do you derive two separate masks for the different characters from the prompt "a couple sitting on the grass"? Thanks!

RedAIGC commented 1 month ago

Thanks for your interest. The attention mask is computed from the image prompt, not the text prompt. Please refer to Section 4.7 of the paper: as shown in Equation 7, the first L tokens of the image prompt c_i represent the background, and each subsequent set of L tokens represents one character. In each image cross-attention layer, we obtain an h × w cross-attention map A for each character by summing the attention over that character's L tokens.
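For readers who want to see the token grouping concretely, here is a minimal NumPy sketch of the summation described above. It is not the StoryMaker code: the function name `character_attention_maps`, the argument names, and the assumed input layout (one attention matrix of shape `(h*w, n_tokens)`, already averaged over heads, with the background group first) are all illustrative assumptions.

```python
import numpy as np


def character_attention_maps(attn, tokens_per_group, h, w):
    """Sum image cross-attention over each group of L tokens.

    Hypothetical helper, assuming:
      attn: array of shape (h*w, n_tokens), attention from each spatial
            location to each image-prompt token (heads already averaged).
      tokens_per_group: L, the number of tokens per background/character group.
      Token order: first L tokens = background, then L tokens per character.
    Returns (background_map, character_maps) with shapes (h, w) and
    (n_characters, h, w).
    """
    n_pixels, n_tokens = attn.shape
    if n_pixels != h * w:
        raise ValueError("attn has %d rows, expected h*w = %d" % (n_pixels, h * w))
    if n_tokens % tokens_per_group != 0 or n_tokens < 2 * tokens_per_group:
        raise ValueError("n_tokens must be (1 + n_characters) * tokens_per_group")

    n_groups = n_tokens // tokens_per_group
    # (h*w, n_groups, L) -> sum over the L tokens of each group -> (h*w, n_groups)
    grouped = attn.reshape(n_pixels, n_groups, tokens_per_group).sum(axis=2)
    # Move the group axis first and restore the spatial layout: (n_groups, h, w)
    maps = grouped.T.reshape(n_groups, h, w)
    return maps[0], maps[1:]


# Usage with random stand-in data: 2 characters, L = 4 tokens per group.
rng = np.random.default_rng(0)
attn = rng.random((2 * 3, 3 * 4))  # h=2, w=3, (1 background + 2 characters) * 4 tokens
background, char_maps = character_attention_maps(attn, tokens_per_group=4, h=2, w=3)
print(char_maps.shape)  # (2, 2, 3)
```

The sketch stops at the summed maps, which is all the reply describes; how the maps are turned into masks or aggregated across layers is left to Section 4.7 of the paper.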

garychan22 commented 1 month ago

@RedAIGC Oh! Sorry for missing that detail, I understand now. Thanks for the reply~