Closed by andreemic 7 months ago
Hey, thanks. I may be wrong, as I'm not too familiar with the InstructPix2Pix architecture, but I think focusing on the cross-attention heads between the text embeddings (the keys) and the usual latent embeddings could work. If the attention key vectors are instead a concatenation of text embeddings and, say, image embeddings, then you could look at cross-attention restricted to the text dimensions, i.e., the key positions that belong to text tokens. If the text and image embeddings are inseparable (e.g., multimodal fusion), then that would likely be outside the scope of DAAM/cross-attention and require a separate set of techniques.
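To make the "restricted to the text dimensions" idea concrete, here is a minimal sketch in plain Python. It assumes the keys really are a token-wise concatenation of text keys followed by image keys, which may not match how InstructPix2Pix or ControlNet are actually wired. The names `text_restricted_attention`, `text_keys`, and `image_keys` are hypothetical and not part of DAAM or diffusers.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_restricted_attention(queries, text_keys, image_keys, renormalize=False):
    """Return, per query, the attention weights that fall on text keys only.

    Assumption: keys are concatenated along the token axis, text first.
    The softmax is taken over all keys (as the model would), then sliced.
    """
    keys = text_keys + image_keys
    n_text = len(text_keys)
    scale = 1.0 / math.sqrt(len(queries[0]))  # scaled dot-product attention
    maps = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        text_weights = weights[:n_text]  # keep only the text token columns
        if renormalize:
            total = sum(text_weights)
            text_weights = [w / total for w in text_weights]
        maps.append(text_weights)
    return maps


# Toy example: two latent positions, two text tokens, one image token.
queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[1.0, 0.0], [0.0, 1.0]]
image_keys = [[0.5, 0.5]]

maps = text_restricted_attention(queries, text_keys, image_keys)
maps_norm = text_restricted_attention(queries, text_keys, image_keys, renormalize=True)
```

Without `renormalize`, each row sums to less than 1 because some attention mass went to the image keys. Whether to renormalize before building a heat map is a judgment call: the raw values show how much the model relied on text overall, while the renormalized ones only compare text tokens against each other.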
@andreemic Please let me know if you were able to generate cross-attention maps for IP2P or ControlNet.
I am trying to visualize cross-attention maps for the Stable Diffusion image-to-image pipeline and am facing the same errors.
@daemon I opened a pull request that fixes this. Please have a look.
Hey! Great job on this repo! Very clean documentation and a useful idea.