Closed · devnkong closed this issue 1 year ago
Ah yes, so the first half of the heads in the earlier layers operates on the unconditional latent embeddings (from classifier-free guidance), initialized here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py#L639. Since we only care about the text-conditional embeddings, we throw away those nuisance attention heads. You can verify that this procedure is sensible by visualizing the unconditional heads, e.g., with `map_ = map_[:map_.size(0) // 2]`.
Thank you so much!
Thanks for your great work! I want to know why we need the operation below. We only need half of the attention maps: for example, if we have 8 heads, then `map_.size(0)` below will be 16. But why do we have 16 in the first place, considering we only have 8 heads in each transformer block? Can you show me where diffusers does this? Really confused, thank you! https://github.com/castorini/daam/blob/119d8ff1dd4e61ef579824f3112fb0010eb2fff0/daam/trace.py#L215
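For context, here is a rough sketch of where the doubled batch comes from under classifier-free guidance; the variable names echo the diffusers pipeline, but the snippet is illustrative rather than the actual source:

```python
import torch

# Illustrative shapes only: one prompt, 77 text tokens, 768-dim text embeddings.
negative_prompt_embeds = torch.rand(1, 77, 768)  # embeddings of the empty ("") prompt
prompt_embeds = torch.rand(1, 77, 768)           # embeddings of the actual text prompt

# Classifier-free guidance stacks the unconditional and conditional embeddings
# along the batch dimension, and duplicates the latents to match.
prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])  # (2, 77, 768)
latents = torch.rand(1, 4, 64, 64)
latent_model_input = torch.cat([latents] * 2)                       # (2, 4, 64, 64)

# Inside each cross-attention layer the batch and head dimensions are folded
# together, so with 8 heads the attention maps carry 2 * 8 = 16 slices along dim 0.
```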