Thanks for sharing this great work! Learning a visual token that corresponds to latent features is an interesting idea. I'd like to know more details about $L_{attn}$. There are many cross-attention layers in the UNet; which of them are used to compute $CA(z_t, v)$?
We use all cross-attention layers, averaging their attention scores to compute $CA(z_t, v)$. A more fine-grained layer selection might further improve performance; we leave this for future work.
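For concreteness, below is a minimal PyTorch sketch of what averaging the visual token's attention over all cross-attention layers could look like. The tensor shapes, the common 16×16 resolution, how the per-layer maps are collected, and the MSE form of $L_{attn}$ are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def averaged_cross_attention(attn_maps, token_idx):
    """Average the attention map of the visual token v over all
    cross-attention layers.

    attn_maps: list of per-layer attention tensors, each of (assumed) shape
        (heads, H_l * W_l, num_tokens); spatial size may differ per layer.
    token_idx: position of the visual token v in the conditioning sequence.
    Returns a single (16, 16) map, CA(z_t, v), averaged over heads and layers.
    """
    target = 16  # common resolution to resize every layer's map to (assumed)
    maps = []
    for a in attn_maps:
        m = a[..., token_idx].mean(dim=0)   # average over attention heads
        hw = int(m.numel() ** 0.5)          # assumes a square spatial grid
        m = m.reshape(1, 1, hw, hw)
        m = F.interpolate(m, size=(target, target),
                          mode="bilinear", align_corners=False)
        maps.append(m.squeeze())
    return torch.stack(maps).mean(dim=0)    # average over layers

def attn_loss(attn_maps, token_idx, target_mask):
    """One plausible form of L_attn: push the averaged map CA(z_t, v)
    toward a target mask (the exact loss in the paper may differ)."""
    ca = averaged_cross_attention(attn_maps, token_idx)
    return F.mse_loss(ca, target_mask)
```

The per-layer maps would be captured during the UNet forward pass (e.g., via hooks on the cross-attention modules); resizing to a shared resolution before averaging is one way to reconcile the different spatial sizes across UNet blocks.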