Weili-NLP / UNIMO-G

Visual-Enhanced Learning Loss Calculation #5

Closed by FlyHighest 6 months ago

FlyHighest commented 6 months ago

Thanks for sharing this great work!

Learning a visual token that corresponds to latent features is an interesting idea. I'd like to know more details about $L_{attn}$. The UNet has many cross-attention layers; which of them are used to compute $CA(z_t,v)$?

Weili-NLP commented 6 months ago

We use all cross-attention layers and average their attention scores. A more fine-grained selection might further improve performance; we leave this for future work.
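
For readers who want a concrete picture, here is a minimal sketch of averaging one token's attention map over several cross-attention layers. It is not code from this repository. Cross-attention layers in a UNet operate at different spatial resolutions, and the thread does not say how the maps are brought to a common size, so the nearest-neighbour upsampling below is an assumption. The function name `average_cross_attention` and its parameters are hypothetical, and NumPy stands in for whatever framework the actual training code uses.

```python
import numpy as np

def average_cross_attention(layer_maps, target_hw):
    """Average per-layer attention maps for one token at a common resolution.

    layer_maps: list of 2D arrays, one per cross-attention layer, each of
                shape (h, w) holding that layer's attention scores for the
                visual token over the latent's spatial positions.
    target_hw:  (H, W) common resolution; each h must divide H and each w
                must divide W (an assumption made for this sketch).
    """
    H, W = target_hw
    resized = []
    for m in layer_maps:
        m = np.asarray(m, dtype=np.float64)
        h, w = m.shape
        if H % h or W % w:
            raise ValueError(f"map of shape {(h, w)} does not divide {(H, W)}")
        # Nearest-neighbour upsampling: repeat each cell along both axes.
        up = np.repeat(np.repeat(m, H // h, axis=0), W // w, axis=1)
        resized.append(up)
    # Element-wise mean over layers.
    return np.mean(resized, axis=0)

# Toy example: one coarse 2x2 map and one uniform 4x4 map.
low = np.array([[0.0, 1.0],
                [1.0, 0.0]])
high = np.full((4, 4), 0.5)
avg = average_cross_attention([low, high], (4, 4))
print(avg.shape)
```

A selection over a subset of layers, as suggested above for future work, would amount to passing only those layers' maps in `layer_maps`.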

FlyHighest commented 6 months ago

Thank you!