FoundationVision / LlamaGen

Autoregressive Model Beats Diffusion: 🦙 Llama for Scalable Image Generation
https://arxiv.org/abs/2406.06525
MIT License

Text embedding inject #33

Closed daiyixiang666 closed 3 months ago

daiyixiang666 commented 3 months ago

In your code, you simply concatenate the text embedding with the image token embedding. So my question is: why did you choose this instead of doing cross-attention? Are there any major differences between these two methods?

daiyixiang666 commented 3 months ago

Besides, which precision do you use for the t5-xxl embedding?

PeizeSun commented 3 months ago

Hi~ Our motivation is to do a preliminary exploration toward multimodal foundation models. In such a foundation model, text tokens, image tokens, and even audio tokens are treated identically except for their index in the vocabulary. These tokens interact through self-attention, instead of cross-attention.
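The contrast between the two conditioning schemes can be sketched in a few lines. This is a minimal illustration with made-up dimensions, not LlamaGen's actual code: in (a), the text embeddings are concatenated as a prefix and a single self-attention runs over the joint sequence, so image tokens attend to text tokens through the same mechanism they use for each other; in (b), a separate cross-attention module uses image tokens as queries and text embeddings as keys/values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

# Hypothetical sizes for illustration only.
rng = np.random.default_rng(0)
d, n_text, n_img = 8, 4, 6
text_emb = rng.normal(size=(n_text, d))  # e.g. projected T5 text embeddings
img_emb = rng.normal(size=(n_img, d))    # image token embeddings

# (a) Prefix conditioning: concatenate along the sequence axis and run
# one self-attention over the joint sequence; output covers all tokens.
seq = np.concatenate([text_emb, img_emb], axis=0)  # (n_text + n_img, d)
out_self = attention(seq, seq, seq)                # (n_text + n_img, d)

# (b) Cross-attention: image tokens query the text embeddings in a
# separate module, as in many diffusion models; output covers image tokens.
out_cross = attention(img_emb, text_emb, text_emb)  # (n_img, d)

print(out_self.shape, out_cross.shape)
```

With prefix conditioning there is no extra cross-attention module: text, image, or any other modality's tokens share one sequence and one attention mechanism, which is what makes the design uniform across modalities. (In the actual autoregressive model a causal mask would also be applied; it is omitted here for brevity.)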

We use flan-t5-xl with bf16 precision.