Closed daiyixiang666 closed 3 months ago
Besides, which precision do you use for the t5-xxl embedding?
Hi~ Our motivation is to do a preliminary exploration of multimodal foundation models. In such a foundation model, text tokens, image tokens, even audio tokens are no different from one another except for their index in the vocabulary. These tokens interact through self-attention, instead of cross-attention.
We use flan-t5-xl with bf16 precision.
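To illustrate the difference between the two designs, here is a minimal sketch (not the repository's actual code; the token counts, dimensions, and the omission of Q/K/V projections are simplifications for illustration):

```python
import numpy as np

np.random.seed(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys_values, d):
    # single-head scaled dot-product attention;
    # learned Q/K/V projections omitted for brevity
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

d = 8
text_emb = np.random.randn(5, d)   # 5 text token embeddings (made-up size)
image_emb = np.random.randn(3, d)  # 3 image token embeddings (made-up size)

# Concat + self-attention: one shared sequence, every token
# (text or image) attends to every other token symmetrically.
seq = np.concatenate([text_emb, image_emb], axis=0)
self_out = attention(seq, seq, d)        # shape (8, d)

# Cross-attention, by contrast, fixes the roles: one modality
# supplies the queries, the other supplies keys/values.
cross_out = attention(image_emb, text_emb, d)  # shape (3, d)
```

The sketch makes the asymmetry visible: with concatenation the text tokens also attend to image tokens (and to each other), whereas cross-attention only lets the query modality read from the other one.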
In your code, you simply concatenate the text embeddings with the image token embeddings. So my question is: why did you choose this instead of cross-attention? Are there any major differences between the two methods?