During pre-training, if an image is used as the generation target, learnable queries are fed into the LLM. If an image is used as the input for text generation, its ViT embeddings are fed into the LLM.
For image-caption data, there is a 50% chance of training on text-to-image, in which case the learnable queries are fed into the LLM; for the other 50%, training is on image-to-text, and the ViT embeddings are fed into the LLM. (You can set different probabilities.)
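Roughly, the branching looks like the following sketch. This is not the actual code from the repo, just a minimal illustration; names such as `vit_encoder`, `learnable_queries`, and `p_text2image` are placeholders.

```python
import torch
import torch.nn as nn

class CaptionPairRouter(nn.Module):
    """Illustrative only: per image-caption pair, decide whether the image is a
    generation target (feed learnable queries) or a text-generation input
    (feed ViT embeddings). Module and parameter names are hypothetical."""

    def __init__(self, vit_encoder: nn.Module, num_queries: int = 32,
                 hidden_size: int = 4096, p_text2image: float = 0.5):
        super().__init__()
        self.vit_encoder = vit_encoder          # image encoder producing patch embeddings
        self.p_text2image = p_text2image        # 50/50 by default, configurable
        # Learnable query embeddings that stand in for the image to be generated.
        self.learnable_queries = nn.Parameter(
            torch.randn(num_queries, hidden_size) * 0.02
        )

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        """Return the visual embeddings to splice into the LLM input sequence."""
        if torch.rand(()) < self.p_text2image:
            # Text-to-image: the image is the target, so the LLM only sees
            # the learnable queries at the image position.
            batch = pixel_values.size(0)
            return self.learnable_queries.unsqueeze(0).expand(batch, -1, -1)
        # Image-to-text: the image conditions caption generation, so the LLM
        # sees the ViT patch embeddings.
        return self.vit_encoder(pixel_values)
```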
For interleaved image-text data, we likewise distinguish whether each image is used as input or as the generation target, and adopt different dataloaders accordingly.
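For an interleaved sample, the idea is that each image position gets either the learnable queries or the ViT embeddings depending on its role when the sequence is assembled. Again, this is only a sketch with hypothetical names (`embed_tokens`, `vit_encoder`, `learnable_queries`), not the repository's dataloader.

```python
from typing import List, Tuple, Union
import torch

def build_interleaved_inputs(
    segments: List[Union[torch.Tensor, Tuple[torch.Tensor, bool]]],
    embed_tokens,                       # LLM token embedding layer
    vit_encoder,                        # ViT image encoder
    learnable_queries: torch.Tensor,    # (num_queries, hidden_size)
) -> torch.Tensor:
    """Illustrative only: build the embedding sequence for one interleaved sample.

    `segments` alternates text token-id tensors and (pixel_values, is_target)
    tuples, where `is_target` marks an image that is to be generated."""
    parts = []
    for seg in segments:
        if isinstance(seg, tuple):
            pixel_values, is_target = seg
            if is_target:
                # Image to be generated: splice in the learnable queries.
                parts.append(learnable_queries)
            else:
                # Image used as context for text: splice in its ViT embeddings.
                parts.append(vit_encoder(pixel_values.unsqueeze(0)).squeeze(0))
        else:
            # Plain text segment: embed the token ids.
            parts.append(embed_tokens(seg))
    return torch.cat(parts, dim=0)      # (seq_len, hidden_size) fed to the LLM
```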
Thanks for your great work! I have some questions. How are images processed during the pre-training phase? In the example in Figure 4, why is the input for the second image the learnable queries rather than the ViT embeddings? How are image-caption data and interleaved image-text data handled?