Questions about the model pre-training stage

During pre-training, if an image is used as the generation target, learnable queries are fed into the LLM. If an image is used as the input for text generation, the ViT embeddings are fed into the LLM.

For image-caption data, each sample has a 50% chance of being trained as text-to-image, in which case the queries are fed into the LLM. The other 50% of the time it is trained as image-to-text, in which case the ViT embeddings are fed into the LLM. (You can set different probabilities.)
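The selection logic described above can be sketched roughly as follows. This is only an illustrative sketch, not the actual SEED-X training code: the function names `sample_caption_task` and `choose_llm_image_input`, the task labels, and the `p_text_to_image` parameter are all hypothetical, and the placeholder strings stand in for real tensors.

```python
import random


def sample_caption_task(rng, p_text_to_image=0.5):
    """Pick the training direction for one image-caption sample.

    Hypothetical helper: p_text_to_image is the (configurable) probability
    of training the sample as text-to-image.
    """
    if rng.random() < p_text_to_image:
        return "text_to_image"
    return "image_to_text"


def choose_llm_image_input(task, learnable_queries, vit_embeddings):
    """Return what is fed into the LLM at the image positions.

    - image is the generation target -> learnable queries
    - image is the input for text generation -> ViT embeddings
    """
    if task == "text_to_image":
        return learnable_queries
    if task == "image_to_text":
        return vit_embeddings
    raise ValueError(f"unknown task: {task!r}")


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholders standing in for the real tensors.
    queries, vit = "learnable_queries", "vit_embeddings"
    for _ in range(4):
        task = sample_caption_task(rng)
        print(task, "->", choose_llm_image_input(task, queries, vit))
```

Setting `p_text_to_image` to a value other than 0.5 shifts the mix between the two directions, which corresponds to the note that the probabilities can be changed.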

For interleaved image-text data, we also distinguish whether the image is used as input or as the generation target, and adopt different dataloaders accordingly.
