Have you trained it on any tasks to generate text and an image together? For instance, responding to the instruction with text and then the image token data. This could be useful for managing context over multi-shot instructions, e.g. instructing the model to "think step by step" for CoT reasoning before generating the image, as in the sketch below.
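To make the idea concrete, here is a rough sketch of how such an interleaved training sample could be laid out. Everything in it (the `tokenize()` helper, the `<boi>`/`<eoi>` markers, and the placeholder image tokens) is hypothetical, just to illustrate the layout, not anything from this repo:

```python
# Hypothetical layout of an interleaved text-then-image training sample.
# All special tokens and helpers here are illustrative placeholders.

BOI, EOI = "<boi>", "<eoi>"  # hypothetical begin/end-of-image markers

def tokenize(text):
    # Stand-in for a real text tokenizer.
    return text.split()

def build_interleaved_sample(instruction, cot_text, image_tokens):
    """Lay out instruction -> CoT text -> image tokens in one sequence,
    masking the loss on the instruction so the model only learns to
    produce the reasoning text and then the image."""
    prompt = tokenize(instruction)
    target = tokenize(cot_text) + [BOI] + image_tokens + [EOI]
    input_ids = prompt + target
    # -100 is the conventional "ignore" label in cross-entropy setups
    labels = [-100] * len(prompt) + target
    return input_ids, labels

ids, labels = build_interleaved_sample(
    "Draw a red cube on a table.",
    "Let's think step by step: a cube has six square faces...",
    ["<img_001>", "<img_002>", "<img_003>"],  # VQ codes in practice
)
```

At inference time the model would then emit its CoT text first and only afterwards start the image token stream, so the reasoning stays in context while the image is generated.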
Hi @matbee-eth, currently we only train the model for image generation. Your suggestion is a great idea; we have also considered it and will try multimodal generation models in the future.