Have you trained it on any tasks to generate text and an image together? For instance, responding to the instruction with text and then the image token data. This could be useful for managing context over multi-shot instructions, e.g. instructing the model to "think step by step" for CoT reasoning before generating the image, as in the sketch below.
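To make the idea concrete, here is a rough sketch of how such an interleaved training sample could be laid out. Everything in it (the `tokenize()` helper, the `<boi>`/`<eoi>` markers, and the placeholder image tokens) is hypothetical, just to illustrate the layout, not anything from this repo:

```python
# Hypothetical layout of an interleaved text-then-image training sample.
# All special tokens and helpers here are illustrative placeholders.

BOI, EOI = "<boi>", "<eoi>"  # hypothetical begin/end-of-image markers

def tokenize(text):
    # Stand-in for a real text tokenizer.
    return text.split()

def build_interleaved_sample(instruction, cot_text, image_tokens):
    """Lay out instruction -> CoT text -> image tokens in one sequence,
    masking the loss on the instruction so the model only learns to
    produce the reasoning text and then the image."""
    prompt = tokenize(instruction)
    target = tokenize(cot_text) + [BOI] + image_tokens + [EOI]
    input_ids = prompt + target
    # -100 is the conventional "ignore" label in cross-entropy setups
    labels = [-100] * len(prompt) + target
    return input_ids, labels

ids, labels = build_interleaved_sample(
    "Draw a red cube on a table.",
    "Let's think step by step: a cube has six square faces...",
    ["<img_001>", "<img_002>", "<img_003>"],  # VQ codes in practice
)
```

At inference time the model would then emit its CoT text first and only afterwards start the image token stream, so the reasoning stays in context while the image is generated.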
Hi @matbee-eth, currently we only train the model for image generation. Your suggestion is a great idea; we have also considered it and will try multimodal generation models in the future.