charlesCXK opened 3 days ago
Thanks for your interest in our work. The Emu3 base model is pretrained on a mixture of multimodal sequences (text, images, videos, etc.), making it inherently capable of handling various multimodal tasks such as vision-language understanding and image/video generation. Emu3-Chat and Emu3-Gen are post-trained models, fine-tuned separately for vision-language understanding and visual generation, respectively. We will release a single unified post-trained model that covers both vision-language understanding and visual generation.
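Until the unified post-trained model is released, users have to route each task to the matching checkpoint. A minimal sketch of that routing, assuming the Hugging Face repo IDs `BAAI/Emu3-Chat` and `BAAI/Emu3-Gen` (the IDs are my assumption, not confirmed in this thread):

```python
# Hypothetical task-to-checkpoint routing for the current two-model split.
# The repo IDs below are assumptions; check the official release for the
# actual names. Loading would typically go through
# transformers.AutoModelForCausalLM.from_pretrained(...).
CHECKPOINTS = {
    "understanding": "BAAI/Emu3-Chat",  # post-trained for vision-language understanding
    "generation": "BAAI/Emu3-Gen",      # post-trained for image/video generation
}

def pick_checkpoint(task: str) -> str:
    """Return the checkpoint ID suited to the given task."""
    try:
        return CHECKPOINTS[task]
    except KeyError:
        raise ValueError(
            f"unknown task {task!r}; expected one of {sorted(CHECKPOINTS)}"
        )

print(pick_checkpoint("understanding"))
print(pick_checkpoint("generation"))
```

Once the unified model ships, this routing collapses to a single checkpoint for both tasks.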
Hi, this is excellent work! I have a question: why was the model split into two? Can Emu3-Gen still maintain the same comprehension performance as Emu3-Chat?