baaivision / Emu3

Next-Token Prediction is All You Need
Apache License 2.0
570 stars, 13 forks

What is the difference between Emu3-Chat and Emu3-Gen? #1

Open charlesCXK opened 3 days ago

charlesCXK commented 3 days ago

Hi, this is excellent work! I have a question: why was the model split into two? Can Emu3-Gen still maintain the same comprehension performance as Emu3-Chat?

ryanzhangfan commented 3 days ago

Thanks for your interest in our work. The Emu3 base model is pretrained on a mixture of multimodal sequences (text, images, videos, etc.), which makes it inherently capable of handling various multimodal tasks such as vision-language understanding and image/video generation. Emu3-Chat and Emu3-Gen are post-trained models, specialized separately for vision-language understanding and vision generation, respectively. We will release one unified post-trained model for both vision-language understanding and vision generation.