baaivision / Emu3

Next-Token Prediction is All You Need
Apache License 2.0
1.81k stars, 70 forks

Separate weights for understanding and generation #2

Open QuLiao1117 opened 1 month ago

QuLiao1117 commented 1 month ago

Hi Authors,

Thanks for providing this good work! I am curious why the model weights for generation and understanding are separate. Is there any plan to release a single set of weights that is capable of both tasks?

Best

ryanzhangfan commented 1 month ago

Thanks for your interest in our work. The Emu3 base model is pretrained on a mixture of multimodal sequences (text, images, videos, etc.), which makes it inherently capable of handling various multimodal tasks such as vision-language understanding and image/video generation. Emu3-Chat and Emu3-Gen are post-trained models, tuned separately for vision-language understanding and for vision generation, respectively. We will release one unified post-trained model that covers both vision-language understanding and vision generation.