Open QuLiao1117 opened 1 month ago
Hi Authors,

Thanks for providing this great work! I am curious why the model weights for generation and understanding are separated. Is there any plan to release a single set of weights capable of both tasks?

Best

Thanks for your interest in our work. The Emu3 base model is pretrained on a mixture of multimodal sequences (text, images, videos, etc.), making it inherently capable of handling various multimodal tasks such as vision-language understanding and image/video generation. Emu3-Chat and Emu3-Gen are post-trained models, tuned separately for vision-language understanding and visual generation. We will release one unified post-trained model that covers both vision-language understanding and visual generation.
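For reference, until the unified checkpoint is released, the two tasks are served by separate weights. Below is a minimal sketch of loading each one with `transformers`; the model IDs `BAAI/Emu3-Chat` / `BAAI/Emu3-Gen` and the `trust_remote_code` path are assumptions based on the repo's release naming, not an official snippet.

```python
# Minimal sketch (not an official example): load the two separately released
# post-trained checkpoints. Assumes the Hugging Face model IDs below and that
# the checkpoints ship custom modeling code (hence trust_remote_code=True).
from transformers import AutoModelForCausalLM, AutoTokenizer

# Understanding-oriented checkpoint (vision-language understanding).
chat_model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu3-Chat", trust_remote_code=True, device_map="auto"
)
chat_tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu3-Chat", trust_remote_code=True)

# Generation-oriented checkpoint (image/video generation).
gen_model = AutoModelForCausalLM.from_pretrained(
    "BAAI/Emu3-Gen", trust_remote_code=True, device_map="auto"
)
gen_tokenizer = AutoTokenizer.from_pretrained("BAAI/Emu3-Gen", trust_remote_code=True)
```

A unified release would presumably collapse these into a single checkpoint usable for both prompting styles.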