baaivision / Emu

Emu Series: Generative Multimodal Models from BAAI
https://baaivision.github.io/emu2/
Apache License 2.0

Generate from any prompt sequence #79

Closed: lzw-lzw closed this issue 10 months ago

lzw-lzw commented 10 months ago

Thank you for your excellent work. I noticed that Emu2 can generate images from text, example images, and their corresponding positions all at once (for example, the "Generate from any prompt sequence" part of Figure 1 in the paper), but the training data described in the paper does not seem to include this type of data. Does this ability of Emu2 emerge from other types of data, or have I overlooked some details in the paper? Thanks!

ryanzhangfan commented 10 months ago

Figure 1 has three main parts (separated by dashed-line boxes), which show the results of Emu2, Emu2-Chat, and Emu2-Gen, respectively.

Emu2 can generate images based only on text and image inputs, while Emu2-Gen can generate from any prompt sequence. Please refer to Section 2.3.2 for more details on the datasets used to train Emu2-Gen.