TencentQQGYLab / ELLA

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
https://ella-diffusion.github.io/
Apache License 2.0

Multimodal Large Language Model Support? #15

Open ifsheldon opened 3 months ago

ifsheldon commented 3 months ago

Hi! Great work!

Have you tried using an MLLM as the prompt encoder? We have open-source MLLMs now, and I think this would be an easy but very powerful extension. For example, we could give image prompts without ControlNet or other mechanisms for injecting image information: we just tell the MLLM what we want with text and images, and then SD generates it for us.

Update: I see this is mentioned in the Conclusion and Limitation section. If you can release the training code, the community can probably also explore this direction and adapt various LLMs.
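
To make the suggestion concrete, here is a rough, untested sketch of what I mean (the LLaVA checkpoint, prompt format, and the placeholder connector are all illustrative assumptions on my part, not anything from this repo):

```python
# Rough sketch: use an open-source MLLM as the prompt encoder and map its hidden
# states into the diffusion U-Net's cross-attention space with an ELLA-style connector.
import torch
import torch.nn as nn
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # example MLLM choice
processor = AutoProcessor.from_pretrained(model_id)
mllm = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda")

image = Image.open("reference.png")  # image prompt
prompt = "USER: <image>\nA cat in the style of this reference image. ASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

with torch.no_grad():
    out = mllm(**inputs, output_hidden_states=True)
# Last-layer hidden states over the text + image tokens: (1, seq_len, hidden_dim).
features = out.hidden_states[-1]

# Placeholder for an ELLA-style connector that would be trained to map MLLM features
# to the U-Net cross-attention dimension (2048 for SDXL, 768 for SD 1.5).
connector = nn.Linear(features.shape[-1], 2048).to("cuda")
prompt_embeds = connector(features.float())  # conditioning for the diffusion model
```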

Manni1000 commented 3 months ago

That is a good idea!

budui commented 2 months ago

We added some early results in https://github.com/TencentQQGYLab/ELLA?tab=readme-ov-file#-emma---efficient-multi-modal-adapter-work-in-progress

ifsheldon commented 2 months ago

Thanks! Awesome! These early results seem to show IP-Adapter-like capabilities. It probably also has strong in-context learning ability: given a pair of (original image, target image) as an example, it could learn the edit and apply it to another image.

plastic0313 commented 2 months ago

> Have you tried using an MLLM as the prompt encoder? We have open-source MLLMs now, and I think this would be an easy but very powerful extension. For example, we could give image prompts without ControlNet or other mechanisms for injecting image information: we just tell the MLLM what we want with text and images, and then SD generates it for us.
>
> Update: I see this is mentioned in the Conclusion and Limitation section. If you can release the training code, the community can probably also explore this direction and adapt various LLMs.

Hello, may I ask where this part of the open-source work can be seen? Is there a paper?

plastic0313 commented 2 months ago

> We added some early results in https://github.com/TencentQQGYLab/ELLA?tab=readme-ov-file#-emma---efficient-multi-modal-adapter-work-in-progress

I reproduced a version of ELLA based on SDXL, and the effect is really improved over the original SDXL. I have read your work on multimodal integration; can you briefly describe your approach?

budui commented 2 months ago

> I reproduced a version of ELLA based on SDXL, and the effect is really improved over the original SDXL.

Wow! Can you show some comparisons between the ELLA-SDXL results you reproduced and the original SDXL results?

> can you briefly describe your approach?

EMMA actually uses both text and image embeddings as input for the Connector. The current method is still quite simple and not sufficient to write an 8-page paper, so we are conducting more experiments.
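
To make the idea concrete, a rough sketch of what "both text and image embeddings as input for the Connector" could look like (just an illustration, not our actual implementation; the module choices and dimensions are placeholders):

```python
# Learned queries cross-attend over the concatenation of text-token and image-token
# features, producing a fixed-length conditioning sequence for the U-Net.
import torch
import torch.nn as nn

class MultimodalConnector(nn.Module):
    def __init__(self, text_dim=2048, image_dim=1024, width=768,
                 num_queries=64, num_heads=8, num_layers=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, width)
        self.image_proj = nn.Linear(image_dim, width)
        self.queries = nn.Parameter(torch.randn(num_queries, width) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=width, nhead=num_heads, batch_first=True
        )
        self.resampler = nn.TransformerDecoder(layer, num_layers=num_layers)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, T_text, text_dim); image_feats: (B, T_img, image_dim)
        context = torch.cat(
            [self.text_proj(text_feats), self.image_proj(image_feats)], dim=1
        )
        queries = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        # Queries attend over the joint text + image context (cross-attention).
        return self.resampler(tgt=queries, memory=context)  # (B, num_queries, width)

connector = MultimodalConnector()
cond = connector(torch.randn(1, 77, 2048), torch.randn(1, 256, 1024))
print(cond.shape)  # torch.Size([1, 64, 768])
```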

plastic0313 commented 2 months ago

> Wow! Can you show some comparisons between the ELLA-SDXL results you reproduced and the original SDXL results?
>
> EMMA actually uses both text and image embeddings as input for the Connector. The current method is still quite simple and not sufficient to write an 8-page paper, so we are conducting more experiments.

We have improved the basic ELLA structure and used GenEval for model evaluation. In our reproduction, the improvement on Two Objects and Color Attribution is significant. Is that consistent with your conclusions? In addition, from your description, EMMA feels similar to the idea of M2Chat; maybe we can keep in frequent contact in future work.

ifsheldon commented 2 months ago

> I reproduced a version of ELLA based on SDXL, and the effect is really improved over the original SDXL. I have read your work on multimodal integration; can you briefly describe your approach?

@plastic0313 Can you please share your training scripts? I think it would really unlock a lot of space for the community to explore, say, taking advantage of LLaMa 3.

budui commented 2 months ago

@plastic0313

> In our reproduction, the improvement on Two Objects and Color Attribution is significant. Is that consistent with your conclusions? In addition, from your description, EMMA feels similar to the idea of M2Chat.

This depends on the data you use. Generally speaking, VLM-annotated captions contain a lot of accurate color descriptions, so the performance on color is much better.

> maybe we can keep in frequent contact in future work

Of course, you can contact me through the contact information on my personal website.

rndm-jpg commented 2 months ago

Is there any version of EMMA (even a beta) that we can play with? I would love to learn more about the preliminary image generation results.

martinobettucci commented 2 months ago

> We added some early results in https://github.com/TencentQQGYLab/ELLA?tab=readme-ov-file#-emma---efficient-multi-modal-adapter-work-in-progress

> I reproduced a version of ELLA based on SDXL, and the effect is really improved over the original SDXL. I have read your work on multimodal integration; can you briefly describe your approach?

Any chance you could share your work, @plastic0313? I would love to take it for a test drive :)

Sincerely, best regards.

bghira commented 2 months ago

Maybe I'm the only one, but the lack of weights, experiments, or training logs really makes the SDXL claims hard to believe.

It seems like most people feel the weights don't actually exist, or that they don't behave the way the paper claims. Releasing the weights and training data would help your project here.

George0726 commented 4 weeks ago

> We added some early results in https://github.com/TencentQQGYLab/ELLA?tab=readme-ov-file#-emma---efficient-multi-modal-adapter-work-in-progress

> I reproduced a version of ELLA based on SDXL, and the effect is really improved over the original SDXL. I have read your work on multimodal integration; can you briefly describe your approach?

Would you mind sharing what kind of datasets you used? A 34M dataset is really challenging for me.

bghira commented 3 weeks ago

I am not sure they actually even did what they claim. We have been trying to train it for ~2 months; it just doesn't work for SDXL, since SDXL has two text encoders.

George0726 commented 2 weeks ago

> I am not sure they actually even did what they claim. We have been trying to train it for ~2 months; it just doesn't work for SDXL, since SDXL has two text encoders.

The authors said they used attention pooling to transform the ELLA embedding into the pooled embedding that SDXL expects. I have implemented this and it works. You can give it a try.
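
For anyone who wants to try, here is a minimal sketch of the attention-pooling idea as I understand it (my own reading, not the authors' code; the dimensions and module names are assumptions):

```python
# A single learned query attends over the connector's output tokens to produce the
# pooled vector for SDXL's added-condition path (1280-d `text_embeds`), while the
# full token sequence still feeds the cross-attention layers.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, token_dim=2048, pooled_dim=1280, num_heads=8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, token_dim) * 0.02)
        self.attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(token_dim, pooled_dim)

    def forward(self, tokens):
        # tokens: (B, N, token_dim), the output of the ELLA-style connector.
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)   # (B, 1, token_dim)
        return self.proj(pooled.squeeze(1))        # (B, pooled_dim)

pool = AttentionPool()
tokens = torch.randn(2, 64, 2048)        # connector output for a batch of 2 prompts
pooled_text_embeds = pool(tokens)        # use as SDXL's pooled text embedding
print(pooled_text_embeds.shape)          # torch.Size([2, 1280])
```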

bghira commented 2 weeks ago

just publish your results instead.