Open · ifsheldon opened this issue 3 months ago
that is a good idea!
we add some early results in https://github.com/TencentQQGYLab/ELLA?tab=readme-ov-file#-emma---efficient-multi-modal-adapter-work-in-progress
Thanks! Awesome! These early results seem to show IP-Adapter-like capabilities. It probably also has strong in-context learning ability: given a pair of (original image, target image) as an example, it could learn the transformation and apply it to another image.
Hi! Great work!
Have you tried leveraging an MLLM as the prompt encoder? We have open-source MLLMs now, and I think this would be an easy but very powerful extension. For example, we could give image prompts without ControlNet or other mechanisms to inject image information: we just tell the MLLM what we want with text and images, and then SD generates it for us.
Update: I see this mentioned in the Conclusion and Limitations. If you can release the training code, the community could also explore this direction and adapt various LLMs.
Hello, may I ask where this open-source work can be found? Is there a paper?
we add some early results in https://github.com/TencentQQGYLab/ELLA?tab=readme-ov-file#-emma---efficient-multi-modal-adapter-work-in-progress
I reproduced a version of ELLA based on SDXL, and the results are clearly improved over the original SDXL. I have read your work on multimodal integration; can you briefly describe your approach?
I reproduced a version of ELLA based on SDXL, and the results are clearly improved over the original SDXL.
wow! Can you show some comparisons between the ELLA-SDXL results you reproduced and the original SDXL results?
can you briefly describe your approach?
EMMA actually uses both text and image embeddings as input to the Connector. The current method is still quite simple and not sufficient for an 8-page paper, so we are conducting more experiments.
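For readers trying to picture this, one straightforward reading of "both text and image embeddings as input for the Connector" is to concatenate the two token sequences and let an ELLA-style resampler (learned latent queries cross-attending over the combined context) produce a fixed-length conditioning sequence for the UNet. This is purely my sketch of that reading, not the authors' code; all shapes and names are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def connector(text_tokens, image_tokens, latents, Wk, Wv):
    """One cross-attention step of a resampler-style connector.

    text_tokens:  LLM hidden states, shape (n_text, dim)
    image_tokens: vision-encoder features, shape (n_img, dim)
    latents:      learned queries, shape (n_latent, dim); their updated
                  values become the fixed-length conditioning for the UNet
    Wk, Wv:       learned key/value projections, shape (dim, dim)
    """
    # Mix modalities by simple sequence concatenation.
    ctx = np.concatenate([text_tokens, image_tokens], axis=0)
    scores = latents @ (ctx @ Wk).T / np.sqrt(latents.shape[1])
    return softmax(scores) @ (ctx @ Wv)  # (n_latent, dim)

rng = np.random.default_rng(0)
dim = 768
text = rng.normal(size=(77, dim))      # e.g. one LLM prompt encoding
image = rng.normal(size=(257, dim))    # e.g. ViT patch features
latents = rng.normal(size=(64, dim))
Wk = rng.normal(size=(dim, dim)) * 0.02
Wv = rng.normal(size=(dim, dim)) * 0.02
cond = connector(text, image, latents, Wk, Wv)
assert cond.shape == (64, dim)
```

A real connector would stack several such blocks with feed-forward layers and residual connections; the point here is only the interface: variable-length multimodal context in, fixed-length conditioning out.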
We have improved the basic ELLA structure and used GenEval for model evaluation. According to our reproduction results, the improvement on the Two Objects and Color Attribution categories is significant. Is that consistent with your conclusion? In addition, from your description of EMMA, it sounds similar to the idea of M2Chat; perhaps we can keep in touch for future work.
I reproduced a version of ELLA based on SDXL, and the results are clearly improved over the original SDXL. I have read your work on multimodal integration; can you briefly describe your approach?
@plastic0313 Can you please share your training scripts? I think it would really unlock a lot of space for the community to explore, say, taking advantage of Llama 3.
@plastic0313
According to our reproduction results, the improvement on the Two Objects and Color Attribution categories is significant. Is that consistent with your conclusion? In addition, from your description of EMMA, it sounds similar to the idea of M2Chat,
This depends on the data you use. Generally speaking, VLM-annotated captions contain a lot of accurate color descriptions, so the performance on color is much better.
perhaps we can keep in touch for future work
Of course, you can contact me through the contact information on my personal website.
Is there any version of EMMA (even a beta) that we can play with? I would love to learn more about the preliminary image-generation results.
we add some early results in https://github.com/TencentQQGYLab/ELLA?tab=readme-ov-file#-emma---efficient-multi-modal-adapter-work-in-progress
I reproduced a version of ELLA based on SDXL, and the results are clearly improved over the original SDXL. I have read your work on multimodal integration; can you briefly describe your approach?
Any chance you could share your work, @plastic0313? I would like to take it for a test drive :)
Best regards.
Maybe I'm the only one, but the lack of weights, experiments, or training logs really makes the SDXL claims hard to believe.
It seems like most people feel the weights don't actually exist, or at least don't behave the way the paper claims. Releasing the weights and training data would help your project here.
we add some early results in https://github.com/TencentQQGYLab/ELLA?tab=readme-ov-file#-emma---efficient-multi-modal-adapter-work-in-progress
I reproduced a version of Ella based on SD XL, and the effect is really improved over the original SD XL. I have read your work on multimodal integration, can you briefly describe your approach?
Would you mind sharing what kind of datasets you used? A 34M-sample dataset is really challenging for me.
I am not sure they actually did what they claim. We have been trying to train it for ~2 months; it just doesn't work for SDXL, since SDXL has two text encoders.
I am not sure they actually did what they claim. We have been trying to train it for ~2 months; it just doesn't work for SDXL, since SDXL has two text encoders.
The authors said they used attention pooling to transform the ELLA embeddings into the pooled embedding that SDXL expects. I have implemented this and it works; you can give it a try.
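For anyone else attempting this: attention pooling here presumably means a single learned query attending over the connector's token embeddings to produce the one pooled vector that SDXL's added-conditioning path normally takes from the OpenCLIP-bigG pooled output. A minimal NumPy sketch under that assumption (shapes and names are mine, not the authors'):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(tokens, q, Wk, Wv):
    """Pool a (seq_len, dim) token sequence into a single vector.

    tokens: connector output embeddings, shape (seq_len, dim)
    q:      a single learned query, shape (dim,)
    Wk, Wv: learned key/value projections, shape (dim, dim)
    """
    keys = tokens @ Wk                        # (seq_len, dim)
    values = tokens @ Wv                      # (seq_len, dim)
    scores = keys @ q / np.sqrt(q.shape[0])   # (seq_len,)
    weights = softmax(scores)                 # attention over tokens
    return weights @ values                   # (dim,) pooled embedding

rng = np.random.default_rng(0)
dim, seq_len = 1280, 64        # SDXL's pooled text embedding is 1280-d
tokens = rng.normal(size=(seq_len, dim))
q = rng.normal(size=dim)
Wk = rng.normal(size=(dim, dim)) * 0.02
Wv = rng.normal(size=(dim, dim)) * 0.02
pooled = attention_pool(tokens, q, Wk, Wv)
assert pooled.shape == (dim,)  # stand-in for SDXL's pooled_prompt_embeds
```

The pooling parameters would be trained jointly with the connector; at inference the result replaces the pooled embedding while the full token sequence replaces the per-token context of both text encoders.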
just publish your results instead.