TencentQQGYLab / ELLA

ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
https://ella-diffusion.github.io/
Apache License 2.0
1.1k stars 57 forks

Ella can sometimes make already-correct results much less correct #35

Open Akira13641 opened 7 months ago

Akira13641 commented 7 months ago

Checkpoint: RealCartoon 3D V15

Prompt: Princess Peach is standing next to Tifa Lockhart, they are outside on a summer day, they are wearing bikinis. high quality, best quality, masterpiece

Without ELLA: [image: ComfyUI_14055_]

With ELLA, same seed: [image: ComfyUI_14056_]

Using ELLA in this case turns Princess Peach into a random pink-haired girl instead of the recognizable character.

budui commented 7 months ago

During ELLA's training, a large number of synthetic captions were used, and these typically do not include person or character names. As a result, ELLA performs very poorly when the prompt contains a name. You can try replacing the name with 'a woman' and concatenating the output of CLIP.

jyoung105 commented 7 months ago

@budui Thanks for the explanation. I think there should be ways to generate better synthetic captions, or to mix short captions with long synthetic captions.

andupotorac commented 5 months ago

> During ELLA's training, a large number of synthetic captions were used, and these typically do not include person or character names. As a result, ELLA performs very poorly when the prompt contains a name. You can try replacing the name with 'a woman' and concatenating the output of CLIP.

We want to prevent this issue on our end as well. What did you mean when you said "concatenate the output of CLIP"? Where would the name be introduced again?
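
One plausible reading of the suggestion, sketched here under stated assumptions rather than taken from the ELLA code: ELLA receives a copy of the prompt with the names replaced by a generic phrase, CLIP receives the original prompt with the names intact, and the two conditioning sequences are joined along the token axis before they reach the UNet's cross-attention. Under that reading, the name is reintroduced through the CLIP branch. The helper names (`strip_names`, `concat_conditioning`), the name mapping, and the tensor shapes (64 ELLA tokens, 77 CLIP tokens, 768 channels) are illustrative assumptions, and the zero/one arrays are stand-ins for real encoder outputs.

```python
import re
import numpy as np

# Hypothetical mapping from character names to a generic description.
NAME_TO_GENERIC = {
    "Princess Peach": "a woman",
    "Tifa Lockhart": "a woman",
}

def strip_names(prompt, mapping=NAME_TO_GENERIC):
    """Return the prompt with every known name replaced by a generic phrase.

    This is the version of the prompt that would be sent to ELLA.
    """
    for name, generic in mapping.items():
        prompt = re.sub(re.escape(name), generic, prompt, flags=re.IGNORECASE)
    return prompt

def concat_conditioning(ella_tokens, clip_tokens):
    """Join two conditioning sequences along the token axis.

    Both inputs are shaped (batch, tokens, channels); batch and channel
    sizes must match, while the token counts may differ.
    """
    if ella_tokens.ndim != 3 or clip_tokens.ndim != 3:
        raise ValueError("expected arrays shaped (batch, tokens, channels)")
    if ella_tokens.shape[0] != clip_tokens.shape[0]:
        raise ValueError("batch sizes differ")
    if ella_tokens.shape[2] != clip_tokens.shape[2]:
        raise ValueError("channel sizes differ")
    return np.concatenate([ella_tokens, clip_tokens], axis=1)

original_prompt = (
    "Princess Peach is standing next to Tifa Lockhart, "
    "they are outside on a summer day"
)
ella_prompt = strip_names(original_prompt)  # names removed, for ELLA
clip_prompt = original_prompt               # names kept, for CLIP

# Stand-ins for real encoder outputs (assumed shapes, not measured ones).
ella_tokens = np.zeros((1, 64, 768), dtype=np.float32)
clip_tokens = np.ones((1, 77, 768), dtype=np.float32)

cond = concat_conditioning(ella_tokens, clip_tokens)
print(ella_prompt)
print(cond.shape)  # (1, 141, 768)
```

Cross-attention can generally attend over a context of any length, which is why joining along the token axis is possible at all; whether the combined conditioning actually restores the recognizable character would need to be checked on real generations.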