ELLA generate img quality is worse than origin model

akk-123 commented 7 months ago

ELLA has stronger text understanding capabilities, but its image quality is worse than the original model's.

budui commented 7 months ago

Hi, can you share the base model and prompt you used? We need them to debug ELLA. Additionally, we have updated our README with some tips on how to better utilize ELLA, which you may find helpful.

akk-123 commented 7 months ago

Base model: https://civitai.com/models/15003/cyberrealistic

Prompt: a woman, pink hair, wearing blue baseball cap, wearing sunglasses, green scarf, yellow sweater

I found that a high CFG scale makes the image blurry and noisy; lowering it (for example, cfg=3) gives better image quality.

What's more, ELLA's text understanding does not perform as well in multi-person scenes.

Prompt: a man and a woman, woman standing left, pink hair, wearing blue baseball cap, wearing earrings and necklace, green scarf, yellow sweater; man standing right, wearing sunglasses and blue t-shirt
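For readers wondering why the CFG value matters so much here: a minimal sketch of the standard classifier-free guidance combination, assuming the ELLA pipeline applies the usual Stable Diffusion formulation (the thread does not confirm this). The function name `cfg_combine` and the toy two-element "noise predictions" are illustrative only. The scale multiplies the gap between the conditional and unconditional predictions, so a large value amplifies whatever the text branch contributes, including its errors.

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond).

    eps_uncond / eps_cond are flat lists standing in for the UNet's noise
    predictions without and with the text condition (toy values here).
    """
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


uncond = [0.0, 1.0]
cond = [1.0, 1.0]

# scale = 1 reproduces the conditional prediction unchanged.
print(cfg_combine(uncond, cond, 1.0))
# A low scale such as 3 pushes less far from the unconditional prediction
# than a typical default such as 7.5.
print(cfg_combine(uncond, cond, 3.0))
print(cfg_combine(uncond, cond, 7.5))
```

Only the first element differs between `uncond` and `cond`, so only that element grows with the scale; the second stays fixed.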

scarbain commented 7 months ago

From my testing, plain English prompts help a lot compared to keywords. Also, be careful with the negative prompt you're using and with the generation size; sizes other than 512x512 do not give good results for me.

budui commented 7 months ago

using plain English prompts helps a lot compared to using keywords.

Indeed, because we mainly trained ELLA on synthetic captions that read like plain English.

You can refer to test examples from the community: https://imgur.com/a/FhWpSSb

budui commented 7 months ago

This caption may still be too difficult for ella-sd1.5. 😥

Refined caption by me:

A man and a woman are standing together, the woman on the left and the man on the right. The woman has pink hair, a blue basketball cap, earrings and a necklace, and a green scarf and yellow sweater. Man wearing sunglasses and blue shirt.

Refined caption by Qwen:

A couple stands side by side, creating an eye-catching visual display of contrasting styles. The woman, positioned to the left, sports vibrant pink hair that falls just past her shoulders, complemented by a blue baseball cap adorned with intricate earrings and a necklace. She wraps herself in a green scarf for warmth and wears a sunny yellow sweater, adding pops of color to the scene. On the other hand, the man, standing to the right, exudes coolness with his sunglasses, donning a casual blue t-shirt. Both individuals seem to be enjoying each other's company amidst their unique fashion choices.

Refined caption by GPT4:

A woman with vibrant pink hair stands to the left, her head adorned with a blue baseball cap. She's wearing a pair of shiny earrings and a necklace that glimmers under the light. Wrapped around her neck is a green scarf, contrasting with her bright yellow sweater. To her right stands a man, his eyes hidden behind a pair of dark sunglasses. He's casually dressed in a blue t-shirt, his hands tucked into his pockets.
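The refined captions above were produced by hand or by asking an LLM to rewrite the keyword prompt. A minimal sketch of how such a rewrite request could be assembled is below; the helper name `build_rewrite_request` and the instruction wording are assumptions for illustration, not what was actually sent to Qwen or GPT-4. The "do not add new details" constraint is a design choice of this sketch, since extra or misattached attributes are easy to introduce when expanding tags into prose.

```python
def build_rewrite_request(keyword_prompt: str) -> str:
    """Turn a comma-separated tag prompt into an instruction for an LLM.

    Hypothetical helper: it only builds the text to send; calling a model
    is out of scope here.
    """
    # Split on commas and drop empty fragments such as ", ,".
    tags = [t.strip() for t in keyword_prompt.split(",") if t.strip()]
    instructions = (
        "Rewrite the following image tags as one or two plain English "
        "sentences. Keep every attribute attached to the right subject "
        "and do not add new details.\n"
    )
    return instructions + "\n".join(f"- {t}" for t in tags)


print(build_rewrite_request(
    "a woman, pink hair, wearing blue baseball cap, green scarf, yellow sweater"
))
```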

akk-123 commented 7 months ago

So, looking forward to SDXL!!

scarbain commented 7 months ago

Have you tried fine-tuning SD1.5 while randomly switching between T5 with your adapter and CLIP? Since your adapter only acts on the caption side, maybe a full UNet fine-tune with proper captions (from CogVLM, for example), randomly selecting a different text encoder at each step, could help it even more?

scarbain commented 7 months ago

Of course, we would freeze CLIP and your adapter and only fine-tune the UNet.
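The proposal above can be sketched schematically as follows. This is plain Python with no training framework; the branch names (`clip`, `t5_ella`), the component names, and the 50/50 probability are all assumptions for illustration, not identifiers or settings from the ELLA codebase. It shows only the two ideas in the comments: which parts are frozen, and a per-step random choice of text-conditioning branch.

```python
import random

# Hypothetical branch names for the two conditioning paths.
ENCODERS = ("clip", "t5_ella")


def trainable_flags() -> dict:
    """Only the UNet is updated; both text branches stay frozen."""
    return {"unet": True, "clip": False, "t5": False, "ella_adapter": False}


def pick_encoder(rng: random.Random, p_ella: float = 0.5) -> str:
    """Choose the conditioning branch for one training step."""
    return "t5_ella" if rng.random() < p_ella else "clip"


def schedule(num_steps: int, seed: int = 0, p_ella: float = 0.5) -> list:
    """Return the branch chosen at each step, reproducible for a given seed."""
    rng = random.Random(seed)
    return [pick_encoder(rng, p_ella) for _ in range(num_steps)]


steps = schedule(1000)
print({name: steps.count(name) for name in ENCODERS})
print(trainable_flags())
```

In a real training loop, the chosen branch would produce the text embeddings fed to the UNet's cross-attention for that step, and the optimizer would hold only the UNet parameters.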

budui commented 7 months ago

@scarbain I think randomly selecting text encoder during training is a very interesting idea, which can make UNet adapt to both CLIP and ELLA. However, due to resource constraints and limited usage scenarios, we prefer to train some Adapters instead of fine-tuning UNet. We will try to train ELLA as a branch of IP-Adapter later. Maybe IP-Adapter+ELLA will be more practical?

scarbain commented 7 months ago

It could be interesting to train only a LoRA instead of doing a full fine-tune of the UNet, and to load this LoRA whenever ELLA is used, similar to how an LCM LoRA is used. I'll try that and report back if it gives good results.
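As background for the LoRA idea: a minimal sketch of the generic low-rank update, W' = W + (alpha / r) * B A, using plain lists instead of tensors. It illustrates the general LoRA technique only; the thread does not say which layers, rank, or scaling would be used, so the shapes and values below are toy assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, a, b, alpha):
    """Merge a LoRA update into a frozen weight matrix.

    w: (out x in) frozen weight, a: (r x in), b: (out x r), r = rank.
    """
    r = len(a)
    delta = matmul(b, a)  # (out x r) @ (r x in) -> (out x in)
    scale = alpha / r
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def param_counts(out_dim, in_dim, r):
    """Parameters in a full update versus in the two LoRA factors."""
    return out_dim * in_dim, r * (in_dim + out_dim)


w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]      # rank 1
b = [[0.5], [0.0]]
print(lora_merge(w, a, b, alpha=2.0))
print(param_counts(320, 320, 4))
```

Because only `a` and `b` are trained, the adapter can be shipped separately and loaded on top of any compatible base checkpoint, which is what makes the LCM-LoRA-style usage mentioned above possible.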

andupotorac commented 5 months ago

How did it go?

andupotorac commented 4 months ago

Where are you demoing these refined captions? Is there a Hugging Face demo we can use?

TencentQQGYLab / ELLA

ELLA generate img quality is worse than origin model #20