TonyLianLong / LLM-groundedDiffusion

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models (LLM-grounded Diffusion: LMD, TMLR 2024)
https://llm-grounded-diffusion.github.io/

Some failure cases about attribute assignment #9

Closed XiaominLi1997 closed 1 year ago

XiaominLi1997 commented 1 year ago

Hi, thanks for your nice work. I have tried the demo, and the method does have strong reasoning abilities, but there are some failure cases in attribute assignment.

Given the following response from ChatGPT:

Caption: A cartoon painting of a man in red standing next to another woman in blue
Objects: [('a man in red', [80, 150, 100, 200]), ('a woman in blue', [200, 150, 100, 200])]
Background prompt: A cartoon painting
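For reference, a layout response in this format can be parsed with a small helper like the sketch below (my own illustration based on the fields quoted here, not the repo's actual parser):

```python
import ast

def parse_layout(response: str):
    """Parse an LMD-style LLM response into (caption, objects, background).

    Objects are (name, [x, y, width, height]) tuples, as in the quoted response.
    """
    caption = objects = background = None
    for line in response.splitlines():
        line = line.strip()
        if line.startswith("Caption:"):
            caption = line[len("Caption:"):].strip()
        elif line.startswith("Objects:"):
            # The objects field is a Python-literal list of (name, box) tuples.
            objects = ast.literal_eval(line[len("Objects:"):].strip())
        elif line.startswith("Background prompt:"):
            background = line[len("Background prompt:"):].strip()
    return caption, objects, background

response = """Caption: A cartoon painting of a man in red standing next to another woman in blue
Objects: [('a man in red', [80, 150, 100, 200]), ('a woman in blue', [200, 150, 100, 200])]
Background prompt: A cartoon painting"""

caption, objects, background = parse_layout(response)
print(objects[1])  # ('a woman in blue', [200, 150, 100, 200])
```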

I obtained:

  1. seed=4354 [image]
  2. seed=3628 [image]

What do you think the problem might be?

Thanks

TonyLianLong commented 1 year ago

You might be using our Github code, which had an older copy of our demo. Our Github is now updated, so you can pull to get the latest code.

I tried two things with our demo (https://huggingface.co/spaces/longlian/llm-grounded-diffusion; you can also clone the demo and run it locally: https://huggingface.co/spaces/longlian/llm-grounded-diffusion/tree/main):

  1. I used your prompt but increased the frozen ratio.

[image]

The quality can improve further (by tuning other hyperparameters or changing seeds), but I feel that with a cartoon style SD typically puts in fewer details (e.g., very few details on the faces).

However, the attribute binding should be right most of the time with Standard guidance (the Faster modes use weaker guidance).

  2. I removed the cartoon style.

[image]

Note that the faces are a little weird, but that's a typical SD issue with small faces, which you will also see in the baseline.

XiaominLi1997 commented 1 year ago

Hi, to clarify: the results mentioned above were obtained from running locally.

I also tried your new demo (https://huggingface.co/spaces/longlian/llm-grounded-diffusion), and the results depend on the seed. When I use the same seed as you, I get correct and reasonable results, but when I use a different seed, failure cases occur. Please see the following results.

[image]

Of course, I agree that:

> However, the attribute binding should be right most of the time, with Standard guidance (Faster modes give lower guidance).

Thanks.

TonyLianLong commented 1 year ago

It seems like seed 4354 indeed gives a wrong association. Seed 4353 gives something right (still not beautiful, but at a similar level to the SD baseline).

Seed 4354 gets the association wrong even without the man, so I would attribute this to insufficient text-to-image object association learned during training.

Seed 4354 (only the woman):

[image]

Seed 4353 (only the woman):

[image]

Seed 4353 (man and the woman):

[image]

Longer explanation:

I think your observation can be interpreted this way: SD only sees images and paired captions during training, not instance annotations, so the associations from text to objects/attributes in the image are learned implicitly (without direct supervision).

Since SD is mostly trained on photos, LMD already works in your case with the photo style.

I believe cartoons are only a small fraction of the training data of the original SD (and of GLIGEN, if you use LMD+), so the association is probably only weakly learned in the original SD, and it still misses the association with some seeds.

LMD (stage 2), as a training-free method, can guide the generation toward a specific layout. However, if SD doesn't recognize the text-to-object association (between "a woman in blue" and the cartoon woman in blue), it is hard to convey that information to the model.

A solution is to train a LoRA or fine-tune the model to give it more control. This is how people typically adapt and strengthen SD for a special style/domain. It is also an opportunity for future research (making the association stronger and correct even more often).
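To make the LoRA suggestion concrete, here is a minimal self-contained sketch of the low-rank update behind LoRA in plain PyTorch (my own illustration, not the repo's training code; in practice you would apply this to the attention projections of SD's UNet, e.g. via the peft/diffusers libraries, and train only the LoRA parameters on your cartoon-style data):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(A x)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weights
        # A is initialized randomly, B to zero, so training starts exactly at the base model.
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Usage: wrap a projection layer (e.g. a cross-attention to_q/to_k/to_v in the UNet).
base = nn.Linear(64, 64)
layer = LoRALinear(base, rank=4)
x = torch.randn(2, 64)
# Before any training, the LoRA branch contributes nothing (B is zero).
assert torch.allclose(layer(x), base(x))
```

Only the two small low-rank matrices are trained, which is why LoRA is a cheap way to strengthen a weak text-to-object association in a niche style without touching the base weights.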

XiaominLi1997 commented 1 year ago

Thanks for your helpful explanation, I have no more questions now. I will close this issue.