TonyLianLong / LLM-groundedDiffusion

LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models (LMD)
https://llm-grounded-diffusion.github.io/

Questions about per-box generation process #4

Show-han opened this issue 1 year ago

Show-han commented 1 year ago

I have a small question about the per-box generation process.

I'm curious why it isn't possible to generate multiple objects simultaneously. For example, let's say we have two boxes, A and B, in an image. Couldn't we allow box A to attend to object a and box B to attend to object b? Is there a significant difference in quality compared to the proposed "per-box generation" approach? If we could generate all objects simultaneously, we wouldn't even need to perform the DDIM inversion process.

TonyLianLong commented 1 year ago

As illustrated in the work, cross-attention only provides coarse-grained, semantic-level control. It does not control instances (i.e., how many instances should appear in a region): multiple instances can show up together as long as they fall in the same region. Therefore, it sometimes generates several objects or object parts. Our per-box generation gets around that by generating one instance at a time, as specified in the prompt, and then composing the latents to ensure instance-level control.

The DDIM inversion process is only needed if you scale and shift the box to position it better, since cross-attention does not enforce strict control over object locations (it is enforced on a very low-resolution attention map). Otherwise, the original denoising process can be masked directly instead of using DDIM inversion, as implemented in the demo.
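To make the masked-denoising alternative concrete, here is a minimal sketch (not the repo's actual code) of composing per-box latents with their masks; `per_box_latents`, `box_masks`, and `bg_latents` are hypothetical placeholders.

```python
import torch

# Hypothetical inputs (not from the repo):
#   bg_latents:      latents of the overall-scene denoising run
#   per_box_latents: list of latents, one per box, each denoised for its single object
#   box_masks:       list of binary masks with the same spatial size as the latents

def compose_latents(bg_latents, per_box_latents, box_masks):
    """Paste each per-box latent into the overall-scene latent using its mask."""
    composed = bg_latents.clone()
    for latent, mask in zip(per_box_latents, box_masks):
        composed = composed * (1 - mask) + latent * mask
    return composed

# The composition is only enforced during the first part of the denoising loop;
# afterwards SD denoises freely so the objects blend naturally into the scene.
```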

Feel free to follow up for any questions!

Show-han commented 1 year ago

Thank you for your response. I noticed that in the Hugging Face demo you replace the attention manipulation from your paper with GLIGEN. Since you use a different layout-control method there, are there any quality differences between the two implementations? If possible, could you share the code for the attention manipulation described in your paper? It would be very helpful for my current work :)

TonyLianLong commented 1 year ago

The attention manipulation method, since it is training-free, does not handle instances well. Stable Diffusion has not been trained with instance annotations, so it is not surprising that cross-attention does not differentiate instances, and hence controlling cross-attention does not lead to instance-level control. Our method gets around this problem with per-box generation, so technically we don't need to use GLIGEN, which keeps our method completely training-free.

The reason for not using attention manipulation in the demo is that forward guidance leads to lower generation quality and insufficient control, while backward guidance incurs significantly higher compute costs (more GPU memory and slower generation).

In contrast, GLIGEN is trained on instance-annotated data (i.e., it is training-based w.r.t. SD), so it has better instance-level control. Since it only adds adapter layers, the computational cost is very similar. Nevertheless, it does not guarantee generation/control quality on out-of-distribution data (e.g., if you put attributes only in the GLIGEN condition, it may not be very effective; if you put attributes in the prompt with multiple objects, you may observe semantic leakage).

Therefore, the best approach is really to use both together. To maximize generation quality, I would suggest using GLIGEN to control instances and attention manipulation to minimize semantic leakage and maximize semantic control.
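For reference, a minimal sketch of instance-level control via the `StableDiffusionGLIGENPipeline` in diffusers (this is not the LMD demo code; the checkpoint name, boxes, and argument values are illustrative, and the API may differ across diffusers versions):

```python
import torch
from diffusers import StableDiffusionGLIGENPipeline

# Illustrative checkpoint; the demo may use a different one.
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

# Keep attributes in the overall prompt (for cross-attention semantics) and use
# boxes/phrases for instance placement; boxes are normalized [x0, y0, x1, y1].
image = pipe(
    prompt="a photo of a gray cat and an orange dog in a park",
    gligen_phrases=["a gray cat", "an orange dog"],
    gligen_boxes=[[0.1, 0.4, 0.45, 0.9], [0.55, 0.4, 0.9, 0.9]],
    gligen_scheduled_sampling_beta=1.0,
    num_inference_steps=50,
).images[0]
```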

To manipulate attention, you can take a look at the code here: https://huggingface.co/spaces/longlian/llm-grounded-diffusion/blob/main/models/attention_processor.py#L465

You can either modify this part to manipulate the cross-attention the way you wish (note that you also need to mask [EOT] in addition to the word tokens), or retrieve the cross-attention maps and implement a loss, backpropagating to the latents for gradient descent. Note that you can turn GLIGEN on or off with no impact on the attention control.
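As a rough illustration of the second option (backward guidance), the sketch below computes a loss on retrieved cross-attention maps and takes a gradient step on the latents; `get_attn_maps` is a hypothetical hook standing in for whatever attention-retrieval mechanism you wire in, not a function from the linked file.

```python
import torch

def backward_guidance_step(unet, latents, t, text_emb, get_attn_maps,
                           box_masks, token_indices, step_size=0.02):
    """One gradient-descent step on the latents: run the UNet, retrieve the
    cross-attention recorded by a hook, and push each object's attention
    (word tokens plus [EOT]) inside its box.

    get_attn_maps: hypothetical hook returning (heads, H*W, num_tokens)
                   cross-attention; it must not detach from the graph
    box_masks:     list of (H*W,) binary masks, one per box
    token_indices: list of token-index lists, one per box
    """
    latents = latents.detach().requires_grad_(True)
    unet(latents, t, encoder_hidden_states=text_emb)  # forward pass fills the hook
    attn = get_attn_maps()

    loss = 0.0
    for mask, indices in zip(box_masks, token_indices):
        obj_attn = attn[:, :, indices].mean(dim=(0, 2))  # (H*W,)
        # Penalize attention mass that falls outside the box.
        loss = loss + (1.0 - (obj_attn * mask).sum() / (obj_attn.sum() + 1e-8))

    grad = torch.autograd.grad(loss, latents)[0]
    return (latents - step_size * grad).detach()
```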

This implementation supports SD v1.4 to v2.0 (and should also support older versions). It also supports FlashAttention on layers where you don't need the attention maps.

I can also help you if you tell me a little about your work by sending me an email :)

Show-han commented 1 year ago

Thank you very much for your thoughtful response. Finally, I would like to ask for one more suggestion: if I'm not too concerned about the specific locations of the generated instances or their attributes, and I simply want all of the objects mentioned in my prompt to appear naturally in the generated image (since the original SD model tends to omit certain objects), what methods can I try? I have attempted attention manipulation techniques like Attend-and-Excite, but they haven't performed well.

TonyLianLong commented 1 year ago

Our method (stage 1 and stage 2 together) achieves what you describe. Stage 1 generates each object in a reasonable position, and stage 2 guides the generation by placing the objects from stage 1 into the image (it only guides the first few steps and then lets SD produce a natural generation from there). No object positions need to be specified by the user.
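To summarize the two-stage flow just described in pseudocode (every function name here is a hypothetical placeholder, not the repo's API):

```python
# Stage 1: an LLM turns the text prompt into a layout (boxes + per-box prompts),
# so the user never has to specify object positions themselves.
layout = llm_generate_layout("a realistic photo of a cat and a dog playing")  # hypothetical

# Stage 2: generate each box's object separately, compose the latents for the
# first few steps, then let SD denoise freely so every object appears naturally.
per_box_latents = [generate_single_object(box) for box in layout.boxes]  # hypothetical
image = compose_and_denoise(per_box_latents, layout)                     # hypothetical
```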

Note that attention manipulation only controls the semantics, so you may get object parts, etc. That's why we go for per-box generation and composing the instances.

Training a GLIGEN variant that does not take input locations as the condition would also address your problem, whereas you need object positions (e.g., the ones from stage 1) to use the pre-trained GLIGEN.

Show-han commented 1 year ago

Thanks very much. I will look through the code carefully before asking follow-up questions 🤔 I also think controllable generation for Stable Diffusion is meaningful and impactful; maybe we can stay in touch, share ideas, and collaborate in future work.