elixir-nx / bumblebee

Pre-trained Neural Network models in Axon (+ 🤗 Models integration)
Apache License 2.0

Add Stable Diffusion ControlNet #359

Closed · joelpaulkoch closed 2 months ago

joelpaulkoch commented 4 months ago

I want to share my work on using ControlNet with Stable Diffusion. There are three parts, each with notes on current limitations:

  1. ControlNet model

    • there is a conditioning scale parameter in the diffusers implementation; at the moment I hardcode it to 1. Would you add it as an (optional) input? (See the sketch after this list for where it would slot in.)
    • regarding testing: there is "hf-internal-testing/tiny-controlnet", but it only returned zeros for me (both in diffusers and Bumblebee), so I kept the test using "lllyasviel/sd-controlnet-scribble".
  2. UNet

    • I added a new model architecture, :with_additional_residuals, with separate input and core functions. In the end, though, the only difference is that additional residuals are passed in and added, so this could alternatively live in the :base architecture as optional inputs plus the corresponding add layers.
  3. Stable Diffusion with ControlNet

    • Similarly, I've copied the existing Stable Diffusion implementation and adapted it to support the ControlNet. It might be better to fold this into the existing StableDiffusion module.
    • the current implementation accepts a u8 tensor of the correct size as the conditioning image. Preprocessing converts the tensor to f32. It might make sense to be more lenient, e.g. resize the conditioning image as part of preprocessing (?)
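
To make the conditioning scale question from point 1 concrete, here is a rough sketch of one denoising step as I picture it. The function and input names are illustrative only, not the exact ones in this PR:

```elixir
# Illustrative only: the ControlNet produces residuals from the
# conditioning image, and those residuals get scaled before being
# added to the UNet down/mid block states.
{down_residuals, mid_residual} =
  controlnet_fun.(controlnet_params, %{
    "sample" => latents,
    "timestep" => timestep,
    "encoder_hidden_state" => text_embedding,
    "conditioning" => conditioning_image
  })

# This is where the conditioning scale would apply; right now it is
# effectively a constant 1.
down_residuals = Enum.map(down_residuals, &Nx.multiply(&1, conditioning_scale))
mid_residual = Nx.multiply(mid_residual, conditioning_scale)

noise_prediction =
  unet_fun.(unet_params, %{
    "sample" => latents,
    "timestep" => timestep,
    "encoder_hidden_state" => text_embedding,
    "additional_down_block_states" => down_residuals,
    "additional_mid_block_state" => mid_residual
  })
```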

I've tried all the ControlNets listed here with the corresponding example and got sensible results for all but the normal map one. I'm not sure what the issue with the normal map is; I could imagine it's the preprocessing, or that I simply did not run enough steps.

jonatanklosko commented 4 months ago

Hey @joelpaulkoch, thanks for the PR! I will have a more detailed look later, for now a couple high-level comments :)

there is a conditioning scale parameter in the diffusers implementation; at the moment I hardcode it to 1.

Having it as an optional serving input sounds good (similar to how we have :seed, for example).
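
Something along these lines from the user's perspective (just a sketch; :conditioning_scale as the input name and 1 as the default are suggestions, not existing API):

```elixir
# Assuming `serving` is the ControlNet text_to_image serving from this
# PR, the scale would ride along in the input map, like :seed does:
Nx.Serving.run(serving, %{
  prompt: "a fantasy landscape, high quality",
  conditioning: conditioning_image,
  # optional; defaults to 1, i.e. full ControlNet influence
  conditioning_scale: 0.8,
  seed: 0
})
```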

I added a new model architecture, :with_additional_residuals, with separate input and core functions.

Since the difference is only in the inputs, I would totally just have them as optional inputs in the :base architecture, yeah. This also more closely matches what diffusers does.
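
As a sketch of what I mean (a construction-time simplification; the actual optional-input handling in Bumblebee/Axon may look different):

```elixir
# The :base UNet would grow optional residual inputs, and each block
# would sum its residual in only when one is present at graph
# construction time.
sum_additional_residual = fn
  hidden_state, nil -> hidden_state
  hidden_state, residual -> Axon.add(hidden_state, residual)
end

# ...then inside the down/mid block wiring:
# hidden_state = sum_additional_residual.(hidden_state, additional_residual)
```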

Similarly, I've copied the existing Stable Diffusion implementation and adapted it to support the ControlNet. It might be better to fold this into the existing StableDiffusion module.

I think a separate module makes sense. In general I would have one module per diffusion type (SD, SD ControlNet, SD XL, ...) and then a serving function per task (currently only text_to_image, but it could also be image_to_image, and so on). This would roughly correspond to diffusers, such that a serving function maps to a pipeline class and a module to their pipeline grouping directory.
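
Roughly (module name and signature indicative only):

```elixir
defmodule Bumblebee.Diffusion.StableDiffusionControlNet do
  # One serving function per task, mirroring diffusers' pipeline classes;
  # it takes the usual SD components plus the ControlNet.
  def text_to_image(encoder, unet, controlnet, vae, tokenizer, scheduler, opts \\ []) do
    # build and return the Nx.Serving
  end

  # later possibly: image_to_image, inpaint, ...
end
```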

Preprocessing converts the tensor to f32.

I see diffusers has VaeImageProcessor, though if in this case it always comes down to converting to f32, it's probably fine for this to just be a function.
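
Probably something as small as this (sketch; assumes the conditioning image arrives as a u8 tensor and only needs to land in [0, 1] as f32):

```elixir
preprocess_conditioning = fn image ->
  image
  |> Nx.as_type(:f32)
  |> Nx.divide(255)
end

# e.g. an all-white conditioning image becomes an all-ones f32 tensor
preprocess_conditioning.(Nx.broadcast(Nx.tensor(255, type: :u8), {512, 512, 3}))
```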

jonatanklosko commented 2 months ago

We could share more logic between the servings, but it's fine for now; we can refactor once there are more :)

jonatanklosko commented 2 months ago

Btw, I updated the tests to use tiny checkpoints and generated reference values using hf/diffusers :)