AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

[Feature Request]: Magicmix support #4538

Open jyapayne opened 1 year ago

jyapayne commented 1 year ago

Is there an existing issue for this?

What would your feature do?

Provide style transfer based on a prompt. Based on this paper: https://magicmix.github.io/

Example code implementation here: https://github.com/mpaepper/stablediffusion_magicmix

Proposed workflow

  1. Go to img2img

  2. Select an image like normal

  3. Select a script called "MagicMix" or "StyleTransfer" or similar

  4. Input a prompt that you want the image to be like

  5. Select from the following inputs (taken from the ipynb above; see the UI sketch after this list):

    • nu (ν in the paper): controls how much the prompt should overwrite the original image in the initial layout phase. If your result is too close to the original image, try increasing this parameter.
    • total_steps (can use Sampling Steps already in the img2img tab): number of inference steps for Stable Diffusion.
      • This could be split into min_steps and max_steps (or expressed as ratios) for more control. The paper recommends min_steps = 0.3*total_steps and max_steps = 0.6*total_steps, so those could be the defaults.
    • guidance_scale (can use CFG Scale already in the img2img tab): the classifier-free guidance scale. The higher it is set, the more it drives the result towards your prompt.
    • The paper has one more input that is not in the notebook: the s (attention map scale) parameter, with a value between -2 and 2. It appears to add the prompt to the image when positive and remove it when negative. I'm not sure how to use it because I don't understand how the paper defines an attention map or how to apply s to it. Any tips?
  6. Hit generate and wait
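
For illustration, here is a very rough sketch of how those inputs could be exposed through the webui's custom-script API (modules.scripts). The class name, slider labels, and ranges are placeholders I made up for this example, and run() is left as a stub: the actual MagicMix latent injection would still have to be wired into the sampler.

    import gradio as gr
    import modules.scripts as scripts
    from modules.processing import process_images

    class MagicMixScript(scripts.Script):  # hypothetical name, illustration only
        def title(self):
            return "MagicMix"

        def show(self, is_img2img):
            # Only list the script in the img2img tab, matching the proposed workflow.
            return is_img2img

        def ui(self, is_img2img):
            # nu: how strongly the running result overwrites the re-noised original (paper default 0.9).
            nu = gr.Slider(minimum=0.0, maximum=1.0, step=0.05, value=0.9, label="nu")
            # Layout-phase bounds as fractions of total steps (paper recommends 0.3 and 0.6).
            min_ratio = gr.Slider(minimum=0.0, maximum=1.0, step=0.05, value=0.3, label="min_steps ratio")
            max_ratio = gr.Slider(minimum=0.0, maximum=1.0, step=0.05, value=0.6, label="max_steps ratio")
            return [nu, min_ratio, max_ratio]

        def run(self, p, nu, min_ratio, max_ratio):
            # Sampling Steps and CFG Scale come from the normal img2img fields (p.steps, p.cfg_scale).
            # The MagicMix injection itself would need to hook the sampling loop before this call.
            return process_images(p)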

Edit: you don't input an image for style transfer, but a prompt. Reworded and added extra information.

Additional information

No response

kostyanchik94 commented 1 year ago

Good job, man! Implement your idea as a custom script and it will be great.

aleksusklim commented 1 year ago

As I understand the .ipynb code, the method comes down to merging the current image at each step with the original picture:

    # Layout-phase bounds and the sampler indices they cover (timesteps run from noisy to clean).
    t_min = round(0.3 * total_steps)
    t_max = round(0.6 * total_steps)
    layout_steps = list(range(total_steps - t_max, total_steps - t_min))

    # VAE-encode the input image once and fix the noise used to re-noise it.
    encoded = pil_to_latent(input_image)
    noise = torch.randn_like(encoded)

    for i in layout_steps:
      t = scheduler.timesteps[i]
      # Re-noise the original image to the current noise level...
      noisy_latents = scheduler.add_noise(encoded, noise, timesteps=torch.tensor([t]))
      if fine_tuned is not None:
        # ...and blend it with the result of the previous denoising step.
        noisy_latents = nu * fine_tuned + (1-nu) * noisy_latents
(Where fine_tuned is the result of the previous denoising step; nu=0.9.)

So it's like img2img, but the image is injected at several steps, not just at the first one.
Also, I think the range of steps used for image injection should be configurable (not the constant 0.3 hard-coded here).

total_steps (= Steps) and guidance_scale (= CFG Scale) are already available in the WebUI, so only two new parameters are needed: nu and t_min.
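
For anyone who wants to prototype this outside the WebUI first, here is a minimal sketch of the whole loop as I read the notebook, written against a diffusers-style pipeline. The model id, prompt, file name, and helper names are placeholders, and the exact scheduler calls are my assumptions rather than the notebook's code.

    import numpy as np
    import torch
    from PIL import Image
    from diffusers import DDIMScheduler, StableDiffusionPipeline

    torch.set_grad_enabled(False)

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
    pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # plain DDIM keeps the loop simple
    scheduler, unet, vae = pipe.scheduler, pipe.unet, pipe.vae

    total_steps, nu, guidance_scale = 50, 0.9, 7.5
    t_min, t_max = round(0.3 * total_steps), round(0.6 * total_steps)
    layout_lo, layout_hi = total_steps - t_max, total_steps - t_min  # same index convention as above
    scheduler.set_timesteps(total_steps)

    def encode_prompt(text):
        tokens = pipe.tokenizer(text, padding="max_length", truncation=True,
                                max_length=pipe.tokenizer.model_max_length,
                                return_tensors="pt").input_ids.to("cuda")
        return pipe.text_encoder(tokens)[0]

    text_emb = torch.cat([encode_prompt(""), encode_prompt("a coffee machine")])  # [uncond, cond]

    # VAE-encode the layout image (placeholder file name) and fix the noise used to re-noise it.
    img = Image.open("layout.jpg").convert("RGB").resize((512, 512))
    image = torch.from_numpy(np.array(img)).float().div(127.5).sub(1).permute(2, 0, 1).unsqueeze(0).to("cuda")
    encoded = vae.encode(image).latent_dist.sample() * 0.18215
    noise = torch.randn_like(encoded)

    latents = None
    for i, t in enumerate(scheduler.timesteps):
        if i < layout_lo:
            continue  # generation only begins at the first layout step
        if i < layout_hi:
            # Layout phase: re-noise the original to this noise level and blend it in.
            # nu weights the emerging prompt-driven result; (1 - nu) re-injects the original layout.
            noised_orig = scheduler.add_noise(encoded, noise, t.unsqueeze(0))
            latents = noised_orig if latents is None else nu * latents + (1 - nu) * noised_orig
        # Ordinary classifier-free-guided denoising step towards the prompt (runs in both phases).
        inp = scheduler.scale_model_input(torch.cat([latents, latents]), t)
        noise_pred = unet(inp, t, encoder_hidden_states=text_emb).sample
        uncond, cond = noise_pred.chunk(2)
        noise_pred = uncond + guidance_scale * (cond - uncond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    result = vae.decode(latents / 0.18215).sample  # image tensor in [-1, 1]

With this layout, nu plus the two bounds are the only MagicMix-specific knobs, which matches the idea of exposing just nu and t_min (with t_max either fixed or also made configurable).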

jyapayne commented 1 year ago

@aleksusklim there is also an extra parameter, s, that is not in the ipynb; I've described it above. But I don't understand how to use it. Maybe someone else can figure it out and explain.

phazei commented 1 year ago

Another implementation I found:

https://github.com/cloneofsimo/magicmix