AUTOMATIC1111 / stable-diffusion-webui

Stable Diffusion web UI

[Feature Request]: Magicmix support #4538

Open jyapayne opened 1 year ago

jyapayne commented 1 year ago

Is there an existing issue for this?

What would your feature do?

Provide style transfer based on a prompt. Based on this paper: https://magicmix.github.io/

Example code implementation here: https://github.com/mpaepper/stablediffusion_magicmix

Proposed workflow

  1. Go to img2img

  2. Select an image like normal

  3. Select a script called "MagicMix" or "StyleTransfer" or similar

  4. Input a prompt that you want the image to be like

  5. Select from the following inputs (taken from the ipynb above; see the UI sketch after this list):

    • nu (ν in the paper): controls how much the prompt should overwrite the original image in the initial layout phase. If your result is too close to the original image, try increasing this parameter.
    • total_steps (can use Sampling Steps already in the img2img tab): number of inference steps for Stable Diffusion.
      • This could be split into min_steps and max_steps (or expressed as ratios) for more control. The paper recommends min_steps = 0.3*total_steps and max_steps = 0.6*total_steps, so those could be the defaults.
    • guidance_scale (can use CFG Scale already in the img2img tab): the classifier-free guidance scale. The higher it is set, the more it drives the result towards your prompt.
    • The paper has one more input that is not in the notebook: the s (attention map scale) parameter, with a value between -2 and 2. It appears to add the prompt to the image when positive and remove it when negative. I'm not sure how to use it because I don't understand how the paper defines an attention map or how to apply s to it. Any tips?
  6. Hit generate and wait
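
For illustration, here is a very rough sketch of how those inputs could be exposed through the webui's custom-script API (modules.scripts). The class name, slider labels, and ranges are placeholders I made up for this example, and run() is left as a stub: the actual MagicMix latent injection would still have to be wired into the sampler.

    import gradio as gr
    import modules.scripts as scripts
    from modules.processing import process_images

    class MagicMixScript(scripts.Script):  # hypothetical name, illustration only
        def title(self):
            return "MagicMix"

        def show(self, is_img2img):
            # Only list the script in the img2img tab, matching the proposed workflow.
            return is_img2img

        def ui(self, is_img2img):
            # nu: how strongly the running result overwrites the re-noised original (paper default 0.9).
            nu = gr.Slider(minimum=0.0, maximum=1.0, step=0.05, value=0.9, label="nu")
            # Layout-phase bounds as fractions of total steps (paper recommends 0.3 and 0.6).
            min_ratio = gr.Slider(minimum=0.0, maximum=1.0, step=0.05, value=0.3, label="min_steps ratio")
            max_ratio = gr.Slider(minimum=0.0, maximum=1.0, step=0.05, value=0.6, label="max_steps ratio")
            return [nu, min_ratio, max_ratio]

        def run(self, p, nu, min_ratio, max_ratio):
            # Sampling Steps and CFG Scale come from the normal img2img fields (p.steps, p.cfg_scale).
            # The MagicMix injection itself would need to hook the sampling loop before this call.
            return process_images(p)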

Edit: you don't input an image for style transfer, but a prompt. Reworded and added extra information.

Additional information

No response

kostyanchik94 commented 1 year ago

Good job, man! Implement your idea as a custom script and it will be great.

aleksusklim commented 1 year ago

As I understand the .ipynb code, the method comes down to merging the current image at each step with the original picture:

    # Layout-phase bounds and the sampler indices they cover (timesteps run from noisy to clean).
    t_min = round(0.3 * total_steps)
    t_max = round(0.6 * total_steps)
    layout_steps = list(range(total_steps - t_max, total_steps - t_min))

    # VAE-encode the input image once and fix the noise used to re-noise it.
    encoded = pil_to_latent(input_image)
    noise = torch.randn_like(encoded)

    for i in layout_steps:
      t = scheduler.timesteps[i]
      # Re-noise the original image to the current noise level...
      noisy_latents = scheduler.add_noise(encoded, noise, timesteps=torch.tensor([t]))
      if fine_tuned is not None:
        # ...and blend it with the result of the previous denoising step.
        noisy_latents = nu * fine_tuned + (1-nu) * noisy_latents
(Where fine_tuned is the result of the previous denoising step; nu=0.9.)

So it's like img2img, but the image is injected at several steps, not just at the first one.
Also, I think the range of steps used for image injection should be configurable (not the constant 0.3 hard-coded here).

total_steps (= Steps) and guidance_scale (= CFG Scale) are already available in the WebUI, so only two new parameters are needed: nu and t_min.
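
For anyone who wants to prototype this outside the WebUI first, here is a minimal sketch of the whole loop as I read the notebook, written against a diffusers-style pipeline. The model id, prompt, file name, and helper names are placeholders, and the exact scheduler calls are my assumptions rather than the notebook's code.

    import numpy as np
    import torch
    from PIL import Image
    from diffusers import DDIMScheduler, StableDiffusionPipeline

    torch.set_grad_enabled(False)

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
    pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)  # plain DDIM keeps the loop simple
    scheduler, unet, vae = pipe.scheduler, pipe.unet, pipe.vae

    total_steps, nu, guidance_scale = 50, 0.9, 7.5
    t_min, t_max = round(0.3 * total_steps), round(0.6 * total_steps)
    layout_lo, layout_hi = total_steps - t_max, total_steps - t_min  # same index convention as above
    scheduler.set_timesteps(total_steps)

    def encode_prompt(text):
        tokens = pipe.tokenizer(text, padding="max_length", truncation=True,
                                max_length=pipe.tokenizer.model_max_length,
                                return_tensors="pt").input_ids.to("cuda")
        return pipe.text_encoder(tokens)[0]

    text_emb = torch.cat([encode_prompt(""), encode_prompt("a coffee machine")])  # [uncond, cond]

    # VAE-encode the layout image (placeholder file name) and fix the noise used to re-noise it.
    img = Image.open("layout.jpg").convert("RGB").resize((512, 512))
    image = torch.from_numpy(np.array(img)).float().div(127.5).sub(1).permute(2, 0, 1).unsqueeze(0).to("cuda")
    encoded = vae.encode(image).latent_dist.sample() * 0.18215
    noise = torch.randn_like(encoded)

    latents = None
    for i, t in enumerate(scheduler.timesteps):
        if i < layout_lo:
            continue  # generation only begins at the first layout step
        if i < layout_hi:
            # Layout phase: re-noise the original to this noise level and blend it in.
            # nu weights the emerging prompt-driven result; (1 - nu) re-injects the original layout.
            noised_orig = scheduler.add_noise(encoded, noise, t.unsqueeze(0))
            latents = noised_orig if latents is None else nu * latents + (1 - nu) * noised_orig
        # Ordinary classifier-free-guided denoising step towards the prompt (runs in both phases).
        inp = scheduler.scale_model_input(torch.cat([latents, latents]), t)
        noise_pred = unet(inp, t, encoder_hidden_states=text_emb).sample
        uncond, cond = noise_pred.chunk(2)
        noise_pred = uncond + guidance_scale * (cond - uncond)
        latents = scheduler.step(noise_pred, t, latents).prev_sample

    result = vae.decode(latents / 0.18215).sample  # image tensor in [-1, 1]

With this layout, nu plus the two bounds are the only MagicMix-specific knobs, which matches the idea of exposing just nu and t_min (with t_max either fixed or also made configurable).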

jyapayne commented 1 year ago

@aleksusklim there is also an extra parameter, s, that is not in the ipynb; I've described it above. But I don't understand how to use it. Maybe someone else can figure it out and explain.

phazei commented 1 year ago

Another implementation I found:

https://github.com/cloneofsimo/magicmix