city96 / SD-Latent-Interposer

A small neural network to provide interoperability between the latents generated by the different Stable Diffusion models.
Apache License 2.0

SD 1.x to SDXL refiner #4

Open holwech opened 8 months ago

holwech commented 8 months ago

Hey! Very cool that you've made this! I tried to combine your converter with SD 1.x and the SDXL refiner, but so far I haven't had much luck. Is this something you've managed to do successfully?

Here is the code I've used to combine SD 1.x and the SDXL refiner:

https://colab.research.google.com/drive/1lUHih8KsSGuKFTfYBz0I-6FkMEU5GkdP?usp=sharing

Here is an example of what I get out from the refiner atm:

image

city96 commented 8 months ago

Hi! I did a quick test and using the refiner with 1.5 works with the ComfyUI node, meaning the issue is somewhere else.

Could it be a scaling issue? Just looking at the code, it looks like you're passing the scaled latents directly to the refiner. Try modifying the code like this:

```py
scaled_latents = 1 / 0.18215 * latents
sdxl_scaled_latents = convert(scaled_latents.to(dtype=torch.float32), "v1", "xl", torch.float32, torch_device)
sdxl_latents = 0.18215 * sdxl_scaled_latents
```
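
(For the notebook, the same three steps could be wrapped in a small helper, something like the sketch below; `convert` is the interposer function with the signature you're already using, and the 0.18215 factors just mirror the lines above.)

```py
import torch

def v1_to_xl(latents, device="cuda"):
    # diffusers latents already carry the 0.18215 scale factor, so undo it first
    raw = latents.to(dtype=torch.float32) * (1 / 0.18215)
    # interposer conversion on the un-scaled latents, as in the lines above
    xl_raw = convert(raw, "v1", "xl", torch.float32, device)
    # re-apply the scale factor before handing the latent to the refiner
    return 0.18215 * xl_raw
```
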
holwech commented 8 months ago

I made a simplified notebook to limit the number of potential issues. Still having the same problem as in the previous notebook unfortunately.

I'm not very familiar with ComfyUI and I don't know exactly how you connect the two models, so it's hard for me to pinpoint the issue. Could you share some details on how you connected SD1.5 and the refiner, and/or have a look at the simplified notebook to see if there are any obvious issues?

Could it be that the interposer was trained on a specific VAE and the default VAE for SD1.5 isn't compatible?

city96 commented 8 months ago

Your notebook asks me to log in, which I assume means it's set to private. Could you check the visibility settings?

ComfyUI is just a node-based frontend to the LDM code; internally it uses the same models etc. as diffusers, so that shouldn't matter in this case.

Here is a quick and dirty example of the refiner being connected to the output of a 1.5 model. (Officially, this isn't quite correct, since you're supposed to return the noisy latent at around 80% denoise, then pass it to the refiner for the final 20%, but it works as an example here.)

EXAMPLE
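
(For reference, the "official" base=>refiner handoff in diffusers looks roughly like the sketch below, with the base stopping at 80% and the refiner finishing the last 20% from the noisy latent. This uses the stock SDXL base model rather than the v1 setup we're discussing, and assumes a diffusers version that has the `denoising_end`/`denoising_start` options.)

```py
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a photo of an astronaut riding a horse"

# Base model handles the first ~80% of the schedule and returns a noisy latent
noisy_latent = base(
    prompt, num_inference_steps=25, denoising_end=0.8, output_type="latent"
).images

# Refiner picks up the noisy latent and finishes the last ~20%
image = refiner(
    prompt, num_inference_steps=25, denoising_start=0.8, image=noisy_latent
).images[0]
```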

I don't think it's a VAE incompatibility issue either; the encoder part is the same for all v1.5 VAEs as far as I know.

I can try to write some example code for how to use this with diffusers if you want. I still suspect it's a scaling issue.

city96 commented 8 months ago

Not great, but it works. Oddly enough the v1 pipe doesn't have a `denoising_end` option, but you can just use a custom sampler like you were doing in your original notebook to do a partial denoise.

Code below:

```py
import torch
from diffusers import StableDiffusionPipeline, StableDiffusionXLPipeline

# Load pipelines
pipe = StableDiffusionPipeline.from_single_file(
    r"D:\Software\AI\sd-models\checkpoints\mix\Silicon29_dark.safetensors",
    load_safety_checker=False,  # takes forever to download
    torch_dtype=torch.float16,
)
pipe.enable_xformers_memory_efficient_attention()

refiner = StableDiffusionXLPipeline.from_single_file(
    r"D:\Software\AI\sd-models\checkpoints\sd\sdxl_v1.0_refiner.safetensors",
    torch_dtype=torch.float16,
)
refiner.enable_xformers_memory_efficient_attention()

# Generate image on SDv1
pipe.to("cuda")
scaled_latent = pipe(
    prompt,
    height=1024,
    width=1024,
    output_type="latent",
    # denoising_end=0.90,  # doesn't work on v1
    num_inference_steps=20,
).images[0]
del pipe  # free VRAM

# Convert latent
latent = scaled_latent * (1 / 0.18215)
xl_latent = convert_latent(latent, "v1", "xl")  # code for the interposer, from your notebook
xl_scaled_latent = xl_latent * 0.18215

# Finish with refiner
refiner.to("cuda")
image = refiner(
    prompt=prompt,
    image=xl_scaled_latent,
    denoising_start=0.90,
    num_inference_steps=20,
).images[0]
del refiner  # free VRAM

image.show()
```
holwech commented 8 months ago

Awesome! Thanks for the thorough answer. It definitely seems like the issue was the scaling. With your code I got some more acceptable output.

I made the notebook public, so it should be possible to view it now.

In the notebook I made a simple test and I'm curious to get your opinion on whether this is the expected quality or not.

```py
import requests
import torch
from PIL import Image
from io import BytesIO
import torchvision.transforms as transforms
from diffusers.image_processor import VaeImageProcessor
import gc
from diffusers import AutoencoderKL

generator = torch.manual_seed(0)
response = requests.get("https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg")
init_image = Image.open(BytesIO(response.content)).convert("RGB")
init_image = init_image.resize((768, 512))

# Processing
sd_vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae", variant="fp16", torch_dtype=torch.float16).to("cuda")
vaeImageProcessor = VaeImageProcessor(2 ** (len(sd_vae.config.block_out_channels) - 1))
init_pre_image = vaeImageProcessor.preprocess(init_image).to(dtype=torch.float16, device="cuda")

# Encode
sd_latents = sd_vae.encode(init_pre_image).latent_dist.sample(generator)
#sd_latents = sd_latents * (1/0.18215)

# Convert
sdxl_latents = convert(sd_latents, "v1", "xl", torch.float16, "cuda").to(dtype=torch.float32)
sdxl_latents = sdxl_latents * 0.18215

# Decode
sdxl_vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae").to("cuda")
image_tensor = sdxl_vae.decode(sdxl_latents / sdxl_vae.config.scaling_factor, return_dict=False)[0]

# Post-processing
image = vaeImageProcessor.postprocess(image=image_tensor.detach())[0]
image
```

Input image:

image

Output image:

image

As you can see, it has some artifacts. I could've done something wrong there though, as I inferred a lot of the steps from the diffusers library and it has a lot of stuff going on.

holwech commented 8 months ago

Here is an example from the notebook with 80% steps on SD1.x and 20% on the refiner. Not getting great results unfortunately :(

SD1.5 output: image

Refiner output: image

city96 commented 8 months ago

That quality looks similar to what I get, maybe a bit worse, but that could be from running it in FP16. The interposer is a tiny model, so I'd recommend keeping the cast you had in the first notebook and running it in FP32, though I'm not sure how much that changes. It could also be a clamping difference on the output, hardware differences, etc.

(Also noticed you were using the default XL VAE. I usually use this one since it lets me use FP16, though there's no noticeable difference in terms of visual quality.)
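
(Concretely, the FP32 cast I mean above is roughly the sketch below, with `convert` being the interposer function and the signature taken from your notebook:)

```py
import torch

# Run the tiny interposer in FP32, then cast back for the rest of the FP16 pipeline.
sdxl_latents = convert(
    sd_latents.to(dtype=torch.float32),  # upcast before the interposer
    "v1", "xl", torch.float32, "cuda",
).to(dtype=torch.float16)                # back to FP16 for the SDXL side
```
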

Doing v1=>xl is a lot harder than xl=>v1 because the XL latent contains more information than the v1 latent, so I could never get it 100% perfect; it has to "make up" fake details to fit the format, I'm pretty sure.

For the generation example, I think the image degradation you're seeing might come from passing a fully denoised latent into the refiner. As I noted above, there's no `denoising_end` option for v1 to get the noisy latent, so you'd have to do what you did in your first notebook: use a custom sampler and stop it a few steps before the final one (rough sketch below). You could also try euler a for the refiner, which adds noise at every step, so it might alleviate it a bit.
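
Rough sketch of what I mean (untested, reusing `pipe` and `prompt` from the code above, and assuming a diffusers version that still accepts the older `callback` / `callback_steps` arguments on the v1 pipeline; the pipe still runs to the end, we just keep the intermediate latent):

```py
captured = {}
num_steps = 20
stop_at = int(num_steps * 0.8)  # hand off to the refiner at ~80% denoise

def grab_latent(step, timestep, latents):
    # old-style diffusers callback: (step_index, timestep, latents)
    if step == stop_at:
        captured["latent"] = latents.detach().clone()

_ = pipe(
    prompt,
    height=1024,
    width=1024,
    num_inference_steps=num_steps,
    output_type="latent",  # skip the final VAE decode, we don't need the image
    callback=grab_latent,
    callback_steps=1,
).images[0]

noisy_scaled_latent = captured["latent"]  # still noisy, still carries the 0.18215 factor
# un-scale, convert with the interposer, re-scale, then pass to the refiner
# with a matching denoising_start (0.8 here), as in the code above
```
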

Again, I'm just guessing. You could also do a 3-stage thing (rough sketch below) where the initial image is v1 at 512x512, then upscale it and run it through v1 at 1024x1024 before sending it to the refiner; v1 doesn't like generating at resolutions that high natively. (xl=>v1 is simpler since v1 handles the 1024x1024 image from xl nicely; it's basically img2img at a low denoise there, so no weird hires repetition problems appear.)
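
A very rough sketch of that 3-stage idea (untested; it reuses `pipe`, `prompt` and `convert_latent` from the code above, before the `del pipe`, and uses a plain PIL resize where a proper upscaler would do better):

```py
from diffusers import StableDiffusionImg2ImgPipeline

# Stage 1: native v1 generation at 512x512
base_image = pipe(prompt, height=512, width=512, num_inference_steps=20).images[0]

# Stage 2: upscale, then partial-denoise img2img at 1024x1024 with the same v1 weights
img2img = StableDiffusionImg2ImgPipeline(**pipe.components)
upscaled = base_image.resize((1024, 1024))
hires_scaled_latent = img2img(
    prompt,
    image=upscaled,
    strength=0.5,           # keep the composition, only refine details
    num_inference_steps=20,
    output_type="latent",
).images[0]

# Stage 3: un-scale, run the interposer, re-scale, then hand off to the refiner
latent = hires_scaled_latent * (1 / 0.18215)
xl_scaled_latent = convert_latent(latent, "v1", "xl") * 0.18215
```
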

TomLucidor commented 7 months ago

@city96 thank you for the great work; I hope there will be a new version with fewer artifacts, though latent space expansion is indeed a tough problem. A small question: can a LoRA or embedding be transferred the same way?

@holwech could you try splits between 70-30, 80-20, and 90-10 to see whether the issue is "too much" or "not enough"?