[Discussion] SDXL Turbo mode

Danamir commented 11 months ago

I played with the new SDXL Turbo checkpoint this morning, and here are a few notes relevant to krita-ai-diffusion.

StableDiffusion SDXL Turbo

The base model from Stability AI (available here) works blazingly fast at 1 to 4 steps and 1.0 CFG. But it is limited to SD 1.5-like resolutions, i.e. around 512x512. Many samplers don't agree too well with it, and the negative prompts are completely disabled (at CFG 1.0 they have no effect).

On the positive side ControlNet is working, but can be difficult to fine-tune because of the few steps involved. With a hires step, its quality is better than LCM mode in my humble opinion. Still, it would be quite a pain to handle in the krita plugin, because of the resolution limitation and the subsequent upsampling needed.

Custom-made models

That being said, a better solution seems readily available. Many SDXL models merged with Turbo are popping up everywhere. I tested TurboVisionXL (since it is based on DynaVision, which is quite good).

The major problem above is solved: the base resolution is that of SDXL (around 1024x1024), and you just have to set it to something like 6 steps and CFG 1.0 with the LCM sampler to get it working right now. It works best with DPM++ SDE (Karras or normal), which is not available directly in the plugin. I tried it with a minor code modification, and the results are really good for this speed.

There is still the problem of the low step count limiting the range of usable denoise values, but still.
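
To make the denoise limitation concrete, here is a minimal sketch. It assumes one common img2img convention in which only `floor(steps * denoise)` steps of the schedule are actually run; the function names are hypothetical, and the plugin or ComfyUI may map strength to steps differently.

```python
def executed_steps(total_steps: int, denoise: float) -> int:
    """Steps actually run for an img2img pass (assumed convention: floor(steps * denoise))."""
    # Small epsilon guards against float error landing just below an integer.
    return int(total_steps * denoise + 1e-9)


def distinct_strengths(total_steps: int) -> list:
    """Denoise values at which one more step starts being executed."""
    return [n / total_steps for n in range(1, total_steps + 1)]


if __name__ == "__main__":
    for steps in (6, 20):
        print(steps, "steps ->", [round(d, 2) for d in distinct_strengths(steps)])
    # Under this assumption, 6 steps at 0.1 denoise runs nothing, while 20 steps still runs some.
    print(executed_steps(6, 0.1), executed_steps(20, 0.1))
```

With only 6 steps there are just 6 usable strength levels, so fine adjustments at low denoise are not possible under this convention.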

Implementation in krita-ai-diffusion

Just adding the DPM++ SDE sampler seems to be the quickest way to use the Turbo models. (Note: when the advanced samplers are added, there may be a need to set the min steps in the style definition).

The live mode could eventually get an option to use the base Turbo model at 512x512 and 1 step; it is ridiculously fast.

On a future note, I'm not sure if the Turbo mode can be converted to a LoRA as was the case with LCM. If that's the case, a simple LoRA injection would be enough to activate a turbo mode. Otherwise a full model download may be needed.

Danamir commented 11 months ago

Well, look what just appeared : SDXL Turbo-LoRA 😅

Acly commented 11 months ago

The LoRA doesn't work at 1 step, but results at 3/4 seem good.

I'm wondering what effect SDTurboScheduler actually has, it doesn't seem to do much?

Danamir commented 11 months ago

I'm quite disappointed with the LoRA, I can't get results approaching TurboVisionXL. I'll give the LoRA makers a few days to see if a better version emerges.

I tried SDTurboScheduler and the SamplerCustom, versus a normal step and KSamplerAdvanced. The results were a little bit different, but not really better or worse. The ControlNet effects seemed to be stronger with the SamplerCustom tho.

Acly commented 11 months ago

If the LoRA works out, it would simply replace LCM, assuming it provides a better quality/speed trade-off. At least on first impression, the full Turbo models do.

If not, it's in a difficult spot where it's not clear whether the improved quality makes up for the loss of flexibility.

Danamir commented 11 months ago

Okay, I found settings that work for me with the Turbo LoRA: DPM++ SDE sampler (Karras or not), LoRA at 40%, 6 steps, 2.5 CFG scale.

It works with the previous LoRA, but even better with this one, PAseer-SDXL-LCM and Turbo, which mixes LCM and Turbo: LoRA at 50%, 6 steps, 3.0 CFG.

The quality is pretty equivalent, but the latter has better prompt coherency at SDXL resolution, and thus suffers less from the "long limbs syndrome" in portrait mode, and from the "duplicate syndrome" in landscape mode.

Strangely enough, the Turbo LoRA seems to strengthen any style applied, whereas the LCM+Turbo one seems to weaken it.

I'm also seeing a dip in performance when generating with a Turbo model or a normal model + Turbo LoRA. I'm down to 1.20 it/s from 2.3 it/s. Almost twice as slow! But since I use 3x fewer steps, there is still an improvement. Do you have the same results speed-wise? It may be on my system only.
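
As a quick sanity check of that trade-off, here is a small sketch of the arithmetic. The 6-step count comes from the settings above; 18 steps for the normal run is an assumption derived from "3x fewer steps", and `generation_time` is a hypothetical helper that ignores model loading and VAE decoding.

```python
def generation_time(steps: int, its_per_second: float) -> float:
    """Approximate sampling time in seconds (sampling loop only)."""
    return steps / its_per_second


normal = generation_time(18, 2.3)  # assumed normal run: 18 steps at 2.3 it/s
turbo = generation_time(6, 1.2)    # Turbo run: 6 steps at 1.20 it/s

if __name__ == "__main__":
    print(f"normal: {normal:.2f} s, turbo: {turbo:.2f} s, speed-up: {normal / turbo:.2f}x")
```

Under these assumptions the Turbo run still finishes sooner, even at roughly half the iteration rate.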

Some tests (left: normal, middle: LCM+Turbo, right: Turbo):

[image: forge]

[image: forge_2]

Danamir commented 11 months ago

FYI, the same prompts with TurboVisionXL. The quality is very good, the speed is the same as with the LoRAs, and it also suffers from the long limbs syndrome:
[images: forge_3, forge_4]

Acly commented 11 months ago

> I'm also seeing a dip in performance when generating with a Turbo model or a normal model + Turbo LoRA. I'm down to 1.20 it/s from 2.3 it/s. Almost twice as slow!

I see no speed difference between models/LoRAs. Are you comparing DPM++ SDE to DPM++ 2M SDE? The former evaluates the model twice at each step, so it's always twice as slow. But it typically needs only half the steps.
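
The explanation above can be checked with a few lines. This is only a sketch: the evaluation counts are taken from the comment itself, the names are hypothetical, and the rate is assumed to scale inversely with evaluations per step.

```python
# Model evaluations per sampling step, as stated in the comment above.
EVALS_PER_STEP = {"DPM++ SDE": 2, "DPM++ 2M SDE": 1}


def model_evaluations(sampler: str, steps: int) -> int:
    """Total number of model calls for a full sampling run."""
    return EVALS_PER_STEP[sampler] * steps


def expected_its(reference_its: float, reference_sampler: str, sampler: str) -> float:
    """Predicted it/s for `sampler`, given a measured rate for `reference_sampler`."""
    return reference_its * EVALS_PER_STEP[reference_sampler] / EVALS_PER_STEP[sampler]


if __name__ == "__main__":
    # 2.3 it/s measured with DPM++ 2M SDE predicts about 1.15 it/s with DPM++ SDE.
    print(expected_its(2.3, "DPM++ 2M SDE", "DPM++ SDE"))
    print(model_evaluations("DPM++ SDE", 6), model_evaluations("DPM++ 2M SDE", 18))
```

The predicted 1.15 it/s is close to the reported 1.20 it/s, which is consistent with the slowdown coming from the sampler rather than from the Turbo model or LoRA.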

Your parameters work for me too, but it's interesting how quickly quality deteriorates if you go outside the "good" range.

- SDE/Ancestral samplers seem to work a lot better (all the converging samplers give poor results in comparison for me)
- The "slow" 2-evaluation samplers seem to work better (when comparing them using half the steps)

E.g. DPM++ 2S Ancestral is also quite nice. It also works to increase the LoRA weight and decrease CFG and steps proportionally (it gets continually faster, at the cost of quality and control). But at no point does the LoRA become as good as the Turbo model for 1-step generation.

Danamir commented 11 months ago

> Are you comparing DPM++(SDE) to DPM++2M(SDE)

~~Nop, that was DPM++ SDE everywhere, it may have something to do with my ComfyUI install.~~

[edit]: Haha, good call! I checked my workflow twice, and it was indeed in DPM++ 2M SDE. 😅 In my defense, I was working on my experimental branch, where I can activate a split rendering that forces the start to be in DPM++ SDE before switching to the sampler selected in the style; I just forgot I had deactivated it before testing the Turbo mode.

> It also works to increase Lora weight, and decrease CFG+steps proportionally

Yeah that was my approach. I started at 100% lora, and very low CFG + steps, then I increased / decreased the values accordingly.

Danamir commented 11 months ago

After some more testing, I'm not sure I like the LCM + Turbo LoRA anymore. It's more stable at SDXL resolution, but you lose a lot of creativity. All the results seem to be variations of the same pose, and the styles are much less prevalent.

A good solution seems to be the first Turbo LoRA at 50%, DPM++ SDE sampler, 6 steps, 1.5 CFG (with this LoRA, the low CFG does not seem to impact the contrast as much as with the other one, and it reduces the long limbs appearing), and most importantly: 768px base instead of 1024px (just set the SDXL resolution * 0.75).

It kinda makes sense, since this is a 512 LoRA mixed with a 1024 model. 😅
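
For non-square SDXL resolutions, the 0.75 factor could be applied along these lines. This is a sketch, not the plugin's code: `scale_resolution` is a hypothetical helper, and snapping to a multiple of 8 is an assumption based on the 8x downscale of the latent space.

```python
def scale_resolution(width: int, height: int, factor: float = 0.75, multiple: int = 8):
    """Scale a resolution and snap each side to the nearest multiple of `multiple`."""
    def snap(value: int) -> int:
        return max(multiple, round(value * factor / multiple) * multiple)

    return snap(width), snap(height)


if __name__ == "__main__":
    print(scale_resolution(1024, 1024))  # square SDXL base
    print(scale_resolution(1216, 832))   # a landscape SDXL resolution
```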

Danamir commented 11 months ago

Rahh, I don't know anymore. 😬 The best option may be to give the user the choice of turbo method.

Last round of samples, for the road. The LCM+Turbo feels slightly closer to the original picture.

SDE, 18 steps, 7.0 CFG:
[image: grid-normal]

SDE, 50% LCM+Turbo, 2.5 CFG:
[image: grid-lcm-turbo-2 5cfg]

SDE, 50% Turbo, 1.5 CFG:
[image: grid-turbo-1 5cfg]
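
If the user were given the choice of turbo method, the settings from this thread could be collected as presets, roughly like this. The class and key names are hypothetical and not part of the plugin; the values are the ones quoted above, with 6 steps assumed for both LoRA variants.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TurboPreset:
    sampler: str
    steps: int
    cfg: float
    lora_weight: Optional[float] = None  # None means no LoRA is injected
    resolution_scale: float = 1.0        # multiplier applied to the SDXL base resolution


# Values as reported in the comments above; not tuned or verified here.
PRESETS = {
    "normal": TurboPreset("DPM++ SDE", 18, 7.0),
    "turbo_lora": TurboPreset("DPM++ SDE", 6, 1.5, lora_weight=0.5, resolution_scale=0.75),
    "lcm_turbo_lora": TurboPreset("DPM++ SDE", 6, 2.5, lora_weight=0.5),
    "merged_checkpoint": TurboPreset("DPM++ SDE", 6, 1.0),
}

if __name__ == "__main__":
    for name, preset in PRESETS.items():
        print(name, preset)
```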

Acly commented 11 months ago

All of [50% Turbo, 50% Turbo+LCM, Merge a la TurboVision] seem roughly equivalent to me, perhaps some trade-offs here and there, but they work in the same realm regarding step count, CFG, speed, quality. I'd give the edge to Turbo+LCM Lora only because it seems a bit more flexible regarding resolutions.

Adding the DPM++ SDE sampler is a very low-effort way to support them. For the Turbo base model and LoRA we'd ideally also figure out a way to detect them and make them prefer 512x512-ish resolutions, but I don't see that being worth it for now.

I did some testing with the turbo base model for Live mode:

- 1-step 1-CFG txt2img is super fast, good quality, but not very useful (no image input)
- 1-step 1-CFG txt2img with ControlNet works, but not very well, and performance drops by a factor of 2 - slower than 6-step LCM!
- ~3-step ~1.5 CFG img2img adapts poorly to image content; too few steps, I think
- I can't get good results with more steps using regular samplers. Maybe the special Turbo sampler is needed, but I have no idea how to use it for img2img.
- 6-step ~1.5 CFG LCM sampler with Turbo (LoRA or base): this works. It essentially allows you to use an SDXL checkpoint to generate at 512x512 with the same speed as SD1.5 LCM. Quality feels worse though. It is very hard to compare, since you either compare SD1.5 with SDXL, or you compare 512 with 1024.

Acly commented 10 months ago

Closing:

- Turbo models can be used efficiently now
- Require manual resolution configuration, but I don't know how it could be detected
- LCM is a better fit for Live (due to better img2img at low samples)
