Fixed the use of undenoised latent from the DPM++ scheduler

czkoko commented 6 months ago

The latent returned by the step function in DPM++ scheduler is not denoised. It seems to be the prediction of the next time step.
I don't know much about it. I don't know if this fix is right, but it does solve the picture quality problem.
The following is the picture quality comparison of DPM++ 2M Karras before and after the fixed.

316493631-11d1cbbb-c5af-48ad-9507-903f0233a170 316493584-314ba683-59f6-4bc1-9cde-2f5108f5adc9

nosferatu500 commented 6 months ago

LGTM

Now I see why I have no issue in my app you described a while ago, I DO exactly the same in my own implementation.

czkoko commented 6 months ago

LGTM

Now I see why I have no issue in my app you described a while ago, I DO exactly the same in my own implementation.

Yes, this problem seems to have existed since Guernika came out. I took a lot of detours before I finally found out that the problem was here.

SpiraMira commented 6 months ago

LGTM Now I see why I have no issue in my app you described a while ago, I DO exactly the same in my own implementation.

Yes, this problem seems to have existed since Guernika came out. I took a lot of detours before I finally found out that the problem was here.

LGTM

Great bug fix! I just checked the hugging face Diffusion application and the denoised latent is derived the same way.

again, great find.

GuiyeC commented 6 months ago

@czkoko thanks for the PR, I'm not being able to reproduce this with either SD1.5 or SDXL, I don't see any real difference in the outputs, maybe you could clarify how you generated those results, also if this is a fix for one specific scheduler I think it would make sense to implement it in Schedulers, it feels really hacky having it here.

@SpiraMira could you explain what you meant with "the hugging face diffusion application does this"? can you share where? I rechecked DPMSolverMultistepScheduler and I don't see it.

SpiraMira commented 6 months ago

@czkoko thanks for the PR, I'm not being able to reproduce this with either SD1.5 or SDXL, I don't see any real difference in the outputs, maybe you could clarify how you generated those results, also if this is a fix for one specific scheduler I think it would make sense to implement it in Schedulers, it feels really hacky having it here.

@SpiraMira could you explain what you meant with "the hugging face diffusion application does this"? can you share where? I rechecked DPMSolverMultistepScheduler and I don't see it.

Hi @GuiyeC - by that I mean the ml-stable-diffusion demo app @ https://github.com/huggingface/swift-coreml-diffusers (written mostly by folks at hugging face).

The issue is at the pipeline level not the scheduler itself. Below is an excerpt of their pipeline denoising loop (in StableDiffusionPipeline:generateImages()). Their final decoded image(s) source is : denoisedLatents[i] = scheduler[i].modelOutputs.last ?? latents[I] as opposed to just latents[I]

The last time I checked the DPMSolverMultistep scheduler implementations were the same. Do you think there’s an issue there?

Note: all this said, I also am straining to see a marked difference between the two different approaches. @czkoko more details on the model, pipeline, scheduler and prompts used would be helpful.


 // De-noising loop
...
...
...            
            // Predict noise residuals from latent samples
            // and current time step conditioned on hidden states
            var noise = try unet.predictNoise(
                latents: latentUnetInput,
                timeStep: t,
                hiddenStates: hiddenStates,
                additionalResiduals: additionalResiduals
            )

            noise = performGuidance(noise, config.guidanceScale)

            // Have the scheduler compute the previous (t-1) latent
            // sample given the predicted noise and current sample
            for i in 0..<config.imageCount {
                latents[i] = scheduler[i].step(
                    output: noise[i],
                    timeStep: t,
                    sample: latents[i]
                )

                denoisedLatents[i] = scheduler[i].modelOutputs.last ?? latents[i]
            }

            let currentLatentSamples = config.useDenoisedIntermediates ? denoisedLatents : latents

            // Report progress
            let progress = Progress(
                pipeline: self,
                prompt: config.prompt,
                step: step,
                stepCount: timeSteps.count,
                currentLatentSamples: currentLatentSamples,
                configuration: config
            )
            if !progressHandler(progress) {
                // Stop if requested by handler
                return []
            }
        }

        if reduceMemory {
            controlNet?.unloadResources()
            unet.unloadResources()
        }

        // Decode the latent samples to images
        return try decodeToImages(denoisedLatents, configuration: config)
`

GuiyeC commented 6 months ago

@SpiraMira I still don't think this is an issue in the pipeline, they are using Apple's ml-stable-diffusion, and they may just be using the same hack to fix this, but I think the scheduler should be returning the correct output, specially if this doesn't happen with other schedulers.

czkoko commented 6 months ago

@GuiyeC

There is no problem with SD 1.5.
I think you should use good fine-tuning models to test. They no longer need refiner. The picture quality generated by SDXL Base is very bad, and you need to generate real portrait photos, so that it is easier to find the details, and the cartoon style may not be easy to find.
Next, I will test with the better Crystal Clear XL:

prompt: cinematic, A fair-skinned woman. 35mm photograph, highly detailed seed: 3850049176 step: 15 cfg: 5.0

Guernika App: (DPM++ 2M) It is obvious that there are serious picture quality problems. If you turn on tae decoding to preview the image, you will see that the picture quality of the last few steps is better than the final image. 2024-04-01 19 02 33

The following is to use the same parameters and use this PR to fix it. DPM++ 2M: dpmpp2m

DPM++ 2M Karras: dpmpp2mKarras

czkoko commented 6 months ago

The following is 15 images of building a simple command-line program to decode each step. Through GuernikaKit 1.6.1, this PR is not applied.

You can see that the final image adds a lot of unclean noise than step 14. DPM++ 2M will be more obvious. dpmpp2mKarras.zip

GuiyeC commented 6 months ago

@czkoko I'm not saying there is no bug, but still I think this is a bug with the scheduler, not with the pipeline, the scheduler should return the correct output, if it's the last step then it should return the latent ready for decoding.

Are you seeing this on any other schedulers?

czkoko commented 6 months ago

I don't know much about it, but compared with other schedulers, it seems that before the step function return, prevSample needs to be processed by the convertModelOutput function. The step function in DPMSolverMultistepScheduler, DPMSolverSinglestepScheduler directly return prevSample.

GuiyeC commented 6 months ago

This seems to be the same issue.

czkoko commented 6 months ago

@GuiyeC If you want to solve the problem in the scheduler, maybe it should be like this. I have tested, that the picture quality return to normal. Do you think this is right?

    public func step(
        // ......
        if lowerOrderStepped < solverOrder {
            lowerOrderStepped += 1
        }
        if stepIndex == timeSteps.count - 1 {
            return convertModelOutput(modelOutput: output, timestep: t, sample: prevSample)
        }
        return prevSample
}

GuiyeC commented 6 months ago

You could return modelOutput earlier and without reconverting the output:

public func step(
    output: MLShapedArray<Float32>,
    timeStep t: Double,
    sample: MLShapedArray<Float32>,
    generator: RandomGenerator
) -> MLShapedArray<Float32> {
    ...

    let modelOutput = convertModelOutput(modelOutput: output, timestep: timeStep, sample: sample)
    if modelOutputs.count == solverOrder { modelOutputs.removeFirst() }
    modelOutputs.append(modelOutput)

    if stepIndex == timeSteps.count - 1 {
        return modelOutput
    }

    let prevSample: MLShapedArray<Float32>
    ...
    return prevSample
}

I still would like to see how they've solved it in other implementations, but maybe this is good enough for a temporary fix.

SpiraMira commented 6 months ago

@czkoko I'm not saying there is no bug, but still I think this is a bug with the scheduler, not with the pipeline, the scheduler should return the correct output, if it's the last step then it should return the latent ready for decoding.

Are you seeing this on any other schedulers?

@GuiyeC and @czkoko - all the non-solver schedulers will return the same denoised latent so modelOutput[0] == latent. I just double checked in the debugger (my own Guernika based app), so the proposed fix will have no effect (bug or not).

I dug deeper into the GuernikaKit:Schedulers:DPMSolverMultistepScheduler code (vs apple's version) and noticed a difference in the secondOrderUpdate function. The last weightedSum parameters are different. The test case uses 15 steps so (I believe) it triggers the secondOrderUpdate call. Could this be an issue ? Also, which version of DPMSolverMultistepScheduler is the correct one ?

SpiraMira commented 6 months ago

@czkoko I'm not saying there is no bug, but still I think this is a bug with the scheduler, not with the pipeline, the scheduler should return the correct output, if it's the last step then it should return the latent ready for decoding. Are you seeing this on any other schedulers?

@GuiyeC and @czkoko - all the non-solver schedulers will return the same denoised latent so modelOutput[0] == latent. I just double checked in the debugger (my own Guernika based app), so the proposed fix will have no effect (bug or not).

I dug deeper into the GuernikaKit:Schedulers:DPMSolverMultistepScheduler code (vs apple's version) and noticed a difference in the secondOrderUpdate function. The last weightedSum parameters are different. The test case uses 15 steps so (I believe) it triggers the secondOrderUpdate call. Could this be an issue ? Also, which version of DPMSolverMultistepScheduler is the correct one ?

Correction: my bad, it looks like the GuernikaKit Schedulers version was just refactored a bit, so functionally the same.

czkoko commented 6 months ago

Liuliu's draw things has just been open source. Maybe you can find some solutions from his code, and his DPM++ SDE has no problem with SDXL, but it takes double the time. https://github.com/drawthingsai/draw-things-community/tree/main/Libraries/SwiftDiffusion/Sources/Samplers

SpiraMira commented 6 months ago

Liuliu's draw things has just been open source. Maybe you can find some solutions from his code, and his DPM++ SDE has no problem with SDXL, but it takes double the time. https://github.com/drawthingsai/draw-things-community/tree/main/Libraries/SwiftDiffusion/Sources/Samplers

@czkoko - is DPM++ 2M or SDE the problem child? I thought it was 2M Karras. Also, which XL model are you testing with? Would love to be able to reproduce your images on my side...

czkoko commented 6 months ago

@czkoko - is DPM++ 2M or SDE the problem child? I thought it was 2M Karras. Also, which XL model are you testing with? Would love to be able to reproduce your images on my side...

@SpiraMira

DPM++ 2M, DPM++ 2M Karras, DPM++ SDE, DPM++ SDE Karras all have problems, among which DPM++ 2M, DPM++ 2M Karras can use the currently method to solve.
But SDE seems to have other problems. For example, the SDXL picture will be hazy, no matter how many steps there are. After merging the lightning lora, hazy disappears, but the color will be red.
I test with Crystal Clear XL.

SpiraMira commented 6 months ago

@czkoko - is DPM++ 2M or SDE the problem child? I thought it was 2M Karras. Also, which XL model are you testing with? Would love to be able to reproduce your images on my side...

@SpiraMira

DPM++ 2M, DPM++ 2M Karras, DPM++ SDE, DPM++ SDE Karras all have problems, among which DPM++ 2M, DPM++ 2M Karras can use the currently method to solve.

But SDE seems to have other problems. For example, the SDXL picture will be hazy, no matter how many steps there are. After merging the lightning lora, hazy disappears, but the color will be red.

I test with Crystal Clear XL.

@czkoko - So I played around with updating our current DPMSolverMulti along the lines of this hugging face patch for a similar problem: https://github.com/huggingface/diffusers/pull/6477/files#diff-517cce3913a4b16e1d17a0b945a920e400aa5553471df6cd85f71fc8f079b4b4 with some qualitative success with the 2M scheduler.

The most significant piece of their patch affects the step function's lowerOrderFinal evaluation:

# Improve numerical stability for small number of steps
         lower_order_final = (self.step_index == len(self.timesteps) - 1) and (
             self.config.euler_at_final
             or (self.config.lower_order_final and len(self.timesteps) < 15)
             or self.config.final_sigmas_type == "zero"
         )

my version:

let lowerOrderFinal = stepIndex == timeSteps.count - 1 && ((useLowerOrderFinal && timeSteps.count < 15) || !useFinalTraingScheduleSigma)

by setting my useFinalTraingScheduleSigma flag off, I achieve the same effect. Here are some shots with the 2M scheduler:

I used a coreml version of Crystal Clear XL (8 bit 1024x1024) from https://huggingface.co/coreml-community/coreml-crystalClearXL_SDXL_8-bit/tree/main/original - using your seed, prompts and settings as best I could
My results with Karras have always been acceptable, so didn’t focus on Karras yet.
I haven’t looked at the single step schedulers
2M scheduler results only
first row is my fix in reverse order steps 17...14
second row is 2M original in reverse order steps 17..14 (noise > 15 steps as expected)

The rest of the patch adds either the final training schedule sigma or 0 during initialization. I’ve implemented it but I’m still not sure of its overall affect on our schedulers.

Looks like the results are similar to the proposed hack, but this may be backed by more solid science and research. (of course assuming hugging face didn’t just hack this too since it made it into their baseline). I’m not an expert. Would love to hear from @GuiyeC about this. (I can submit some code for review if you like)

GuernikaCore / GuernikaKit

Fixed the use of undenoised latent from the DPM++ scheduler #7