deroberon / demofusion-comfyui

An experimental ComfyUI node implementing the DemoFusion technique.

Suggestion: Modify the Unsampler node to generate the noised samples #2

Open ttulttul opened 7 months ago

ttulttul commented 7 months ago

In the DemoFusion pipeline code, they implement the paper's various stages, one of which, of course, is noising the image step by step to produce a set of z' latents:

https://github.com/deroberon/demofusion-comfyui/blob/74559c79da6e4353747e674525865c314d6e2efd/pipeline_demofusion_sdxl.py#L1004C13-L1004C13

In the ComfyUI world, we have the Unsampler node from https://github.com/BlenderNeko/ComfyUI_Noise, which does this but does not currently keep all the intermediate noised samples - it only gives you the final noised sample. In an effort to make this node more Comfy-ish, perhaps we can encourage @BlenderNeko to update the Unsampler node to optionally pump out a batch of latents representing all of the intermediate noising steps, rather than just the final noised sample. See https://github.com/BlenderNeko/ComfyUI_Noise/blob/f227455f930ad1b5766f1a76e1bbdb911adfb85c/nodes.py#L201

Perhaps we can hook into this callback function (https://github.com/BlenderNeko/ComfyUI_Noise/blob/f227455f930ad1b5766f1a76e1bbdb911adfb85c/nodes.py#L233C16-L233C16) and peel off the latest noised latent, adding it to an array that can then be used by the rest of the existing pipeline code?
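
Something like this is what I have in mind (a rough sketch; the callback signature below is an assumption based on how ComfyUI samplers typically report progress, so check ComfyUI_Noise's actual code):

import torch

# Hypothetical sketch: collect every intermediate noised latent via the
# sampler's progress callback, then concatenate them into one batch.
intermediate_latents = []

def collect_latent_callback(step, x0, x, total_steps):
    # x is assumed to be the current latent at this unsampling step.
    intermediate_latents.append(x.clone().cpu())

# ... run the unsampler with callback=collect_latent_callback ...
# latent_batch = torch.cat(intermediate_latents, dim=0)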

ttulttul commented 7 months ago

Dr.Lt.Data pointed out that their "Inspire" node pack contains a node that spits out progress latents: https://github.com/ltdrdata/ComfyUI-Inspire-Pack/blob/771392e9dee6aa3b73b83345535569ae0202c8c9/inspire/sampler_nodes.py#L30

This is essentially a batch of latents from the sampling process. It ought to be relatively easy to copy this approach into a new Unsampler node to get a batch of latents that have been progressively noised for our purposes here in DemoFusion land.

ttulttul commented 7 months ago

Please see my latest commit. I added a Batch Unsampler node, which works just like the Unsampler node from ComfyUI Noise, except at each step of unsampling, it collects the intermediate latent. These are concatenated together into a batch and returned at the output of the node. You can then send them to VAEDecode and ImagePreview and see all the intermediate latents. I think that once we have all the intermediate latents, we can then start looking at the next step for DemoFusion, which is to progressively de-noise these latents step by step, mixing in a bit of the re-noised latents (z-prime in the paper) at each step.

ttulttul commented 7 months ago

So my vision here is to have a few different nodes that combine to implement DemoFusion:

  1. Take your latent and resize it by 2x (see the sketch after this list).
  2. Use Batch Unsampler to generate all the latents going back to full noise.
  3. Take this batch of progressively noisier latents through another new node that allows sampling from a batch of latents in the DemoFusion way.
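
For step 1, a minimal sketch of the 2x resize, assuming we operate directly on the 4D latent tensor (the eventual node might use a different upscale mode):

import torch.nn.functional as F

# Hypothetical step-1 sketch: upscale a [batch, channels, height, width]
# latent by 2x; nearest-neighbor is just one reasonable choice here.
def resize_latent_2x(z):
    return F.interpolate(z, scale_factor=2, mode="nearest")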

ttulttul commented 7 months ago

Something like that, anyhow.

deroberon commented 7 months ago

Hi, your Unsampler works like a charm. I did a little workflow to visualize it:

[workflow1 screenshot]

It's clearly noising the latents step by step. As you pointed out, the noise generation happens here: https://github.com/deroberon/demofusion-comfyui/blob/74559c79da6e4353747e674525865c314d6e2efd/pipeline_demofusion_sdxl.py#L1004C13-L1004C13

That's where the latents are calculated and updated to generate x_{t-1}.

And the first thing we have to do, I guess, is to create a latent input in the DemoFusion node, pass the latents generated by Batch Unsampler to DemoFusionSDXLPipeline, and then try to modify this part to use the latents generated by the node, right?

ttulttul commented 6 months ago

And the first thing we have to do, I guess, is to create a latent input in the DemoFusion node, pass the latents generated by Batch Unsampler to DemoFusionSDXLPipeline, and then try to modify this part to use the latents generated by the node, right?

I have created (but not pushed) a Batch KSampler node, which takes the batch of latents that have been unsampled and aims to de-noise them, following the DemoFusion approach generally but adding more comfy-like flexibility. For starters, all it does is iterate through the latents and sample each from i to steps. In other words, the first, noisiest latent (z_prime(T)) gets de-noised from 0 to steps. The second (z_prime(T-1)) gets de-noised from 1 to steps, etc.

The batch sampler node adds a reverse switch so that the latents at the input can be flipped, giving us the noisiest one first and then proceeding from there.
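
In rough pseudocode (every name here is illustrative, not the actual node code; sample_fn stands in for whatever ComfyUI sampling call we end up using):

# Illustrative sketch of the Batch KSampler loop.
def batch_ksample(latents, steps, sample_fn, reverse=False):
    if reverse:
        latents = latents.flip(0)  # noisiest latent first
    outputs = []
    for i, latent in enumerate(latents):
        # Latent i is i steps less noisy than the first, so de-noise it
        # over the remaining schedule: from step i to steps.
        outputs.append(sample_fn(latent.unsqueeze(0), start_step=i, end_step=steps))
    return outputs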

Anyhow, once I get batch sampling working, I can add in the scaled mixing from the paper using their cosine decay function to gradually blend in less and less of the z_prime latent at each time step.
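
Something like this is what I have in mind for the blend (assuming a plain cosine decay; the paper's exact weighting may differ, so verify against their pipeline code):

import math

# Hedged sketch: blend the pre-noised z_prime latent into the current
# latent with a weight that decays from 1 toward 0 over the schedule.
def blend_with_z_prime(z, z_prime, step, total_steps):
    w = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return w * z_prime + (1.0 - w) * z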

We then need to figure out how to do the sliding window stuff at each step, which I think will be the hard part.

ttulttul commented 6 months ago

BTW I also re-read the original diffusion paper and realized that you can add noise to a latent using simple math rather than calling out to a sampler. I'll try doing that as well because it will be far faster than running the KSampler against a reversed set of sigmas.

ttulttul commented 6 months ago

Status update - sorry, no commit just yet. I am just about finished applying the blending of z_prime during de-noising after the upscale. Just making it fast.

ttulttul commented 6 months ago

Making progress:

[progress preview image]

I noticed that Unsampler wasn't really doing what we want. "Unsampling" is a VERY straightforward process. You are just adding normal noise to the original image in accordance with the sigmas schedule, which you can get from the model. This fast and efficient torch code takes a 4D latent batch, x, along with a sigmas tensor that you can get via

sigmas = sampler.sigmas.flip(0) + 0.0001

and applies the sigmas to the latent to generate a batch of progressively noised latents following the sigmas (noise schedule). You don't have to send the latent through a sampler at all and there's no point using Comfy's sampler code for this purpose. Unsampling doesn't rely on the prompt at all; it's the same for all LDM models.

I do not know whether LCM works in the same way, but I suspect it does not, so be warned that this approach may not work with LCM.

import logging

import torch

logger = logging.getLogger(__name__)

def generate_noised_latents(x, sigmas):
    """
    Generate all noised latents for a given initial latent image and sigmas in parallel.

    :param x: Original latent image as a PyTorch tensor.
    :param sigmas: Array of sigma values for each timestep as a PyTorch tensor.
    :return: A tensor containing all noised latents for each timestep.
    """
    # Ensure that x and sigmas are on the same device (e.g., CPU or CUDA).
    device = x.device
    sigmas = sigmas[1:].to(device)  # ignore the first sigma
    batch_size = x.shape[0]
    num_sigmas = len(sigmas)

    # Expand x and sigmas to match each other in the first dimension.
    # x_expanded shape will be:
    # [batch_size * num_sigmas, channels, height, width]
    x_expanded = x.repeat(num_sigmas, 1, 1, 1)
    sigmas_expanded = sigmas.repeat_interleave(batch_size)

    logger.debug(f"sigmas: {sigmas.view(-1)}")

    # Create a noise tensor with the same shape as x_expanded.
    noise = torch.randn_like(x_expanded)

    logger.debug(f"noise: {noise.shape}")
    logger.debug(f"x:     {x.shape}")

    # Multiply noise by sigmas, reshaped for broadcasting over the
    # channel and spatial dimensions.
    noised_latents = x_expanded + noise * sigmas_expanded.view(-1, 1, 1, 1)

    logger.debug(f"noised_latents: {noised_latents.shape}")

    return noised_latents

ttulttul commented 6 months ago

Work in progress... If you only do the first part of the DemoFusion paper - iterative de-noising while mixing in the noised latents at each step - and guide diffusion using a depth ControlNet, you can get a result like this going from a 512x512 source image up-scaled 4x to 2Kx2K:

[result image]

deroberon commented 6 months ago

OMG!! It's so amazing what you did in a few days!! And your workflow in the example folder demonstrates perfectly the usage of the Sampler and Unsampler. I've also tried it with SDXL models and it kind of works. The roughness in the intermediate steps is amplified.

I also created another workflow in the example folder that compares the KSampler application with just a light denoise, so we can see the impact of the different techniques we are applying, compared to just upscaling the latents and denoising them.

deroberon commented 6 months ago

BTW, it's an amazing image!

ttulttul commented 6 months ago

Thanks for the many compliments. I can now appreciate why in the paper they talk about the first technique being insufficient because it produces grainy output. This is why they then do their "dilated sampling".

But the paper isn't super clear about how dilated sampling works. They talk about getting a series of "global" latent representations by de-noising a Gaussian-blurred version of several parts of the latent, which according to Figure 3 seem to be overlapping. I guess that's what is meant by "dilated". The number of global samples is set to s^2, where s is the scale factor (2, 3, 4, etc.), so you start with four global latents, and at the next scale up you have nine, etc.

These global latents are de-noised and then mixed somehow with what they call "local" representations. I wonder if the local representations just mean the z[i] that was just de-noised?
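
If I had to guess, "dilated" means taking s^2 strided sub-grids of the latent as the global representations; hypothetically:

# Pure guesswork at "dilated sampling": split the latent into s*s strided
# sub-grids, each at 1/s the spatial resolution.
def dilate_latent(z, s):
    # z: [batch, channels, height, width]
    return [z[:, :, i::s, j::s] for i in range(s) for j in range(s)]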

Anyhow I'll read their code and take a crack at it.

ttulttul commented 6 months ago

I also suspect that their technique may not be the best. I'd like to make a node that gives you lots of options and flexibility so that people can try out different things.

Wukong1-137 commented 6 months ago

following this with much enthusiasm... keep up the great work guys!