crowsonkb / k-diffusion

Karras et al. (2022) diffusion models for PyTorch
MIT License

[Feature request] Let user provide his own randn data for samplers in sampling.py #25

Closed. AUTOMATIC1111 closed this issue 1 year ago.

AUTOMATIC1111 commented 1 year ago

Please add an option for samplers to accept an argument with random data and use that if it is provided.

The reason for this is as follows.

We use samplers in stable diffusion to generate pictures, and we use seeds to make it possible for other users to reproduce results.

In a batch of one image, everything works perfectly: set seed beforehand, generate noise, run sampler, and get the image everyone else will be able to get.

If the user produces a batch of multiple images (which is desirable because it works faster than multiple independent batches), the expectation is that each image will have its own seed and will be reproducible individually outside of the batch. I achieve that for the DDIM and PLMS samplers from Stable Diffusion by preparing the correct random noise according to the seeds beforehand; since those samplers have no randomness in them, this works well.
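
(For concreteness, a minimal sketch of that kind of per-seed noise preparation; the helper name is only illustrative:)

import torch

def prepare_batch_noise(seeds, shape):
    """Draw each image's initial noise from its own seeded CPU generator,
    so image k of a batch matches the same seed generated on its own."""
    noises = []
    for seed in seeds:
        generator = torch.Generator().manual_seed(seed)
        noises.append(torch.randn(shape, generator=generator))
    return torch.stack(noises)

# e.g. x = prepare_batch_noise([1001, 1002, 1003], (4, 64, 64)).to('cuda') * sigmas[0]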

Samplers here use torch.randn in a loop, so samples in a batch will get different random data than samples produced individually, which results in different output.

An example of what I want to have:

from:

def sample_euler_ancestral(model, x, sigmas, extra_args=None, callback=None, disable=None):
    """Ancestral sampling with Euler method steps."""
    extra_args = {} if extra_args is None else extra_args
    s_in = x.new_ones([x.shape[0]])
    for i in trange(len(sigmas) - 1, disable=disable):
        denoised = model(x, sigmas[i] * s_in, **extra_args)
        sigma_down, sigma_up = get_ancestral_step(sigmas[i], sigmas[i + 1])
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
        d = to_d(x, sigmas[i], denoised)
        # Euler method
        dt = sigma_down - sigmas[i]
        x = x + d * dt
        x = x + torch.randn_like(x) * sigma_up
    return x

to:

def sample_euler_ancestral(model, x, sigmas, extra_args=None, callback=None, disable=None, user_random_data=None):
    """Ancestral sampling with Euler method steps."""
    extra_args = {} if extra_args is None else extra_args
    s_in = x.new_ones([x.shape[0]])
    for i in trange(len(sigmas) - 1, disable=disable):
        denoised = model(x, sigmas[i] * s_in, **extra_args)
        sigma_down, sigma_up = get_ancestral_step(sigmas[i], sigmas[i + 1])
        if callback is not None:
            callback({'x': x, 'i': i, 'sigma': sigmas[i], 'sigma_hat': sigmas[i], 'denoised': denoised})
        d = to_d(x, sigmas[i], denoised)
        # Euler method
        dt = sigma_down - sigmas[i]
        x = x + d * dt
        x = x + (torch.randn_like(x) if user_random_data is None else user_random_data[i]) * sigma_up
    return x

(apart from the new argument in the signature, the difference is only in the next-to-last line)

Birch-san commented 1 year ago

this could help, in the case of Karras samplers at the default of churn=0:
https://github.com/crowsonkb/k-diffusion/pull/30

Birch-san commented 1 year ago

regarding the user_random_data API: I think a callback would be a more flexible API, so the user passes in a Callable[[int], Tensor], which lets the sampler ask for "the rand tensor for this step".

a more opinionated way to do this could be to accept a Generator, so the user can control the RNG source without having to know too much about how it'll be used.
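
(Roughly, the user side of the callback idea could look like the sketch below; make_step_noise_callback is just an illustrative name, not a proposed API:)

from typing import Callable

import torch

def make_step_noise_callback(prepared: torch.Tensor) -> Callable[[int], torch.Tensor]:
    """prepared holds pre-generated noise with shape [steps, batch, C, H, W];
    the returned callable hands the sampler the tensor for a given step index."""
    def noise_for_step(i: int) -> torch.Tensor:
        return prepared[i]
    return noise_for_step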

crowsonkb commented 1 year ago

A callable sounds like a good idea. :)

keturn commented 1 year ago

I've found that it's very difficult to use a torch.Generator for such an application, and suggest using coordinate-based procedural noise.

crowsonkb commented 1 year ago

I didn't want to use a torch.Generator because then you can't do things like "expanding" (adding detail to) or downsampling paths of Brownian motion to try to get a similar path using a different number of steps.

AUTOMATIC1111 commented 1 year ago

Callable would be fine too. As it is now I prepare the noise beforehand and replace your module's torch with a torch that has randn_like rigged to return my prepared tensors.

https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/c715ef04d1edb1a112a602639ed3bb292fdeb0e2/modules/sd_samplers.py#L224
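
(A simplified sketch of that kind of monkey-patch, not the exact webui code:)

import torch
import k_diffusion.sampling as k_sampling

class TorchHijack:
    """Stands in for the torch module, serving pre-generated noise from randn_like."""
    def __init__(self, prepared_noises):
        self.prepared_noises = prepared_noises  # one tensor per sampler step
        self.index = 0

    def __getattr__(self, name):
        # Everything except randn_like falls through to the real torch module.
        return getattr(torch, name)

    def randn_like(self, x):
        noise = self.prepared_noises[self.index % len(self.prepared_noises)]
        self.index += 1
        return noise

# Before calling a sampler, swap the module-level reference:
# k_sampling.torch = TorchHijack(prepared_noises)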

crowsonkb commented 1 year ago

I added the callable version in a branch: https://github.com/crowsonkb/k-diffusion/tree/noise-samplers. The ancestral samplers now take a noise_sampler= argument, a callable with two arguments, sigma and sigma_next, the interval to return the noise for. I also added a (non-default) noise sampler based on torchsde.BrownianTree, which produces more stable samples across different numbers of steps and different ancestral samplers (they should in fact converge to the same limiting image using it, given enough steps). You use it like this:

ns = K.sampling.BrownianTreeNoiseSampler(x, sigma_min, sigma_max)
samples = K.sampling.sample_something_ancestral(..., noise_sampler=ns)

I'd like people to try it out, it's pretty neat!
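
(For comparison, a custom noise_sampler that just reproduces the default behaviour with a fixed seed could look roughly like this; make_seeded_noise_sampler is an illustrative name, and only the two-argument callable contract comes from the branch:)

import torch

def make_seeded_noise_sampler(x, seed):
    """Ignores the (sigma, sigma_next) interval and draws seeded Gaussian noise
    shaped like x, i.e. the default behaviour but reproducible."""
    generator = torch.Generator().manual_seed(seed)
    def noise_sampler(sigma, sigma_next):
        return torch.randn(x.shape, generator=generator).to(device=x.device, dtype=x.dtype)
    return noise_sampler

# samples = K.sampling.sample_euler_ancestral(model, x, sigmas,
#                                             noise_sampler=make_seeded_noise_sampler(x, 1234))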

crowsonkb commented 1 year ago

BrownianTreeNoiseSampler also supports batches of seeds, e.g. K.sampling.BrownianTreeNoiseSampler(x, sigma_min, sigma_max, seed=[1, 2, 3, 4]); seeds must be nonnegative integers.

Birch-san commented 1 year ago

amazing! definitely keen to try that out. what does "more stable" mean? do you mean that a wider variety of step counts (in particular lower step counts) will succeed in finding the "converged" image?

crowsonkb commented 1 year ago

"New York City, oil on canvas", guidance scale 5.5, all samples drawn with eta=1:

12 steps, DPM++ 2S: [image grid-0271-2]

24 steps, DPM++ 2S: [image grid-0272-2]

50 steps, Euler ancestral: [image grid-0275-2]

Birch-san commented 1 year ago

@crowsonkb outstanding work; I'll give it a try.

another thing that's important for stable convergence is precision.

Karras for example creates noise in float64:
https://github.com/NVlabs/edm/blob/0a2ff34f0b80415ace7af7311074bd2255da0d1e/generate.py#L35

runs the model in its native dtype (which may be low-precision), but casts the result to float64 again:
https://github.com/NVlabs/edm/blob/0a2ff34f0b80415ace7af7311074bd2255da0d1e/generate.py#L50

@marunine says error can be minimized by keeping as much of the calculation as possible in high precision, especially when multiplying against sigmas.

any thoughts on providing a way to opt-in to mixed-precision?
note: on Mac we cannot use double-precision on-GPU, so it might be useful to also have a way to choose the device used.

currently t_to_sigma() is hardcoded to float32 for example (which I had to override in order to sample from an fp16 UNet):
https://github.com/crowsonkb/k-diffusion/blob/60e5042ca0da89c14d1dd59d73883280f8fce991/k_diffusion/external.py#L81
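
(To make the pattern concrete, a hedged sketch of the kind of wrapper this would imply; it is not something k-diffusion currently provides:)

import torch

def denoise_high_precision(model, x, sigma, **extra_args):
    """Keep x and the sigma math in float64, run the model in its own (possibly
    fp16) dtype, and cast the output back to float64, following the edm pattern.
    Note: float64 is unavailable on the GPU on Mac, so x would stay float32 there."""
    model_dtype = next(model.parameters()).dtype
    denoised = model(x.to(model_dtype), sigma.to(model_dtype), **extra_args)
    return denoised.to(torch.float64)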

Birch-san commented 1 year ago

Wow, Brownian tree noise sampler makes convergence way more stable!

sample_dpmpp_2s_ancestral

10, 15, 20, 25, 30, 35, 40, 100 steps

Brownian noise

Default noise

One of my attempts at generating a 35-step sample, and two of my attempts at generating a 100-step sample, resulted in an all-black image (maybe due to ±Inf or NaN). This is a known problem with high step counts on Stable Diffusion that people have been having since day 1, even with the built-in DDIM/PLMS samplers. I think it's not a k-diffusion problem, and it may even be Mac-specific.

Interestingly, this happened for 3 image attempts with default noise, but never with Brownian noise. Could be a coincidence.

sample_dpm_adaptive

Steps unspecified. Am I understanding correctly that DPM adaptive just continues sampling until it reaches the fully-converged image, so this is what the fully-converged image should look like?

Is it surprising that neither default noise nor Brownian noise reached the sample that we got via sample_dpm_adaptive?

crowsonkb commented 1 year ago

For DPM Adaptive I think you might want to decrease the per-step error tolerances atol and rtol; the defaults are kind of lax and don't reach the converged image. (Also make sure to pass in the Brownian tree.)
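
(For concreteness, a call along those lines, using only the parameter names that appear in this thread; check sampling.py for the exact signature:)

ns = K.sampling.BrownianTreeNoiseSampler(x, sigma_min, sigma_max)
samples = K.sampling.sample_dpm_adaptive(model, x, sigma_min, sigma_max,
                                         rtol=0.025, atol=0.0039,
                                         eta=1., noise_sampler=ns)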

I am working on a version of DPM++ 2S that is even better at resembling the converged image at lower step counts. It does a noise addition between the first stage and the second, and then uses the Brownian tree to replay part of that noise for the noise addition after the second stage. This new sampler is particularly good for old-style CLIP-guided diffusion too, because it prevents the model from ever seeing an x_t dragged off-distribution by CLIP without a noise addition in between to put it back on distribution. I just need to know what to name it...

Birch-san commented 1 year ago

so with sample_dpm_adaptive

make sure to pass in the Brownian tree

first is Brownian, second is default:

both ran for 51 iterations

decrease the per-step error tolerances atol and rtol

all using Brownian, halving tolerances each time…

rtol=0.05, atol=0.0078 (default), 51 iterations:

rtol=0.025, atol=0.0039, 57 iterations:

rtol=0.0125, atol=0.00195, 69 iterations:

rtol=0.00625, atol=0.000975, 84 iterations:

rtol=0.003125, atol=0.0004875, 105 iterations:

hmm, I'm not getting the impression that the tolerances were too high to allow convergence. It feels like there's something else going on here, preventing sample_dpm_adaptive and sample_dpmpp_2s_ancestral from converging to the same image.
This is float32, by the way; I can't check float64 on-GPU on Mac.

crowsonkb commented 1 year ago

so with sample_dpm_adaptive

make sure to pass in the Brownian tree

first is Brownian, second is default: both ran for 51 iterations

Since those are the same I think you need to set eta to something other than the default of 0, because it isn't using the random noise at all.

crowsonkb commented 1 year ago

Fixed by https://github.com/crowsonkb/k-diffusion/commit/7621f11f786ec17119ae9cac8e88971b822a4bbe.

Birch-san commented 1 year ago

Since those are the same I think you need to set eta to something other than the default of 0, because it isn't using the random noise at all.

@crowsonkb

okay yeah, running sample_dpm_adaptive with eta increased to 0.734375 or 0.75 produces something that resembles the pose sample_dpmpp_2s_ancestral was producing.
I used the same Brownian tree noise seed for both samplers.

target we're trying to converge on (sample_dpmpp_2s_ancestral, 100 steps):

sample_dpm_adaptive, eta:
0.0, 0.5, 0.625, 0.6875, 0.71875, 0.734375, 0.75, 1.0

is eta supposed to behave like a scale where "more eta = more converged"? does it range from 0.0 to 1.0?
eta=1.0 didn't resemble the pose as much. is that likely to just be because rtol and atol were too high, so too much error was tolerated and we didn't get a representative converged image?

none of sample_dpm_adaptive's results had as large sleeves as sample_dpmpp_2s_ancestral was producing. is that again likely to just be down to too-high rtol and atol?

in the sample_dpmpp_2s_ancestral results I ran previously: from 25 steps onwards, further sampling produced very little change. does that imply that the high-step sample_dpmpp_2s_ancestral results we've seen are a good representation of the "converged" image? furthermore does that imply that sample_dpm_adaptive should be able to converge on the same image (i.e. with the same pose and large sleeves) given the right configuration?

Birch-san commented 1 year ago

oh wow, halving rtol to 0.025 does help sample_dpm_adaptive produce big sleeves similar to the ones sample_dpmpp_2s_ancestral converged on.

target we're trying to converge on (sample_dpmpp_2s_ancestral, 100 steps):

sample_dpm_adaptive
eta=0.75
atol 0.0078 (default)
rtol=0.05 (default), 0.0375, 0.025, 0.0125

hm, when rtol is reduced even further (to 0.0125), the result diverges from what sample_dpmpp_2s_ancestral produced. Does that make it likely to be a more representative result (of what the model would return if you denoised on all 1000 timesteps)? That seems unintuitive, since sample_dpmpp_2s_ancestral appeared to be already converged (sampling for more steps ceased to produce further significant change), so it feels like that's the result I should treat as more representative.