Open Karbo123 opened 2 years ago
It is for classifier-free diffusion guidance. The zero embedding means "no prompt", i.e. generate unconditionally. (The model was trained with 20% of the input embeddings replaced at random with the zero embedding.) We can take the difference between the unconditional distribution and the conditional distribution, strengthen it, and generate from the resulting artificially "super-conditional" distribution. This produces images that fit the prompt better than you would get from just inputting the prompt's embedding.
The weight of the zero embedding is supposed to be negative. Say you want to generate with a `cond_scale` of 3: then your prompt's weight should be 3 and the zero embedding's weight should be -2 (the weights have to sum to 1 but are otherwise unrestricted). This follows from the formula for classifier-free guidance with a single prompt:
`v = v_uncond + (v_cond - v_uncond) * cond_scale`
If `cond_scale` is 3, then the first term has weights (1, 0) for (uncond, cond) and the second has (-3, 3). Add them and you get the weights (-2, 3).
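To make the arithmetic concrete, here is a minimal sketch in plain Python (not the code from `cfg_sample.py`). The function names and the toy prediction values are assumptions for illustration only. It checks that the guidance formula and the weighted sum with a zero-embedding weight of `1 - sum(weights)` give the same result.

```python
import math

def cfg_direct(v_uncond, v_cond, cond_scale):
    """Single-prompt guidance: v = v_uncond + (v_cond - v_uncond) * cond_scale."""
    return [u + (c - u) * cond_scale for u, c in zip(v_uncond, v_cond)]

def cfg_weighted(v_uncond, v_conds, weights):
    """Weighted sum over predictions; the unconditional (zero embedding)
    prediction gets weight 1 - sum(weights), so all weights sum to 1."""
    w_uncond = 1 - sum(weights)
    out = [w_uncond * u for u in v_uncond]
    for v_cond, w in zip(v_conds, weights):
        out = [o + w * c for o, c in zip(out, v_cond)]
    return out

# Made-up model outputs for one tiny "image" of three values.
v_uncond = [0.2, -0.5, 1.0]
v_cond = [0.6, 0.1, 0.4]

direct = cfg_direct(v_uncond, v_cond, cond_scale=3)
weighted = cfg_weighted(v_uncond, [v_cond], weights=[3])  # zero-embedding weight is -2

print(direct)
print(weighted)
print(all(math.isclose(a, b) for a, b in zip(direct, weighted)))
```

Because the zero embedding's weight is defined as whatever makes the total equal 1, the prompt weights do not need to be normalized separately.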
Thanks for the quick reply! This is very helpful! If you don't mind, may I ask some other questions here?
After learning about diffusion models, I am quite curious whether it is possible to use the denoising process to modify images directly in pixel space, without manually adding noise or converting them back to a noisy state.
As far as I know, diffusion models have been applied to many tasks such as image generation, image inpainting and image editing, but they all start from Gaussian noise and gradually denoise it to recover an image. I also see that you implement a reverse sampling process, which makes it possible to obtain a noisy image from the original clean image. Nearly all of these methods require a noisy image to start the denoising process. Is it really necessary to start from a noisy image? Is it possible to directly operate on a clean image (without any noise) via iterative updates?
I've run some toy experiments on this. They show that the initial images do not change much: their shapes only blur a bit, and some high-frequency details are lost after denoising. What surprised me is how little the image changes. I originally expected it to change a lot, because the process starts from 1 and gradually declines to zero during the updates.
Thanks again!
> Is it really necessary to start from a noisy image? Is it possible to directly operate on a clean image (without any noise) via iterative updates?
You can use the reverse sampling process to find an artificial "noise" image that will, with the same model, produce a particular clean image, but it may not actually have the statistical properties of real noise! You said some high-frequency details were lost. If you increase your step count enough, you should be able to recover them too: diffusion models are perfectly invertible in the limit of infinite timesteps.
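As a rough illustration of the step-count point, here is a toy sketch in plain Python. It is not the repository's sampling code. It assumes a one-dimensional data distribution N(0, S^2), for which the ideal `v` prediction has a closed form, and a cosine schedule with alpha = cos(t * pi / 2) and sigma = sin(t * pi / 2). The names `toy_v_model`, `ddim_walk` and `round_trip_error` are made up for this example. It runs deterministic (eta = 0) DDIM-style steps from t = 0 to t = 1 and back, then reports how far the reconstruction is from the original value.

```python
import math

S = 0.5  # assumed standard deviation of the toy data distribution N(0, S^2)

def alpha_sigma(t):
    """Assumed cosine schedule: signal scale alpha and noise scale sigma at time t."""
    return math.cos(t * math.pi / 2), math.sin(t * math.pi / 2)

def toy_v_model(x, t):
    """Closed-form ideal v prediction for N(0, S^2) data (stand-in for a trained network)."""
    alpha, sigma = alpha_sigma(t)
    return alpha * sigma * (1 - S ** 2) / (alpha ** 2 * S ** 2 + sigma ** 2) * x

def ddim_walk(x, ts):
    """Deterministic DDIM-style steps along the timesteps ts, in either direction."""
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        alpha, sigma = alpha_sigma(t_cur)
        v = toy_v_model(x, t_cur)
        pred = x * alpha - v * sigma   # predicted clean value
        eps = x * sigma + v * alpha    # predicted noise
        alpha_next, sigma_next = alpha_sigma(t_next)
        x = pred * alpha_next + eps * sigma_next
    return x

def round_trip_error(x0, steps):
    """Invert x0 to a latent at t = 1, sample back to t = 0, return the absolute error."""
    ts = [i / steps for i in range(steps + 1)]
    latent = ddim_walk(x0, ts)         # clean -> "noise"
    recon = ddim_walk(latent, ts[::-1])  # "noise" -> clean
    return abs(recon - x0)

for steps in (10, 100, 1000):
    print(steps, round_trip_error(1.0, steps))
```

In this toy setting the round-trip error shrinks as the number of steps grows. A real network is not linear like this stand-in, so treat this only as an illustration of the trend.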
I really appreciate the detailed reply! I know my questions might be somewhat easy or stupid, but I think my doubt hasn't been fully resolved.
You did mention a possible solution: reversing a clean, high-SNR image back to a relatively noisy, low-SNR image. It seems that in order to use a diffusion model, we may have to convert our clean input image back to a noisy one that the model can handle and denoise.
Is it right that a diffusion model must start from a noisy image (i.e. a low-SNR image that has the properties of noise and looks noisy)? If it starts from a clean image (i.e. one not reversed back to a noisy state), is it likely to fail to modify the image contents? If there is any possibility of modifying the content directly, without going through a noisy state, I think that would be quite meaningful, because it might reduce the number of iteration steps required for image generation (note that reverse sampling also requires extra steps to take a clean image back to its noisy counterpart).
Thanks for this great work. I've recently become interested in using diffusion models to generate images iteratively. I found that your script `cfg_sample.py` is a nice implementation and decided to learn from it. However, because I'm new to this field, I've run into some problems that are quite hard for me to understand. It'd be great if you could provide some hints or suggestions. Thank you!! My questions are listed below. They're about the script `cfg_sample.py`.
1. It uses `zero_embed` as one of the features for conditioning. What is the purpose of using it? Is it designed to allow the case of no prompt for input?
2. The weight of `zero_embed` is computed as `1 - sum(weights)`. I think the `1` is there to make the weights sum to one, but the weight of `zero_embed` could then be a negative number. Should the weights be normalized before all the intermediate noise maps are weighted?
Thanks very much!!