cloneofsimo / lora

Using Low-rank adaptation to quickly fine-tune diffusion models.
https://arxiv.org/abs/2106.09685
Apache License 2.0

classifier free guidance during training and inference #254

Open Lucky-Light-Sun opened 1 year ago

Lucky-Light-Sun commented 1 year ago

Hi, thanks for your awesome contributions. These days I want to do some work on Stable Diffusion (SD), so I have been carefully looking through your source code. I have some questions about it and hope to receive your answer.

  1. Classifier-free guidance

According to the paper High-Resolution Image Synthesis with Latent Diffusion Models, Stable Diffusion uses classifier-free guidance to train the LDM and gets fantastic images. Algorithm 1 and Algorithm 2 in Classifier-Free Diffusion Guidance describe how to use the guidance during training and inference.

In the lora_diffusion/cli_lora_pti.py file, the loss_step function in the training code does not seem to use classifier-free guidance, because the textual condition is never set to "" (the empty string) with some probability. So I want to ask: is that because we do not need classifier-free guidance during the fine-tuning process, or was it simply left out? Furthermore, diving into the StableDiffusionInpaintPipeline in the official Hugging Face diffusers library, I see that they do use classifier-free guidance during inference, as the code below shows:

```python
# StableDiffusionInpaintPipeline._encode_prompt:
if do_classifier_free_guidance and negative_prompt_embeds is None:
    uncond_tokens: List[str]
    if negative_prompt is None:
        uncond_tokens = [""] * batch_size
    ...
```

```python
# StableDiffusionInpaintPipeline.__call__:
# perform guidance
if do_classifier_free_guidance:
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
```

Here is the code that handles the textual information during training. We can see that the text is just the raw caption from the dataset and is never replaced with the empty string with some probability such as 50%. So I guess we don't use classifier-free guidance during the fine-tuning process.

https://github.com/cloneofsimo/lora/blob/bdd51b04c49fa90a88919a19850ec3b4cf3c5ecd/lora_diffusion/cli_lora_pti.py#L315-L324
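
To make the question concrete, here is a minimal sketch of what caption dropout (Algorithm 1 of Classifier-Free Diffusion Guidance) could look like inside the loss step. The names `uncond_prob`, `batch`, `tokenizer`, and `text_encoder` are just my assumptions for illustration, not the repo's actual API:

```python
import random

# Hypothetical sketch: drop the caption with probability uncond_prob so the
# model also learns the unconditional score (Algorithm 1 of the CFG paper).
# uncond_prob, batch, tokenizer, and text_encoder are illustrative names only.
def encode_caption_with_dropout(batch, tokenizer, text_encoder, uncond_prob=0.1):
    captions = batch["caption"]
    # Replace each caption with the empty string with probability uncond_prob.
    captions = ["" if random.random() < uncond_prob else c for c in captions]
    input_ids = tokenizer(
        captions,
        padding="max_length",
        max_length=tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids
    # The resulting hidden states would then be fed to the UNet as usual.
    return text_encoder(input_ids.to(text_encoder.device))[0]
```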

I'm also wondering whether we can treat the image condition the same way as the textual condition and apply classifier-free guidance to it as well. In the StableDiffusionInpaintPipeline __call__ function, I don't see any classifier-free guidance applied to the image condition; only the textual information is used for classifier-free guidance.

In the code below, we can see that latent_model_input is simply concatenated with mask and masked_image_latents. Maybe we could also use classifier-free guidance for the image information; see the sketch after the link below.

https://github.com/cloneofsimo/lora/blob/bdd51b04c49fa90a88919a19850ec3b4cf3c5ecd/lora_diffusion/cli_lora_pti.py#L306-L313
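
If we did want to guide on the image condition too, one possible (untested) sketch at inference time would be to give the unconditional branch a zeroed-out masked-image latent, analogous to how the text branch uses the empty prompt. The function below mirrors the diffusers inpainting pipeline's variable names, but the image-dropout idea itself is just my assumption:

```python
import torch

# Hypothetical sketch of classifier-free guidance over the image condition as
# well as the text condition. The unconditional branch gets the empty-prompt
# embeddings AND a zeroed masked-image latent; only the text part of this is
# what StableDiffusionInpaintPipeline actually does today.
def guided_noise_pred(unet, latents, mask, masked_image_latents,
                      uncond_embeds, text_embeds, t, guidance_scale):
    # Conditional branch: real masked-image latents + real prompt embeddings.
    cond_input = torch.cat([latents, mask, masked_image_latents], dim=1)
    # Unconditional branch: zeroed masked-image latents + empty-prompt embeddings.
    uncond_input = torch.cat(
        [latents, mask, torch.zeros_like(masked_image_latents)], dim=1
    )
    model_input = torch.cat([uncond_input, cond_input], dim=0)
    encoder_hidden_states = torch.cat([uncond_embeds, text_embeds], dim=0)
    noise_pred = unet(model_input, t, encoder_hidden_states=encoder_hidden_states).sample
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    # Standard classifier-free guidance combination.
    return noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
```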

Thank you for reading the above content. Currently, I am also conducting relevant experiments, and I hope you can provide some valuable suggestions!

  2. Inpainting code: the blur_amount parameter

In your code, I see the definition of the blur_amount parameter, but I don't see it being used anywhere.

https://github.com/cloneofsimo/lora/blob/bdd51b04c49fa90a88919a19850ec3b4cf3c5ecd/lora_diffusion/cli_lora_pti.py#L853

I think maybe you forgot to use it here:

https://github.com/cloneofsimo/lora/blob/bdd51b04c49fa90a88919a19850ec3b4cf3c5ecd/lora_diffusion/dataset.py#L214-L219

because the definition of this function is as follows:

```python
def face_mask_google_mediapipe(
    images: List[Image.Image], blur_amount: float = 80.0, bias: float = 0.05
) -> List[Image.Image]
```
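
If the intent was simply to forward that argument, I would guess the call in dataset.py could pass it through, roughly like the sketch below (the attribute names `self.instance_images` and `self.blur_amount` are my guesses, not the actual code):

```python
# Hypothetical sketch only; the attribute names are my assumptions.
# The idea is to forward the blur_amount value coming from cli_lora_pti.py
# instead of silently falling back to the function's default of 80.0.
self.mask = face_mask_google_mediapipe(
    self.instance_images,
    blur_amount=self.blur_amount,
)
```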

My English writing is not good; I hope you can forgive me.

Finally, thank you for reading, and I look forward to your reply!