Hi, thanks for your awesome contributions.
These days I want to do some work on Stable Diffusion (SD), so I have been carefully reading through your source code. I have some questions about it and hope to receive your answer.
According to the paper *High-Resolution Image Synthesis with Latent Diffusion Models*, Stable Diffusion uses classifier-free guidance to train the LDM and gets fantastic images, and Algorithm 1 and Algorithm 2 in *Classifier-Free Diffusion Guidance* tell us how to use the guidance at training and inference time. However, in the `lora_diffusion/cli_lora_pti.py` file, the `loss_step` function in the training code does not seem to use classifier-free guidance, because the textual condition is never set to `""` (the empty string) with some probability. So I want to ask: is that because we do not need classifier-free guidance during the fine-tuning process, or was it just forgotten?
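To make the question concrete, here is a minimal sketch of the caption dropout I mean; `DROP_PROB` and `maybe_drop_caption` are names I made up for illustration, not anything from your code:

```python
import random

# Hypothetical sketch: randomly drop the caption so the model also learns
# the unconditional score, as in classifier-free guidance training.
DROP_PROB = 0.1  # assumed value; the CFG paper ablates rates like 0.1 and 0.2

def maybe_drop_caption(caption: str) -> str:
    # With probability DROP_PROB, replace the caption with the empty string,
    # so its text embedding becomes the unconditional embedding.
    return "" if random.random() < DROP_PROB else caption
```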
What's more, diving into the `StableDiffusionInpaintPipeline` in the official Hugging Face library `diffusers`, I see that they do use classifier-free guidance during inference, as we can see in the code below:
```python
# StableDiffusionInpaintPipeline _encode_prompt function:
if do_classifier_free_guidance and negative_prompt_embeds is None:
    uncond_tokens: List[str]
    if negative_prompt is None:
        uncond_tokens = [""] * batch_size
    ...
```

```python
# StableDiffusionInpaintPipeline __call__ function:
# perform guidance
if do_classifier_free_guidance:
    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)
```
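In other words, that last line implements the standard guidance combination

$$\tilde{\epsilon}_\theta(x_t, c) = \epsilon_\theta(x_t, \varnothing) + w \cdot \left(\epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing)\right)$$

where $w$ is `guidance_scale` and $\varnothing$ is the empty-string condition. As far as I understand, this only works as intended if the model has also seen the unconditional input during training.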
Here is the code showing the textual information used during training. We can see that the text is just the raw text from the dataset captions, and it is never set to the empty string with some probability such as 50%. So I guess we don't use classifier-free guidance during the fine-tuning process:

https://github.com/cloneofsimo/lora/blob/bdd51b04c49fa90a88919a19850ec3b4cf3c5ecd/lora_diffusion/cli_lora_pti.py#L315-L324
I'm also wondering: can we treat the image condition the same way as the textual condition and apply classifier-free guidance to it? In the `StableDiffusionInpaintPipeline` `__call__` function I don't see classifier-free guidance for the image condition; only the textual information is used for classifier-free guidance.
In the code below, we can see that `latent_model_input` is just concatenated with `mask` and `masked_image_latents`. Maybe we could also use classifier-free guidance for the image information:

https://github.com/cloneofsimo/lora/blob/bdd51b04c49fa90a88919a19850ec3b4cf3c5ecd/lora_diffusion/cli_lora_pti.py#L306-L313
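As a rough illustration of what I mean, here is a minimal sketch of image-condition dropout during training; the tensor names mirror the pipeline's, but `image_drop_prob` and the zeroing strategy are my own assumptions:

```python
import torch

def maybe_drop_image_condition(
    mask: torch.Tensor,
    masked_image_latents: torch.Tensor,
    image_drop_prob: float = 0.1,  # assumed dropout probability
):
    # Hypothetical sketch: with some probability, replace the image condition
    # with all-zero tensors so the model also learns to denoise without it.
    if torch.rand(1).item() < image_drop_prob:
        mask = torch.zeros_like(mask)
        masked_image_latents = torch.zeros_like(masked_image_latents)
    return mask, masked_image_latents
```

At inference time one could then `chunk` the batch the same way the text path does, using the zeroed image condition as the unconditional half.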
Thank you for reading all of the above. I am currently conducting related experiments myself, and I hope you can provide some valuable suggestions!
In your code, I see the definition of the parameter `blur_amount`, but I don't see it being used:

https://github.com/cloneofsimo/lora/blob/bdd51b04c49fa90a88919a19850ec3b4cf3c5ecd/lora_diffusion/cli_lora_pti.py#L853
I think maybe you forgot to pass it through here, because the official definition of the function being called does take a blur amount as an argument:

https://github.com/cloneofsimo/lora/blob/bdd51b04c49fa90a88919a19850ec3b4cf3c5ecd/lora_diffusion/dataset.py#L214-L219
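For illustration, here is the kind of fix I have in mind; the helper name `blur_mask` is mine, and I'm only assuming the blur is a PIL `GaussianBlur`:

```python
from PIL import Image, ImageFilter

def blur_mask(mask: Image.Image, blur_amount: float) -> Image.Image:
    # Hypothetical sketch: apply the user-specified blur_amount instead of
    # a hard-coded default, so the CLI parameter actually has an effect.
    return mask.filter(ImageFilter.GaussianBlur(radius=blur_amount))
```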
My English writing is not good; I hope you can forgive me.
Finally, thank you for reading, and I look forward to your reply!