Bai-YT / ConsistencyTTA

ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
MIT License

Question from Consistency Model Beginner #1

Closed Lxr199 closed 1 month ago

Lxr199 commented 2 months ago

Thank you very much for your excellent work!

I am a beginner currently learning about diffusion models. I would like to ask: was the teacher model used to train the consistency model distilled from TANGO (which was trained with a DDPM schedule) using the DDIM method? If I want to perform a similar consistency distillation, must my teacher be trained with a DDIM schedule, or can it be trained with a DDPM schedule instead?

Bai-YT commented 2 months ago

Hi! The teacher can be trained with DDPM (we used DDPM, as in the TANGO paper). During distillation, we recommend using the Heun solver to query the teacher; it worked better than DDIM in our experiments.
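For intuition, here is a minimal NumPy sketch (not the repository's code) of one Heun step; `denoise(x, sigma)` is a hypothetical teacher that predicts the clean sample. Note that the teacher is queried twice per step:

```python
import numpy as np

def heun_step(denoise, x, sigma, sigma_next):
    """One 2nd-order Heun ODE step from noise level `sigma` down to `sigma_next`.

    `denoise(x, sigma)` is the teacher's estimate of the clean sample x0.
    The teacher is queried twice: once for the Euler slope at the current
    point, and once for the corrector slope at the Euler prediction.
    """
    d = (x - denoise(x, sigma)) / sigma           # dx/dsigma at the current point
    x_euler = x + d * (sigma_next - sigma)        # Euler prediction
    if sigma_next == 0:                           # final step: plain Euler
        return x_euler
    d2 = (x_euler - denoise(x_euler, sigma_next)) / sigma_next
    return x + 0.5 * (sigma_next - sigma) * (d + d2)  # trapezoidal average of slopes
```

The second teacher query is what makes Heun second-order, so the teacher trajectory is more accurate per step than a single-query solver like DDIM at the same step count.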

Lxr199 commented 2 months ago

Thank you so much for your kind reply! I have a further question related to the concept and theory:

I am a bit confused about the Heun solver, particularly the Karras part. I am exploring how to adapt a custom model architecture to the general logic of your code and apply consistency distillation to my task (which does not involve image/audio or a UNet, so I must modify the code). My pre-trained model was trained and sampled following Karras et al., which sets `sigma_min` and `sigma_max` instead of using betas as the default Heun solver does. How should I understand this difference? Can I directly switch my previous Karras schedule to the official Heun solver in Hugging Face for training and sampling by slightly altering the noise-addition and timestep-embedding logic?

And one last quick question: is it feasible to distill the consistency model by directly predicting the denoised sample instead of the added noise? (My pre-trained model performs well when directly predicting denoised samples.)

I apologize for the lengthy questions. Thank you again for your wonderful work! It really inspires me a lot!

Bai-YT commented 1 month ago

Yes, the Heun solver is fully compatible with the Karras schedule. We used Hugging Face's diffusers package for the schedulers; its API allows selecting a uniform or Karras schedule, and also allows switching between $\epsilon$ and $x_0$ prediction. The documentation has a detailed discussion of this.
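As a concrete illustration (a sketch, not diffusers internals): the Karras schedule interpolates noise levels between `sigma_max` and `sigma_min` in $\rho$-space, and under the parameterization $x_t = x_0 + \sigma \epsilon$ the two prediction targets convert into each other directly:

```python
import numpy as np

def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """Noise levels from Karras et al. (EDM), descending from sigma_max to sigma_min."""
    ramp = np.linspace(0.0, 1.0, n)
    inv_rho = 1.0 / rho
    return (sigma_max**inv_rho + ramp * (sigma_min**inv_rho - sigma_max**inv_rho)) ** rho

# With x_noisy = x0 + sigma * eps, epsilon- and x0-prediction are interchangeable:
def eps_from_x0(x_noisy, x0_pred, sigma):
    return (x_noisy - x0_pred) / sigma

def x0_from_eps(x_noisy, eps_pred, sigma):
    return x_noisy - sigma * eps_pred
```

In diffusers, this roughly corresponds to passing `use_karras_sigmas=True` and the desired `prediction_type` (e.g. `"sample"` for $x_0$ prediction) when constructing the scheduler; check the documentation for your installed version, since the exact argument names may differ across releases.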

Lxr199 commented 1 month ago

Thank you so much! I have successfully integrated my pretrained model into your code thanks to your work.

However, when I proceed to training, I notice that the loss does not decrease at all. I have been trying to identify the issue for several days without progress. My pre-trained model strictly follows the Karras EDM schedule and achieves very good generation results with a 20+ step sampling process using the Karras schedule. I understand this issue could arise from many different factors, but could you share some of your experiences from the training process? Lastly, thank you again for the very clear code structure and your patience. Excellent work!

Bai-YT commented 1 month ago

My observation has been that the consistency loss decreases during the first epoch on AudioCaps. After that, the loss becomes relatively stagnant and does not go to zero. Despite this, the generation quality of the consistency student model continues to improve.
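One way to see why the loss can plateau without reaching zero: the target is produced by an EMA copy of the student itself, so it moves as training progresses. A toy NumPy sketch of one loss evaluation (hypothetical names; `teacher_step` stands for one solver step of the frozen teacher, e.g. a Heun step):

```python
import numpy as np

def consistency_loss(student, ema_student, teacher_step, x0, s_cur, s_next, rng):
    """Consistency distillation loss for one noise-level pair (s_cur > s_next).

    The student at the noisier point is matched to the EMA student evaluated
    at the teacher's one-step-denoised point; no gradient flows into the target.
    """
    eps = rng.standard_normal(np.shape(x0))
    x_noisy = np.asarray(x0) + s_cur * eps          # point on the diffusion trajectory
    x_prev = teacher_step(x_noisy, s_cur, s_next)   # frozen teacher moves one step down
    pred = student(x_noisy, s_cur)
    target = ema_student(x_prev, s_next)            # treated as a constant (stop-gradient)
    return float(np.mean((pred - target) ** 2))
```

Since `ema_student` trails the student, the target keeps shifting even as generation quality improves, which is consistent with a stagnant-but-nonzero loss curve.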

Lxr199 commented 1 month ago

Good news: my model also works pretty well now. Thank you so much for your exceptional support!

Bai-YT commented 1 month ago

No problem. Glad to hear that!