Text-to-Audio / Make-An-Audio

PyTorch Implementation of Make-An-Audio (ICML'23) with a Text-to-Audio Generative Model
MIT License

Make-An-Audio 2 1D VAE #5

Open MoayedHajiAli opened 6 months ago

MoayedHajiAli commented 6 months ago

Hello,

I noticed that the Make-An-Audio 2 paper does not report the reconstruction loss of your trained 1D VAE in comparison with the 2D one. Were you able to reach similar reconstruction performance with the 1D VAE, or was its reconstruction inferior while the overall generation quality was better?

Thank you for your help.

Darius-H commented 5 months ago

The reconstruction performance is slightly worse than the 2D VAE's, but it works better when paired with diffusion. Training a 1D VAE against 2D discriminators is prone to instability, because the 2D PatchGAN is very strong, and this leads to overly smooth results. So we use R1 regularization and set `r1_reg_weight=3`, `disc_factor=2` to stabilize training: https://github.com/Text-to-Audio/Make-An-Audio/blob/8d4f84e6db5cb383673de3d63510410bc7deb037/ldm/modules/losses_audio/contperceptual.py#L54C1-L57C43
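For readers unfamiliar with R1 regularization: it penalizes the squared gradient norm of the discriminator on real samples, which keeps a strong discriminator from overpowering the VAE. A minimal PyTorch sketch is below; the toy discriminator and hinge loss are illustrative assumptions, not the repo's actual `contperceptual.py` loss, though the hyperparameter names mirror the comment above.

```python
import torch

def r1_penalty(disc, real):
    """R1 gradient penalty: mean squared gradient norm of D on real samples."""
    real = real.detach().requires_grad_(True)
    logits = disc(real)
    grads, = torch.autograd.grad(outputs=logits.sum(), inputs=real,
                                 create_graph=True)
    return grads.pow(2).reshape(grads.size(0), -1).sum(1).mean()

# hyperparameters named as in the comment above
r1_reg_weight = 3.0
disc_factor = 2.0

# toy stand-in for the 2D patch discriminator (hypothetical)
disc = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(16, 1))
real = torch.randn(4, 1, 4, 4)
fake = torch.randn(4, 1, 4, 4)

# hinge discriminator loss plus the weighted R1 term
d_loss = (torch.relu(1.0 - disc(real)).mean()
          + torch.relu(1.0 + disc(fake)).mean())
total = disc_factor * d_loss + r1_reg_weight * r1_penalty(disc, real)
```

Scaling the adversarial term by `disc_factor` while adding the R1 penalty trades a bit of discriminator sharpness for stable VAE training.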

MoayedHajiAli commented 5 months ago

Thank you very much @Darius-H for your help. I would appreciate it if you could share the reconstruction loss for the 1D VAE, if it is available to you. Looking forward to the full training code and training configuration.

Darius-H commented 3 months ago

> Thank you very much @Darius-H for your help. I would appreciate it if you can share the reconstruction loss for the 1d VAE if it is available to you. Looking forward for the full training code and training configuration.

Make-An-Audio 2 is released in https://github.com/bytedance/Make-An-Audio-2. The loss figure of 1D VAE:

[Image: training loss curve of the 1D VAE]