- The speed is a bit slow; check whether it's an I/O bottleneck (see the loader-throughput sketch after this list).
- Generally speaking, a GPT loss below 2 can generate meaningful results.
- This is normal; after fine-tuning GPT, you also need to fine-tune the diffusion model.
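A quick way to check the I/O side is to time the DataLoader on its own: if iterating the loader is nearly as slow as a full training step, data loading is the bottleneck. A minimal sketch, assuming a standard PyTorch DataLoader:

```python
import time

def probe_loader(loader, n_batches=50):
    """Measure raw DataLoader throughput, independent of the model."""
    it = iter(loader)
    next(it)  # warm up worker processes and fill the prefetch queue
    start = time.perf_counter()
    for _ in range(n_batches):
        next(it)
    elapsed = time.perf_counter() - start
    print(f"{n_batches / elapsed:.1f} batches/s from the loader alone")
```

Compare this against your overall training steps/s; if the two are close, more workers, faster storage, or pre-extracted features will help more than anything on the GPU side.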
Thank you for your reply. Have you ever trained HifiGAN directly from latent to wav? How much impact does removing the diffusion model have on the quality of the generated audio?
I didn't train HifiGAN from latent to wav because I believe diffusion is more controllable for some tasks. XTTS and GPT-SoVITS have each trained a HifiGAN from latent to wav, so you might want to look at their work. In most cases, diffusion and HifiGAN can be used interchangeably.
I fine-tuned the VQVAE on 3200+ hours of data. When training reaches 60k steps, the loss suddenly spikes, the commit loss and the gradients suddenly drop to nearly zero, and the output mel spectrogram becomes abnormal, yet the checkpoint from before 60k steps is normal. Have you ever encountered this problem?
Yes, I've encountered this issue as well. The context-aware VQVAE in the v2 branch doesn't have this problem. You can try training on the v2 version of VQVAE.
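One way to catch this failure early is to log codebook usage and perplexity alongside the commit loss: when the commit loss falls to ~0 and the perplexity collapses toward 1, the encoder has collapsed onto a few codes. A minimal PyTorch sketch (the helper name is illustrative):

```python
import torch

def codebook_stats(indices, codebook_size):
    """Return the fraction of codes used in a batch and the code perplexity.

    indices: integer tensor of quantizer code assignments, any shape.
    """
    counts = torch.bincount(indices.flatten(), minlength=codebook_size).float()
    probs = counts / counts.sum()
    used_fraction = (counts > 0).float().mean().item()
    perplexity = torch.exp(-(probs * (probs + 1e-10).log()).sum()).item()
    return used_fraction, perplexity
```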
Nice! I noticed that the diffusion model's structure is different from the one in Tortoise. Why did you choose the diffusion structure in aa_model.py? Is it faster or better than the diffusion model Tortoise uses?
The diffusion in Tortoise cannot achieve zero-shot synthesis, so I added modules such as mrte to enhance its zero-shot capability, although I suspect some of them were not effective.
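For context, MRTE (the multi-reference timbre encoder from Mega-TTS) injects timbre from a reference prompt via cross-attention. A generic sketch of that idea, not the actual module in this repository:

```python
import torch
import torch.nn as nn

class ReferenceCrossAttention(nn.Module):
    """MRTE-style conditioning: decoder features attend to encoded
    reference-mel frames so timbre can come from an arbitrary prompt.
    Illustrative only; not this repository's implementation."""

    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, ref):
        # x:   (B, T, dim) content/decoder features
        # ref: (B, S, dim) encoded reference mel frames
        attended, _ = self.attn(query=x, key=ref, value=ref)
        return self.norm(x + attended)
```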
I use this model for prosody transfer. For example, given speaker A's audio a.wav and speaker B's audio b.wav, I use a.wav as the condition for the GPT model to generate the latent, then use that latent together with b.wav as the condition for the diffusion model. The generated wav's timbre is speaker A's, so the condition in the diffusion model has no effect. Does the GPT latent carry most of the speaker information?
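A sketch of the two-stage conditioning path described above; every name here is a hypothetical placeholder, not this repository's API. The point is that the diffusion stage can only recolor what the latent already encodes:

```python
# Hypothetical pseudocode -- all function names are placeholders.
cond_a = get_conditioning("a.wav")   # speaker A reference (GPT stage)
cond_b = get_conditioning("b.wav")   # speaker B reference (diffusion stage)

# Stage 1: the GPT only ever sees speaker A, so if its latent already
# encodes A's timbre, the diffusion condition cannot override it.
latent = gpt.generate(text_tokens, speaker_cond=cond_a)

# Stage 2: diffusion decodes the latent to a mel, conditioned on speaker B.
mel = diffusion_model.sample(latent, speaker_cond=cond_b)
wav = vocoder(mel)
```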
When I trained the same diffusion model as the pre-trained one you currently provide, I found that the loss converged very quickly, but at convergence the quality of the generated audio was not good: the pronunciation of the words is correct, but it is mixed with noise. The training-set and validation-set losses are as follows:
I want to know:
(1) The provided pre-trained model diffusion-855.pt was trained for 855k steps. Did its loss also converge quickly? After convergence, does the audio quality keep improving as the number of training steps increases?
(2) Does diffusion training also require a lot of data? I used 3200 hours of data, and it still converged quickly.
(3) Does normalize_tacotron_mel() need to be adjusted according to the training set?
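For reference on (3): in Tortoise, normalize_tacotron_mel() rescales the log-mel into [-1, 1] using fixed global min/max constants (roughly -11.51 and 2.31 for its mel settings). If your data's mel statistics differ, a sketch of recomputing the range, with illustrative helper names:

```python
import torch

def compute_mel_range(mels):
    """Scan training mels (each a (n_mels, T) tensor) for the global min/max."""
    mel_min, mel_max = float("inf"), float("-inf")
    for mel in mels:
        mel_min = min(mel_min, mel.min().item())
        mel_max = max(mel_max, mel.max().item())
    return mel_min, mel_max

def normalize_mel(mel, mel_min, mel_max):
    """Rescale into [-1, 1], mirroring what normalize_tacotron_mel does."""
    return 2 * (mel - mel_min) / (mel_max - mel_min) - 1

def denormalize_mel(norm_mel, mel_min, mel_max):
    """Invert normalize_mel back to the original log-mel scale."""
    return (norm_mel + 1) / 2 * (mel_max - mel_min) + mel_min
```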
Hello, thanks for your code contribution. I'm trying to train the GPT on my own datasets, and I have some questions: