adelacvg / ttts

Train the next generation of TTS systems.
Mozilla Public License 2.0

some questions about training gpt #12

Closed. howitry closed this issue 6 months ago

howitry commented 8 months ago

Hello, thanks for your code contribution. I'm trying to train the GPT on my own datasets, and I have some questions:

  1. I use the same batch size as in the configuration file on 8 A100s. It takes about 3 hours to train 2,000 steps. Is this normal?
  2. Based on your experience, how low should the loss value be?
  3. I only changed the input phoneme sequence; the VQVAE is the same as the pre-trained model you provided, and I can get correct audio from the VQ decoder. But when I feed the latent into the pre-trained diffusion model and Vocos, I cannot get correct audio. Is this normal?
adelacvg commented 8 months ago
  1. The speed is a bit slow; you can check if it's an I/O bottleneck (see the sketch below this reply).
  2. Generally speaking, a GPT loss below 2 can generate meaningful results.
  3. This is normal; after fine-tuning GPT, you also need to fine-tune the diffusion model.
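
Regarding point 1, one quick way to tell whether the GPUs are being starved is to time data loading separately from compute. The sketch below is a minimal, repo-agnostic example: it assumes a standard PyTorch DataLoader and a `train_step(batch)` callable supplied by the caller (a placeholder for one forward/backward/optimizer step), not this repo's actual training script.

```python
import time
import torch

def profile_steps(dataloader, train_step, num_steps=50):
    """Rough split of wall time into data loading vs. compute.

    `train_step(batch)` is a placeholder for one optimizer step (forward,
    backward, step). If load_time dominates, the GPUs are being starved
    and the bottleneck is I/O (disk, decoding, or too few workers).
    """
    it = iter(dataloader)
    load_time = compute_time = 0.0
    for _ in range(num_steps):
        t0 = time.perf_counter()
        batch = next(it)                 # time spent waiting for data
        t1 = time.perf_counter()
        train_step(batch)                # hypothetical: one training step
        if torch.cuda.is_available():
            torch.cuda.synchronize()     # make GPU work visible to the timer
        t2 = time.perf_counter()
        load_time += t1 - t0
        compute_time += t2 - t1
    print(f"data loading: {load_time:.1f}s  compute: {compute_time:.1f}s")
```

If loading dominates, increasing the number of dataloader workers, caching decoded features, or moving data to faster storage usually helps more than changing the model configuration.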
howitry commented 8 months ago
> 1. The speed is a bit slow; you can check if it's an I/O bottleneck.
> 2. Generally speaking, a GPT loss below 2 can generate meaningful results.
> 3. This is normal; after fine-tuning GPT, you also need to fine-tune the diffusion model.

Thank you for your reply. Have you ever trained a HifiGAN directly from latent to wav? How much does removing the diffusion model affect the quality of the generated audio?

adelacvg commented 8 months ago

I didn't train HifiGAN from latent to wav because I believe diffusion is more controllable for some tasks. Xtts and GPT-SOVITS have trained a HifiGAN from latent to wav, so you might want to take a look at their work. In most cases, diffusion and HifiGAN can be used interchangeably.

howitry commented 8 months ago

> I didn't train HifiGAN from latent to wav because I believe diffusion is more controllable for some tasks. Xtts and GPT-SOVITS have trained a HifiGAN from latent to wav, so you might want to take a look at their work. In most cases, diffusion and HifiGAN can be used interchangeably.

I fine-tuned the VQVAE on 3,200+ hours of data. When training reaches 60k steps, the loss suddenly rises sharply, the commit loss and the gradient drop to close to 0, and the output mel spectrogram is abnormal. But the checkpoint from before step 60k is normal. Have you ever encountered this problem?

adelacvg commented 8 months ago

Yes, I've encountered this issue as well. The context-aware VQVAE in the v2 branch doesn't have this problem. You can try training on the v2 version of VQVAE.
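
For anyone hitting the same collapse, one cheap signal to log during VQVAE training is how many codebook entries are actually being used. The sketch below assumes you can get the batch's VQ code indices from the quantizer; the function name and codebook size are illustrative, not this repo's API.

```python
import torch

def codebook_usage(code_indices: torch.Tensor, codebook_size: int) -> float:
    """Fraction of codebook entries hit by one batch of VQ indices.

    A sudden drop toward a handful of codes, together with a commit loss
    near zero, is the usual signature of the collapse described above,
    and it often shows up in this metric before the mel outputs degrade.
    """
    counts = torch.bincount(code_indices.flatten(), minlength=codebook_size)
    return (counts > 0).float().mean().item()

# Illustrative usage inside a training loop (8192 is a placeholder size):
# usage = codebook_usage(vq_indices, codebook_size=8192)
# writer.add_scalar("vq/codebook_usage", usage, step)
```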

howitry commented 8 months ago

> Yes, I've encountered this issue as well. The context-aware VQVAE in the v2 branch doesn't have this problem. You can try training on the v2 version of VQVAE.

Nice! I noticed that the diffusion model structure is different from the one in Tortoise. Why did you choose the diffusion structure in aa_model.py? Is it faster or better than the diffusion model used by Tortoise?

adelacvg commented 8 months ago

The diffusion in Tortoise cannot achieve zero-shot, so I added modules such as MRTE to enhance its zero-shot capability, although I feel some of them may not have been effective.

zshy1205 commented 8 months ago

> The diffusion in Tortoise cannot achieve zero-shot, so I added modules such as MRTE to enhance its zero-shot capability, although I feel some of them may not have been effective.

I use this model for prosody transfer. For example, with speaker A's audio a.wav and speaker B's audio b.wav: I use a.wav as the condition for the GPT model and generate the latent, then I use this latent together with b.wav as the condition for the diffusion model. The generated wav's timbre is speaker A's, so the condition on the diffusion model is not effective. Does the GPT latent carry more speaker information?
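
To make the described setup concrete, here is a hedged sketch of the two-stage conditioning; `gpt`, `diffusion`, and `vocoder` are placeholder callables supplied by the caller, not this repo's actual interfaces.

```python
def prosody_transfer(gpt, diffusion, vocoder, text, a_wav, b_wav):
    """Sketch of the attempt described above: condition GPT on speaker A
    and diffusion on speaker B, hoping for B's timbre on A's prosody.

    All three models are hypothetical callables supplied by the caller.
    """
    latent = gpt.generate(text=text, cond_audio=a_wav)   # latent conditioned on speaker A
    mel = diffusion.sample(latent, cond_audio=b_wav)     # timbre condition from speaker B
    return vocoder(mel)

# Observation reported in this thread: the output still sounds like speaker A,
# suggesting the GPT latent already carries most of the speaker information.
```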

howitry commented 7 months ago

> The diffusion in Tortoise cannot achieve zero-shot, so I added modules such as MRTE to enhance its zero-shot capability, although I feel some of them may not have been effective.

When I trained the same diffusion model as the pre-trained one you currently provide, I found that the loss converged very quickly, but the quality of the generated audio at convergence is not very good: the pronunciation of the words is correct, but it is mixed with noise. The training-set and validation-set losses are shown below:

[screenshots: training and validation loss curves]

I want to know: (1) The provided pre-trained model diffusion-855.pt was trained for 855k steps; did its loss also converge quickly, and after convergence does the audio quality continue to improve as the number of training steps increases? (2) Does diffusion training also require a lot of data? I used 3,200 hours of data, and the loss still converged quickly. (3) Does normalize_tacotron_mel() need to be adjusted according to the training set? (See the sketch below.)
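
On question (3): in Tortoise-derived code, normalize_tacotron_mel() typically rescales log-mels into [-1, 1] using fixed global min/max constants rather than per-dataset statistics. Below is a minimal sketch for checking whether your dataset's mel range actually fits such constants; the values shown are the ones commonly used in Tortoise-style code and are an assumption here, so verify them against this repo's audio utilities.

```python
from typing import Iterable, Tuple

import torch

# Constants typically used by Tortoise-style normalize_tacotron_mel();
# assumed values, verify against this repo's audio utilities.
TACOTRON_MEL_MIN = -11.512925148010254   # ~ln(1e-5), the log-mel floor
TACOTRON_MEL_MAX = 2.3143386840820312

def normalize_mel(mel: torch.Tensor,
                  mel_min: float = TACOTRON_MEL_MIN,
                  mel_max: float = TACOTRON_MEL_MAX) -> torch.Tensor:
    """Rescale a log-mel spectrogram into [-1, 1]."""
    return 2.0 * (mel - mel_min) / (mel_max - mel_min) - 1.0

def observed_mel_range(mels: Iterable[torch.Tensor]) -> Tuple[float, float]:
    """Min/max log-mel values over an iterable of spectrogram tensors.

    If these fall well outside [mel_min, mel_max] above, the normalized
    mels leave [-1, 1] and the constants are worth re-deriving from the
    training set.
    """
    lo = min(float(m.min()) for m in mels)
    hi = max(float(m.max()) for m in mels)
    return lo, hi
```

If the mel extraction matches the pre-trained setup (same STFT parameters and log floor), the default constants usually cover the observed range; a mismatch here is one plausible source of noisy output after fine-tuning.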