Rongjiehuang / FastDiff

PyTorch Implementation of FastDiff (IJCAI'22)

Question about FastDiff-TTS #3

Closed LEECHOONGHO closed 2 years ago

LEECHOONGHO commented 2 years ago

Hello, thank you for sharing your code with the community. I'm trying to train a FastDiff-TTS model on my own dataset.

My model pronounces words well after 120k training steps, but the sound quality is not good yet, so I have some questions about FastDiff-TTS's behavior.

  1. I used a pre-derived noise schedule for sampling. If I keep using the pre-derived schedule, is FastDiff's sound quality limited?
  2. How many training steps are required for good sound quality or convergence?
  3. I can hear slight noise in your demo audio. Is there any way to remove it?
  4. Have you tried multi-speaker TTS with FastDiff-TTS?

An audio sample from my model is at the URL below. https://lime-honeycrisp-5e3.notion.site/Multi-speaker-FastDiff-TTS-5bae38d4562144059bf84651f603ff28

Thank you.

Rongjiehuang commented 2 years ago

Hi, thanks for your interest.

  1. Yes, with a pre-derived schedule a quality gap can appear, because the samples you generate differ from the ones used during the noise-schedule search (see the sampling sketch after this list).
  2. About 500k training steps are required.
  3. The noise can be alleviated by using a converged model, or by sampling with more denoising steps.
  4. I have not tried a multi-speaker version, but I think you could implement one by adding a speaker embedding (a minimal conditioning sketch follows at the end of this reply).

To address the issue in your demo, you could try more training steps with more GPUs (end-to-end TTS models typically require a large batch size for good convergence). For better quality, sampling with more denoising steps is also recommended.
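As for point 4, a minimal sketch of speaker conditioning is below. It simply sums a learned speaker embedding with the text-encoder output, broadcast over time, which is the most common scheme; `n_speakers`, `hidden_dim`, and the module name are hypothetical, not part of this repo.

```python
import torch
import torch.nn as nn

class SpeakerConditioner(nn.Module):
    """Adds a learned speaker embedding to the encoder output."""

    def __init__(self, n_speakers: int, hidden_dim: int):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, hidden_dim)

    def forward(self, encoder_out: torch.Tensor, speaker_id: torch.Tensor):
        # encoder_out: [batch, time, hidden_dim]; speaker_id: [batch]
        emb = self.spk_emb(speaker_id).unsqueeze(1)  # [batch, 1, hidden_dim]
        return encoder_out + emb                     # broadcast over time axis
```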

LEECHOONGHO commented 2 years ago

Thank you for your reply.

I'm sharing my theta loss for reference:

[theta loss curve image]

LEECHOONGHO commented 2 years ago