anhnh2002 / XTTSv2-Finetuning-for-New-Languages

DVAE vs non-DVAE finetuning quality output #1

Closed C00reNUT closed 1 month ago

C00reNUT commented 1 month ago

Hello,

thank you for making this public!

May I ask, based on your experiments, how much of a difference finetuning the DVAE (versus not finetuning it) has made to the final model performance?

I know this is probably hard to measure exactly, but it would be nice to know whether it's 10-20 percent better, much better, or just a very slight, almost unnoticeable quality gain...

Thank you!

anhnh2002 commented 1 month ago

Thank you for your interest in our work!

From my experience, there are two main benefits to finetuning the DVAE:

  1. The primary advantage is that it addresses the short text issue. This problem in the model provided by Coqui may be due to the DVAE being trained only on audio samples longer than 3 seconds.

  2. The second benefit is particularly noticeable if your target language has significantly different phonetics compared to the languages supported by Coqui. In such cases, finetuning can improve the results by approximately 20%.

It's worth noting that finetuning the DVAE is not a very time-consuming process. As a reference, it takes about 1 hour to finetune on 100 hours of audio using an A100 40GB GPU.
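To check whether the short-audio hypothesis in point 1 is relevant to your own data, it can help to count how many clips fall under the 3-second mark before finetuning. The following is only a minimal sketch, not code from this repository: the helper names `clip_duration_seconds` and `short_clip_report` are hypothetical, and it assumes the dataset is a directory of uncompressed PCM WAV files.

```python
import wave
from pathlib import Path


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def short_clip_report(wav_dir, threshold_s=3.0):
    """Count clips under wav_dir (searched recursively) shorter than threshold_s.

    The 3-second default mirrors the threshold mentioned above; it is an
    assumption for illustration, not a documented Coqui setting.
    """
    durations = [
        clip_duration_seconds(p) for p in sorted(Path(wav_dir).rglob("*.wav"))
    ]
    short = [d for d in durations if d < threshold_s]
    return {
        "total": len(durations),
        "short": len(short),
        "short_fraction": len(short) / len(durations) if durations else 0.0,
    }
```

Calling `short_clip_report("path/to/wavs")` (the path is a placeholder) returns the total clip count, the number of short clips, and their fraction. If the short fraction is close to zero, the DVAE will see few short samples during finetuning as well.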

I hope this information helps give you an idea of the potential improvements. If you find our code helpful, please consider giving it a star~

C00reNUT commented 1 month ago

Thank you for providing such detailed information. I didn't know that Coqui finetuned/trained the DVAE; I just assumed they used the original Tortoise one.

anhnh2002 commented 1 month ago

> Thank you for providing such detailed information. I didn't know that Coqui finetuned/trained the DVAE; I just assumed they used the original Tortoise one.

I guess so, but I'm not sure about that.

thivux commented 1 month ago

> This problem in the model provided by Coqui may be due to the DVAE being trained only on audio samples longer than 3 seconds.

@nguyenhoanganh2002 are you sure that Coqui's DVAE was trained only on audio longer than 3 seconds? I tried running inference with the original XTTSv2 trained on 16 languages (https://huggingface.co/spaces/coqui/xtts), and it handles short text nicely.