adelacvg / ttts

Train the next generation of TTS systems.
Mozilla Public License 2.0

performance #19

Open yiwei0730 opened 1 month ago

yiwei0730 commented 1 month ago

I have a few questions:

  1. What data did you use to train your latest Chinese, English, Japanese, and Korean demo model, and what is the total duration of the training set?
  2. The demo audio seems to have some slight background noise. Can I reuse your checkpoint and continue training to get better quality?
  3. How good is the zero-shot performance of this model, and is it suitable for fine-tuning with a small amount of data?
adelacvg commented 1 month ago
  1. I use the open-source Genshin Impact dataset, which conveniently includes data in four languages.
  2. Noise might be difficult to entirely avoid, but I believe continuing training is feasible.
  3. I think the zero-shot capability is quite good in terms of timbre, but due to the limited training data, the prosody similarity is still not ideal. Hence, fine-tuning with a small amount of data might not yield good results.
yiwei0730 commented 1 month ago

To clarify the third question: for fine-tuning, I would use a small amount of data (under 2 minutes) to adapt the model to one speaker. Can good similarity and naturalness be achieved for just that speaker? I have done this with NS2, but the similarity was only so-so, and the output still had some noise.