IIEleven11 / StyleTTS2FineTune

insights on how to decide params for new dataset #12

Closed · anush97 closed this issue 5 months ago

anush97 commented 5 months ago

Hi, this isn't an issue as such, but I wanted to know: how did you decide on the parameter values for fine-tuning the respective datasets? Could you share any tips for beginners on how to adjust parameters effectively so they work well with the data? Also, I saw you are currently working on XTTS; how would you say StyleTTS compares?

IIEleven11 commented 5 months ago

Which parameters are you talking about? The silence buffers and insertion? It's really a trial-and-error thing, and it really sucks. And yeah, I noticed CoquiAI's default script had a bunch of issues with the hyperparameters; I'm attempting to fix them. Compared with StyleTTS2 they're close in quality, except XTTSv2 requires little to no money to train. StyleTTS2 can't be trained on consumer-grade hardware, and it's at least 80 bucks per model. Plus, StyleTTS2 isn't easy to train either, so you'll likely fail, and that's money down the hole.
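
To make the "silence buffers and insertion" step concrete, here is a minimal sketch of padding each dataset segment with leading and trailing silence, assuming pydub is available. The 50 ms buffer length and the directory names are illustrative assumptions, not values from this repo; the right buffer length is exactly the trial-and-error part.

```python
# Minimal sketch of silence-buffer insertion for dataset segments.
# Assumes pydub is installed (pip install pydub) and ffmpeg is on PATH.
# The 50 ms buffer and the paths are illustrative, not values from this repo.
from pathlib import Path

from pydub import AudioSegment

BUFFER_MS = 50  # illustrative; tune by trial and error per dataset


def pad_segment(in_path: Path, out_path: Path, buffer_ms: int = BUFFER_MS) -> None:
    """Prepend and append a short silence buffer to one audio segment."""
    audio = AudioSegment.from_file(in_path)
    silence = AudioSegment.silent(duration=buffer_ms, frame_rate=audio.frame_rate)
    (silence + audio + silence).export(out_path, format="wav")


if __name__ == "__main__":
    src, dst = Path("segments"), Path("segments_padded")
    dst.mkdir(exist_ok=True)
    for wav in sorted(src.glob("*.wav")):
        pad_segment(wav, dst / wav.name)
```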

anush97 commented 5 months ago

I was referring to the parameters in the config.yml file. If I want to use StyleTTS to fine-tune on a different dataset, how should I decide on the parameter values? Could you share any observations or tips from verifying that these values worked for your dataset? I am aiming to build a basic synthesis script that demonstrates voice cloning. Any useful tips?

Regarding XTTS, I have worked with it and fine-tuned it before. However, I found its detailed architecture challenging to understand. Did you have any success with it? Despite its input duration limitations, I found it relatively easy to fine-tune for multiple languages.

IIEleven11 commented 5 months ago

The best voice I've ever heard is one of my old XTTSv2 models. But as of right now, this new model will ALWAYS overfit around the 6th epoch. I have been at it for days trying to narrow down the issue, and I'm getting a bit frustrated; it appears to be something inherently wrong with how they left the training script. Yeah, I agree they do write some complex code. It's been quite the process.

There are a lot of parameters for STTS2, but your main target should be to extend max_len as high as you can. That means lowering batch_size and therefore a much longer training period. I think the consensus is that around max_len=8 is a solid target. I've successfully trained on max_len=6 before, though I had to drop batch_size to 2 and training took about a week. I can't really give specific values, as they will always be different for everyone and every dataset. It's mostly trial and error.
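
To make that concrete, here is an illustrative excerpt of the relevant fields in the fine-tuning config.yml. The field names (epochs, batch_size, max_len) match StyleTTS2's fine-tuning config, but the values are assumptions for a memory-constrained GPU, not recommendations. One caveat: in the stock config, max_len appears to be a count of mel frames rather than seconds (roughly 80 frames per second with the default 24 kHz / 300-sample-hop preprocessing), so if you think of the target as 6 or 8 seconds, convert accordingly.

```yaml
# Illustrative values only; tune per dataset and GPU (assumptions, not advice).
epochs: 50        # as in the stock fine-tuning config
batch_size: 2     # lowered so a longer max_len fits in VRAM
max_len: 640      # mel frames; ~8 s at 24 kHz with a 300-sample hop
```

The trade-off is exactly as described above: raising max_len increases per-sample memory, so batch_size has to drop to compensate, and the smaller batches stretch out wall-clock training time.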