RVC-Boss / GPT-SoVITS

1 min voice data can also be used to train a good TTS model! (few shot voice cloning)
MIT License
33.62k stars 3.85k forks

Japanese fine-tuning #241

Open Kamikadashi opened 8 months ago

Kamikadashi commented 8 months ago

From what I understand, the model currently requires fine-tuning on at least 2-3 hours of speech data to produce convincing results in Japanese. Is this correct? Additionally, is it necessary to fine-tune only the SoVITS model, or does the GPT model require it as well?

RVC-Boss commented 8 months ago

Tuning both SoVITS and GPT is better (higher similarity). But the best epoch number depends on experience: you can test the weights saved at each epoch during inference and choose the one that sounds best.
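The "test each epoch's saved weights" workflow can be sketched as a small helper. This assumes checkpoint filenames embed the epoch number the way GPT-SoVITS names them (e.g. `myvoice_e8_s200.pth`); the exact naming pattern is an assumption here, adjust the regex to your actual files.

```python
import re

def checkpoints_by_epoch(filenames):
    """Return (epoch, filename) pairs sorted by epoch number.

    Assumes the epoch is embedded as "_e<N>_" in each filename,
    as in GPT-SoVITS-style names like "myvoice_e8_s200.pth".
    """
    pairs = []
    for name in filenames:
        m = re.search(r"_e(\d+)_", name)
        if m:
            pairs.append((int(m.group(1)), name))
    return sorted(pairs)

# Audition each checkpoint in epoch order and keep the most natural one.
files = ["myvoice_e12_s300.pth", "myvoice_e4_s100.pth", "myvoice_e8_s200.pth"]
for epoch, path in checkpoints_by_epoch(files):
    print(epoch, path)
```

In practice you would load each listed checkpoint into the inference UI (or script) in turn, synthesize the same test sentence, and compare by ear.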

RVC-Boss commented 8 months ago

2-3 hours is definitely enough.

Kamikadashi commented 8 months ago

Thanks for the answer, I'll experiment. What does changing 文本模块学习率权重 (the text module learning rate weight) achieve?

As I understand it, Chinese currently requires less data to reach comparable quality. Will this improve for Japanese in the future? Is there an ETA?

RVC-Boss commented 8 months ago

文本模块学习率权重 (text module learning rate weight): during the fine-tuning stage, it reduces the learning rate of the text comprehension module to prevent overfitting, which would otherwise cause anomalous articulation.
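The idea can be sketched as a learning-rate multiplier applied only to text-module parameters while everything else trains at the full rate. The parameter-name prefix and the 0.4 weight below are illustrative assumptions, not the project's actual defaults:

```python
def effective_lrs(base_lr, text_lr_weight, param_names):
    """Map each parameter name to the learning rate it would train with.

    Parameters whose name starts with the (assumed) text-module prefix
    get base_lr * text_lr_weight; all others get the full base_lr.
    """
    text_prefix = "enc_p.text_embedding"  # hypothetical module name
    return {
        name: base_lr * text_lr_weight if name.startswith(text_prefix) else base_lr
        for name in param_names
    }

lrs = effective_lrs(
    base_lr=1e-4,
    text_lr_weight=0.4,  # a weight < 1 slows the text module's updates
    param_names=["enc_p.text_embedding.weight", "dec.conv.weight"],
)
```

In a real training loop this split would be expressed as optimizer parameter groups, each with its own `lr`; lowering the weight keeps the pretrained text understanding largely intact while the acoustic side adapts to the new voice.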

You can try a 10-minute Japanese fine-tune using the default epoch settings.