Hello. I recently trained a tortoise-tts model using 300 hours of Chinese speech data, with an average duration of 13 seconds per sample. However, I encountered a very peculiar issue: the model's mel/text loss decreases normally on the training set, but sharply increases on the validation set. It seems that the model is overfitting.
Here are my parameter settings:
Learning Rate = 0.0001
Mel LR Ratio = 1
Text LR Ratio = 1
Learning Rate Scheme = Cosine Annealing
Learning Rate Restarts = 4
Batch Size = 128
Gradient Accumulation Size = 1
Validation Enabled = True (so I can observe the overfitting phenomenon).
The loss curves I observed are:
When I reduced the learning rate to 5e-5, the problem was somewhat alleviated, but the trained model still generalizes poorly on the validation set.
As for the Chinese speech corpus, I used the G2PW-pinyin module from the Bert-VITS2 repo (Extra-Fix branch) to convert the Chinese characters into the corresponding pinyin phonemes. Some data points look like:
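To make the data format concrete: the conversion described above maps each Chinese character to a numbered-tone pinyin token. The sketch below is only an illustration of that output format using a toy lookup table; the actual pipeline uses the G2PW-pinyin module from Bert-VITS2 (Extra-Fix branch), whose API is not reproduced here.

```python
# Toy stand-in for the G2PW-pinyin conversion step (illustrative only).
# The real G2PW module disambiguates polyphonic characters in context;
# this fixed table just shows the character -> "pinyin + tone digit" format.
G2P_TABLE = {"你": "ni3", "好": "hao3", "世": "shi4", "界": "jie4"}

def to_pinyin_phonemes(text: str) -> str:
    # Map each character to its numbered-tone pinyin token, space separated.
    return " ".join(G2P_TABLE.get(ch, "<unk>") for ch in text)

print(to_pinyin_phonemes("你好世界"))  # -> ni3 hao3 shi4 jie4
```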