Hello. I recently trained a tortoise-tts model using 300 hours of Chinese speech data, with an average duration of 13 seconds per sample. However, I encountered a very peculiar issue: the model's mel/text loss decreases normally on the training set, but sharply increases on the validation set. It seems that the model is overfitting.
Here are my parameter settings:
Learning Rate = 0.0001
Mel LR Ratio = 1
Text LR Ratio = 1
Learning Rate Scheme = Cosine Annealing
Learning Rate Restarts = 4
Batch Size = 128
Gradient Accumulation Size = 1
Validation Enabled = True (so I can observe the overfitting phenomenon).
The loss curves I observed are:
When I reduced the learning rate to 5e-5, the problem was somewhat alleviated, but the trained model still generalizes poorly on the validation set.
As for the Chinese speech corpus, I used the G2PW-pinyin module from the Bert-VITS2 repo (Extra-Fix branch) to convert the Chinese characters into the corresponding pinyin phonemes. Some data points look like:
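To make the data format concrete: the conversion described above maps each Chinese character to a numbered-tone pinyin token. The sketch below is only an illustration of that output format using a toy lookup table; the actual pipeline uses the G2PW-pinyin module from Bert-VITS2 (Extra-Fix branch), whose API is not reproduced here.

```python
# Toy stand-in for the G2PW-pinyin conversion step (illustrative only).
# The real G2PW module disambiguates polyphonic characters in context;
# this fixed table just shows the character -> "pinyin + tone digit" format.
G2P_TABLE = {"你": "ni3", "好": "hao3", "世": "shi4", "界": "jie4"}

def to_pinyin_phonemes(text: str) -> str:
    # Map each character to its numbered-tone pinyin token, space separated.
    return " ".join(G2P_TABLE.get(ch, "<unk>") for ch in text)

print(to_pinyin_phonemes("你好世界"))  # -> ni3 hao3 shi4 jie4
```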