Is it possible to train TTS for a new language?

AigizK commented 18 hours ago

Thank you for your work. I would like to inquire about the possibility of training for a new language. If this is feasible, could you please provide more details on the following:

How much data is required?
In what format should the data be?
What resources are approximately needed to achieve results comparable to those for the English language?

Your insights on this matter would be greatly appreciated. Thank you in advance for your assistance.

ScottishFold007 commented 17 hours ago

1. Required Data Volume

Data Volume: Approximately 95K hours of English and Chinese data were used to train the base model. A high-quality TTS system needs a large amount of data; for stable performance, plan on at least 10,000 hours of speech data in the target language.

2. Data Format

Audio Format: The audio files should be in mono WAV format, with a sampling rate of 24kHz, using 100-dimensional log Mel spectrogram features, and a frame hop length of 256.
Text Processing: English uses letters and symbols directly; Chinese characters are processed into complete pinyin through Jieba segmentation and pypinyin.
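
To get a feel for what these audio settings mean for sequence length, here is a minimal sketch that converts a clip duration into an approximate number of Mel frames. The function names are hypothetical, and the frame count ignores edge padding, so the exact value from a real feature extractor may differ by a frame or two.

```python
SAMPLE_RATE = 24_000  # Hz, as stated above
HOP_LENGTH = 256      # samples between consecutive Mel frames
N_MELS = 100          # Mel channels per frame


def frames_per_second(sample_rate: int = SAMPLE_RATE,
                      hop_length: int = HOP_LENGTH) -> float:
    """Number of Mel frames produced per second of audio."""
    return sample_rate / hop_length


def approx_num_frames(duration_s: float,
                      sample_rate: int = SAMPLE_RATE,
                      hop_length: int = HOP_LENGTH) -> int:
    """Approximate frame count for a clip.

    Assumption: one frame per full hop, with no padding at the edges.
    """
    num_samples = int(duration_s * sample_rate)
    return num_samples // hop_length


if __name__ == "__main__":
    print(frames_per_second())      # frames per second at 24 kHz, hop 256
    print(approx_num_frames(10.0))  # approximate frames in a 10 s clip
```

Each frame is a vector of `N_MELS` values, so a 10-second clip becomes a matrix of roughly 937 x 100 features under these assumptions.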

3. Required Resources

Computational Resources: Training of the base model was conducted on 8 NVIDIA A100 80G GPUs, lasting more than a week. This demonstrates the significant computational resources required to train a high-quality TTS model.
Model Configuration: The base model includes 22 layers, 16 attention heads, and embedding/feed-forward network (FFN) dimensions of 1024/2048. The configuration of the small model is detailed in Appendix B.1.
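
As a rough illustration, the base-model numbers above can be collected in a small config object. The class and field names are hypothetical and are not taken from the repository; the per-head dimension assumes standard multi-head attention, where the embedding is split evenly across heads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseModelConfig:
    """Hypothetical container for the base-model sizes quoted above."""
    num_layers: int = 22
    num_heads: int = 16
    embed_dim: int = 1024
    ffn_dim: int = 2048

    @property
    def head_dim(self) -> int:
        # Assumption: the embedding is divided evenly among attention heads.
        if self.embed_dim % self.num_heads != 0:
            raise ValueError("embed_dim must be divisible by num_heads")
        return self.embed_dim // self.num_heads


if __name__ == "__main__":
    cfg = BaseModelConfig()
    print(cfg.head_dim)
```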

4. Results Comparable to English

Model Performance: To ensure high-quality TTS comparable to English, a large amount of training data and advanced model architecture were used. In addition, various test sets were used for evaluation, including LibriSpeech-PC test-clean, Seed-TTS test-en, and Seed-TTS test-zh, to ensure the model's wide applicability and fairness of comparison.

5. Training Details

Optimizer: The AdamW optimizer was used, with a peak learning rate of 7.5e-5, linear warm-up for the first 20K updates, followed by linear decay.
Regularization: Attention and FFN use a dropout rate of 0.1 to prevent overfitting.
Data Augmentation: Random masking of 70% to 100% of Mel spectrogram frames for fill-in-the-blank training tasks.
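
The schedule and masking described above can be sketched in plain Python. This is only an illustration under stated assumptions: `total_updates` is not given in this thread, so it is a required argument; the decay is assumed to reach zero at the last update; and the masked region is assumed to be one contiguous span. All function names are hypothetical.

```python
import random


def learning_rate(step: int, total_updates: int,
                  peak_lr: float = 7.5e-5,
                  warmup_updates: int = 20_000) -> float:
    """Linear warm-up to peak_lr, then linear decay.

    Assumption: the decay ends at 0 when step == total_updates.
    """
    if total_updates <= warmup_updates:
        raise ValueError("total_updates must exceed warmup_updates")
    if step < warmup_updates:
        return peak_lr * step / warmup_updates
    remaining = max(total_updates - step, 0)
    return peak_lr * remaining / (total_updates - warmup_updates)


def sample_mask_span(num_frames: int, min_frac: float = 0.7,
                     max_frac: float = 1.0, rng=None) -> tuple:
    """Pick a [start, end) span covering 70% to 100% of the frames.

    Assumption: the mask is a single contiguous span.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be positive")
    rng = rng or random
    frac = rng.uniform(min_frac, max_frac)
    mask_len = max(1, min(num_frames, round(frac * num_frames)))
    start = rng.randint(0, num_frames - mask_len)  # randint is inclusive
    return start, start + mask_len


if __name__ == "__main__":
    # total_updates=400_000 is an illustrative value, not from the paper.
    for step in (0, 10_000, 20_000, 400_000):
        print(step, learning_rate(step, total_updates=400_000))
    print(sample_mask_span(937, rng=random.Random(0)))
```

During training, the frames inside the sampled span would be hidden from the model, which then has to reconstruct them from the surrounding audio and the text.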

SWivid commented 17 hours ago

All training details are given in our paper.

And you could simply train your own model for a new language:

Leverage the Emilia dataset (DE, EN, FR, JA, KO, ZH); we have included a script for it. (NOTE: download the version of Emilia mentioned in the script, because the dataset has since been updated to a WebDataset version.)
Or, if your language is not covered, prepare your own data pairs and tailor a Dataset class in model/dataset.py to your needs.
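
For the second option, a custom dataset might look roughly like the sketch below. It is not the interface used in model/dataset.py: the class name, the `path|text` metadata layout, and the returned dictionary keys are all assumptions made for illustration. It only implements `__len__` and `__getitem__`, the two methods a map-style dataset needs, and leaves audio loading and feature extraction to the training code.

```python
from pathlib import Path


class PairedSpeechTextDataset:
    """Hypothetical dataset of (audio path, transcript) pairs.

    Assumption: the metadata file has one "relative/path.wav|transcript"
    entry per line, with paths relative to the metadata file's folder.
    """

    def __init__(self, metadata_path: str, delimiter: str = "|"):
        self.root = Path(metadata_path).parent
        self.items = []
        with open(metadata_path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line or delimiter not in line:
                    continue  # skip blank or malformed lines
                rel_path, text = line.split(delimiter, 1)
                self.items.append((rel_path, text))

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, index: int) -> dict:
        rel_path, text = self.items[index]
        return {"audio_path": str(self.root / rel_path), "text": text}
```

Splitting on the first delimiter only means a transcript may itself contain `|` without breaking the entry.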

For the Base model (multilingual, ~300M), we use <50K hours for each language (EN, ZH).
For the Small model (e.g. Chinese-only, ~150M), we have made it work with just 1K hours of data; the config is also given in our paper.

One caveat: training takes a long time, especially for E2 TTS (if you choose it). Be patient: the small model takes about one week on 8 x RTX 3090 (200~400K updates to hear something reasonable), and the base model takes similarly long on 8 x A100.

SWivid / F5-TTS