coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

VITS model gives bad results (training an italian tts model) #4017

Open iDavide opened 2 weeks ago

iDavide commented 2 weeks ago

Describe the bug

Hi everyone. I'm new to the world of ML, so I'm not used to training AI models... I really want to create my own TTS model using Coqui's VITS trainer, so I've done a lot of research about it. I configured the dataset parameters and configuration functions and then started training. For training I used almost 10 hours of audio spoken in Italian. After training I tried the model, but the result is not just bad, it's FAIRLY bad... The model doesn't even "speak" a language. Here is an example sentence: "input_text": "Oh, finalmente sei arrivato fin qui. Non è affatto comune che un semplice essere umano riesca a penetrare così profondamente nella mia dimora. Scarlet Devil Mansion non è un posto per i deboli di cuore, lo sapevi?"

(I do not recommend to listen to the audio at full volume.)

https://github.com/user-attachments/assets/b4039119-2666-455f-8ed7-6a0b05179f8f

The voice in the audio actually comes from an RVC model. I imported the model into a program that first runs TTS and then applies the weights of an RVC model to the generated audio. It's not an RVC problem, because I used this program with the same RVC and other TTS models (mostly in English and one in Italian) and they work well, especially the English ones.

To Reproduce

Here's my configuration:

Dataset config:

```python
output_path = "/content/gdrive/MyDrive/tts"
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="test.txt",
    path=os.path.join(output_path, "Dataset/"),
    language="it",
)
```

Dataset format:

```
wav_file|text|text

imalavoglia_00_verga_f000053|Milano, diciannove gennaio mille ottocento ottantuno.|Milano, diciannove gennaio mille ottocento ottantuno.
```
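Since the ljspeech formatter expects exactly three pipe-separated fields per line, a single malformed metadata line is an easy way to poison training. A small, hypothetical validator (plain Python, independent of the TTS library) for files in this format:

```python
# Hypothetical sanity check for an ljspeech-style metadata file:
# each line must be "wav_file|text|normalized_text" (3 non-empty fields).
def check_metadata(lines):
    """Return the 1-based numbers of lines that don't parse cleanly."""
    bad = []
    for i, line in enumerate(lines, 1):
        fields = line.rstrip("\n").split("|")
        if len(fields) != 3 or not all(f.strip() for f in fields):
            bad.append(i)
    return bad

sample = [
    "imalavoglia_00_verga_f000053|Milano, diciannove gennaio mille ottocento ottantuno.|Milano, diciannove gennaio mille ottocento ottantuno.",
]
print(check_metadata(sample))  # an empty list means every line parsed cleanly
```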

Audio:

```python
audio_config = VitsAudioConfig(
    sample_rate=22050,
    win_length=1024,
    hop_length=256,
    num_mels=80,
    mel_fmin=0,
    mel_fmax=None,
)
```

Characters:

```python
character_config = CharactersConfig(
    characters_class="TTS.tts.models.vits.VitsCharacters",
    characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890àèìòùÀÈÌÒÙáéíóúÁÉÍÓÚî",
    punctuations=" !,.?-'",
    pad="",
    eos="",
    bos="",
    blank="",
)
```
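One thing worth checking with a hand-written character set: any transcript character that appears in neither `characters` nor `punctuations` is not representable by the tokenizer, which silently corrupts the input sequence the model trains on. A minimal coverage check (plain Python, mirroring the two sets from the config above):

```python
# Mirror of the sets configured in CharactersConfig above.
characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890àèìòùÀÈÌÒÙáéíóúÁÉÍÓÚî"
punctuations = " !,.?-'"
covered = set(characters) | set(punctuations)

def uncovered(text):
    """Return transcript characters not covered by the configured sets."""
    return sorted(set(text) - covered)

print(uncovered("Milano, diciannove gennaio mille ottocento ottantuno."))
```

Running this over every transcript in `test.txt` would flag stray symbols (typographic dashes, curly quotes, digits spelled as symbols) before they reach training.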

General config:

```python
config = VitsConfig(
    audio=audio_config,
    characters=character_config,
    run_name="vits_vctk",
    batch_size=16,
    eval_batch_size=4,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=0,
    epochs=10,
    text_cleaner="multilingual_cleaners",
    use_phonemes=False,
    phoneme_language="it",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    compute_input_seq_cache=True,
    print_step=25,
    print_eval=False,
    save_best_after=1000,
    save_checkpoints=True,
    save_all_best=True,
    mixed_precision=True,
    max_text_len=250,
    output_path=output_path,
    datasets=[dataset_config],
    cudnn_benchmark=False,
    test_sentences=[
        "Qualcosa non va? Mi dispiace, hai voglia di parlarne a riguardo?",
        "Il mio nome è Remilia Scarlet. come posso aiutarti oggi?",
    ],
)
```
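For scale: with `epochs=10`, `batch_size=16`, and the 5798 instances reported in the DataLoader log further down, this run performs only a few thousand optimizer steps, while VITS recipes typically train for hundreds of thousands of steps. A quick back-of-envelope (assuming no samples are dropped by length filters):

```python
import math

num_samples = 5798   # "Number of instances" from the DataLoader log
batch_size = 16
epochs = 10

steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 363 steps/epoch, 3630 steps total
```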

Expected behavior

No response

Logs

No response

Environment

- TTS version: 0.22.0
- Python version: 3.10.9
- OS: Windows
- CUDA version: 11.8
- GPU: GTX 1650 with 4GB of VRAM
- All libraries were installed via pip

Additional context

Additionally, after a few days I tried using espeak phonemes, but the trainer.fit() function gets stuck at the beginning with this output:

```
> EPOCH: 0/10
 --> /content/gdrive/MyDrive/tts/vits_vctk-October-09-2024_08+23PM-0000000

> DataLoader initialization
| > Tokenizer:
    | > add_blank: True
    | > use_eos_bos: False
    | > use_phonemes: True
    | > phonemizer:
        | > phoneme language: it
        | > phoneme backend: espeak
| > Number of instances : 5798
/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
```
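The UserWarning in that log is worth acting on: the Colab VM reports only 2 CPUs, while the config asks for 4 loader workers, which PyTorch warns can slow down or even freeze the DataLoader. A tiny sketch for picking a safe value (the cap of 4 simply mirrors the `num_loader_workers` from the config above):

```python
import os

# Never ask for more DataLoader workers than the machine has CPUs;
# os.cpu_count() can return None, so fall back to a single worker.
suggested_workers = min(4, os.cpu_count() or 1)
print(suggested_workers)
```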

```
> TRAINING (2024-10-09 20:23:45)
 > Preprocessing samples
 | > Max text length: 167
 | > Min text length: 12
 | > Avg text length: 82.22473266643671
 | > Max audio length: 183618.0
 | > Min audio length: 24483.0
 | > Avg audio length: 82634.87443946188
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.
```
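As a sanity check, the audio lengths in this log are in samples at the configured 22050 Hz, so the logged stats imply how much audio the train split actually contains (simple arithmetic on the logged numbers, not a library call):

```python
sample_rate = 22050            # from VitsAudioConfig
num_clips = 5798               # "Number of instances" from the DataLoader log
avg_len_samples = 82634.87443946188  # "Avg audio length" above

total_hours = num_clips * avg_len_samples / sample_rate / 3600
print(round(total_hours, 2))   # roughly 6 hours of audio in the train split
```

Note this comes out at roughly 6 hours rather than the ~10 hours mentioned earlier, which may itself be worth investigating (dropped files, a partial metadata file, or eval split).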

```
/usr/local/lib/python3.10/dist-packages/torch/functional.py:666: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:873.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
```

iDavide commented 2 weeks ago

UPDATE: that wasn't a phonemes problem. The training wasn't loading because I hadn't enabled the Colab GPU. So I guess the problem is all in the configuration? I've now set epochs=2000 and I'll keep you updated if I come back with some results.

iDavide commented 4 days ago

I kept training the model and it has now done 33k steps. The avg_loss_1 curve on TensorBoard is converging to 25, but I'm noticing it's improving very slowly... Is there a way to use Colab's GPU efficiently? Am I doing the training right, based on the configuration? How many steps should I run in order to get a good VITS model?