coqui-ai / TTS

🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production
http://coqui.ai
Mozilla Public License 2.0

VITS model gives bad results (training an Italian TTS model) #4017

Open iDavide opened 1 month ago

iDavide commented 1 month ago

Describe the bug

Hi everyone. I'm new to the world of ML, so I'm not used to training AI models... I really want to create my own TTS model using coqui's VITS trainer, so I've done a lot of research about it. I configured some dataset parameters and configuration functions and then started training. For the training I used almost 10 hours of audio spoken in Italian. After training I tried the model, but the result is not just bad, it's REALLY bad... The model doesn't even "speak" a language. Here is an example of the input sentence: "Oh, finalmente sei arrivato fin qui. Non è affatto comune che un semplice essere umano riesca a penetrare così profondamente nella mia dimora. Scarlet Devil Mansion non è un posto per i deboli di cuore, lo sapevi?"

(I do not recommend listening to the audio at full volume.)

https://github.com/user-attachments/assets/b4039119-2666-455f-8ed7-6a0b05179f8f

The voice in the audio is actually from an RVC model. I imported the model into a program that runs TTS first and then applies the weights of an RVC model to the generated audio. It's not an RVC problem, because I used this program with the same RVC and other TTS models (mostly in English and one in Italian) and they work well, especially the English ones.

To Reproduce

Here's my configuration:

Dataset config:

```python
output_path = "/content/gdrive/MyDrive/tts"
dataset_config = BaseDatasetConfig(
    formatter="ljspeech",
    meta_file_train="test.txt",
    path=os.path.join(output_path, "Dataset/"),
    language="it",
)
```

Dataset format:

```
wav_file|text|text
imalavoglia_00_verga_f000053|Milano, diciannove gennaio mille ottocento ottantuno.|Milano, diciannove gennaio mille ottocento ottantuno.
```
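For reference, this is roughly how the `ljspeech` formatter interprets each metadata line (an illustrative sketch, not coqui's actual code: as far as I can tell, the formatter looks up `<wav_file>.wav` under a `wavs/` subfolder of the dataset path and uses the last column as the training text):

```python
import os

# Sketch of how a pipe-separated metadata line is interpreted by the
# ljspeech formatter (illustrative, not coqui's actual implementation).
root_path = "/content/gdrive/MyDrive/tts/Dataset"  # dataset path from the config
line = (
    "imalavoglia_00_verga_f000053"
    "|Milano, diciannove gennaio mille ottocento ottantuno."
    "|Milano, diciannove gennaio mille ottocento ottantuno."
)

cols = line.split("|")
wav_path = os.path.join(root_path, "wavs", cols[0] + ".wav")
text = cols[-1]  # the last column is used as the training text

print(wav_path)
print(text)
```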

Audio:

```python
audio_config = VitsAudioConfig(
    sample_rate=22050,
    win_length=1024,
    hop_length=256,
    num_mels=80,
    mel_fmin=0,
    mel_fmax=None,
)
```

Characters:

```python
character_config = CharactersConfig(
    characters_class="TTS.tts.models.vits.VitsCharacters",
    characters="ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890àèìòùÀÈÌÒÙáéíóúÁÉÍÓÚî",
    punctuations=" !,.?-'",
    pad="",
    eos="",
    bos="",
    blank="",
)
```
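One cheap sanity check is to make sure every character appearing in the transcripts is covered by `characters` plus `punctuations`, since unseen characters can be silently dropped and skew training. This is a hypothetical helper, not part of coqui TTS:

```python
# Hypothetical sanity check (not part of coqui TTS): find transcript
# characters missing from the configured character set.
characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890àèìòùÀÈÌÒÙáéíóúÁÉÍÓÚî"
punctuations = " !,.?-'"
covered = set(characters) | set(punctuations)

def missing_chars(text):
    """Return the set of characters in `text` not covered by the config."""
    return set(text) - covered

sample = "Milano, diciannove gennaio mille ottocento ottantuno."
print(missing_chars(sample))  # set(), i.e. fully covered
```

Running this over every line of `test.txt` before training would catch stray symbols early.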

General config:

```python
config = VitsConfig(
    audio=audio_config,
    characters=character_config,
    run_name="vits_vctk",
    batch_size=16,
    eval_batch_size=4,
    num_loader_workers=4,
    num_eval_loader_workers=4,
    run_eval=True,
    test_delay_epochs=0,
    epochs=10,
    text_cleaner="multilingual_cleaners",
    use_phonemes=False,
    phoneme_language="it",
    phoneme_cache_path=os.path.join(output_path, "phoneme_cache"),
    compute_input_seq_cache=True,
    print_step=25,
    print_eval=False,
    save_best_after=1000,
    save_checkpoints=True,
    save_all_best=True,
    mixed_precision=True,
    max_text_len=250,
    output_path=output_path,
    datasets=[dataset_config],
    cudnn_benchmark=False,
    test_sentences=[
        "Qualcosa non va? Mi dispiace, hai voglia di parlarne a riguardo?",
        "Il mio nome è Remilia Scarlet. come posso aiutarti oggi?",
    ],
)
```
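For completeness, here is how this config is typically wired into a training run, following the VITS recipe shipped with coqui TTS. This is a sketch: it assumes the config objects above are defined and the dataset actually exists at the configured path, so it is not runnable on its own.

```python
# Sketch of the standard coqui TTS VITS training recipe; assumes the
# config objects above are defined and the dataset exists on disk.
from trainer import Trainer, TrainerArgs

from TTS.tts.datasets import load_tts_samples
from TTS.tts.models.vits import Vits
from TTS.tts.utils.text.tokenizer import TTSTokenizer
from TTS.utils.audio import AudioProcessor

# Audio processor and tokenizer are built from the config above.
ap = AudioProcessor.init_from_config(config)
tokenizer, config = TTSTokenizer.init_from_config(config)

# Load and split the samples described by dataset_config.
train_samples, eval_samples = load_tts_samples(dataset_config, eval_split=True)

model = Vits(config, ap, tokenizer, speaker_manager=None)

trainer = Trainer(
    TrainerArgs(),
    config,
    output_path,
    model=model,
    train_samples=train_samples,
    eval_samples=eval_samples,
)
trainer.fit()
```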

Expected behavior

No response

Logs

No response

Environment

- TTS version: 0.22.0
- Python version: 3.10.9
- OS: Windows
- CUDA version: 11.8
- GPU: GTX 1650 with 4GB of VRAM
- All the libraries were installed via pip

Additional context

Additionally, after a few days I tried to use espeak phonemes, but the trainer.fit() function gets stuck at the beginning with this output:

```
EPOCH: 0/10 --> /content/gdrive/MyDrive/tts/vits_vctk-October-09-2024_08+23PM-0000000

> DataLoader initialization
| > Tokenizer:
    | > add_blank: True
    | > use_eos_bos: False
    | > use_phonemes: True
| > phonemizer:
    | > phoneme language: it
    | > phoneme backend: espeak
| > Number of instances : 5798

/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py:557: UserWarning: This DataLoader will create 4 worker processes in total. Our suggested max number of worker in current system is 2, which is smaller than what this DataLoader is going to create. Please be aware that excessive worker creation might get DataLoader running slow or even freeze, lower the worker number to avoid potential slowness/freeze if necessary.
  warnings.warn(_create_warning_msg(
```

```
> TRAINING (2024-10-09 20:23:45)
 > Preprocessing samples
 | > Max text length: 167
 | > Min text length: 12
 | > Avg text length: 82.22473266643671
 | > Max audio length: 183618.0
 | > Min audio length: 24483.0
 | > Avg audio length: 82634.87443946188
 | > Num. instances discarded samples: 0
 | > Batch group size: 0.

/usr/local/lib/python3.10/dist-packages/torch/functional.py:666: UserWarning: stft with return_complex=False is deprecated. In a future pytorch release, stft will return complex tensors for all inputs, and return_complex=False will raise an error. Note: you can still call torch.view_as_real on the complex output to recover the old return format. (Triggered internally at ../aten/src/ATen/native/SpectralOps.cpp:873.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
```

iDavide commented 1 month ago

UPDATE: that wasn't a phonemes problem. The training wasn't loading because I wasn't using the Colab GPU. So I guess the problem is all in the configuration? I've now set epochs=2000 and I'll keep you updated if I come back with some results.

iDavide commented 1 month ago

I kept training the model and it has now done 33k steps. The avg_loss_1 curve on TensorBoard is converging to 25, but I'm noticing it improves very slowly... Is there a way to use Colab's GPU efficiently? Am I doing the training right, based on the configuration? How many steps should I run to get a good VITS model?

iDavide commented 3 weeks ago

I stopped training the model because I thought the dataset might not be sufficient (~8 hours of speech). I'll try with a bigger dataset such as MLS... Additionally, I came to a conclusion: since the TTS model generates mel spectrograms (not sure about that), I need to train an Italian vocoder model, but I don't really know how to do that for a specific language. Any comment is appreciated.

eginhard commented 3 weeks ago

Vits directly outputs audio because it trains its own vocoder internally; you don't need to train a separate one.
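For reference, once training produces a checkpoint you can synthesize audio from it directly, for example with the high-level `TTS.api` wrapper. The paths below are placeholders for your own run's output files:

```python
# Sketch: synthesizing directly from a trained VITS checkpoint with the
# high-level API. Paths are placeholders for your own run's outputs.
from TTS.api import TTS

tts = TTS(
    model_path="best_model.pth",  # checkpoint saved by the trainer
    config_path="config.json",    # config saved next to it
)
tts.tts_to_file(
    text="Il mio nome è Remilia Scarlet.",
    file_path="output.wav",
)
```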

iDavide commented 2 weeks ago

> Vits directly outputs audio because it trains its own vocoder internally; you don't need to train a separate one.

You saved me a lot of time, thank you. By the way, I'm now building the transcripts for the MLS dataset, deleting all the files with a male voice (I actually want to train a female voice model). But the thing is: when I train on Colab, the first epoch goes slowly while the next ones go really fast... and if I use a dataset with a lot of audio files, the epochs require more steps. That said, should I train on my GPU (GTX 1650) or Colab's GPU (free plan)?