Edresson / YourTTS

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

train.py #8

Closed stalevna closed 1 year ago

stalevna commented 2 years ago

Hello! I was wondering if you could kindly share your train.py?

Edresson commented 2 years ago

Hi,

The article was made using my Coqui TTS fork on the branch multilingual-torchaudio-SE.

To replicate the training, you can use this branch; with the config.json available with each checkpoint, run: python3 TTS/bin/train_tts.py --config_path config.json

If you want to use the latest version of Coqui TTS, you can get the config.json from the Coqui released model.

With config.json in hand, you first need to adjust some of its paths: for example, "datasets", "output_path" and "d_vector_file".
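For illustration, here is a minimal sketch of patching those paths programmatically; the key names follow the released config.json, while the file paths are placeholders you must replace:

import json

# Load the released config, patch the paths, and write it back.
with open("config.json", "r", encoding="utf-8") as f:
    config = json.load(f)

config["output_path"] = "/path/to/output/"  # placeholder
config["datasets"][0]["path"] = "/path/to/dataset/wavs/"  # placeholder
config["datasets"][0]["meta_file_train"] = "/path/to/dataset/metadata.txt"  # placeholder

# "d_vector_file" sits at the top level or under "model_args",
# depending on the Coqui TTS version.
target = config.get("model_args", config)
target["d_vector_file"] = "d_vector_file.json"

with open("config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=4)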

In "d_vector_file" you need to pass the speaker embeddings of the speakers. To extract the speaker's embeddings use the following command: python3 TTS/bin/compute_embeddings.py model_se.pth.tar config_se.json config.json d_vector_file.json

"model_se.pth.tar" and "config_se.json" can be found in Coqui released model while config.json is the config you set the paths for.

In Coqui TTS we provide a lot of "recipes" that can easily be used by beginners. But I normally don't use "recipes"; I prefer to train the models using the config.json. However, in the near future we intend to make training instructions available as "recipes".

stalevna commented 2 years ago

Thank you so much! You are simply the best!

Ca-ressemble-a-du-fake commented 2 years ago

This is a great answer, thank you! I could do transfer learning with the provided model and quickly have a French TTS. Cool!

annaklyueva commented 2 years ago

@Edresson Thank you for the answer!!!

I wanted to fine-tune the existing YourTTS model; do I understand correctly that the procedure is the same as for training?

Ca-ressemble-a-du-fake commented 2 years ago

@annaklyueva this is what I did (using the restore flag with the model_file provided in the released model) and it worked.
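For reference, that invocation looks like this (assuming the --restore_path argument of the Coqui trainer; the checkpoint filename is a placeholder): python3 TTS/bin/train_tts.py --config_path config.json --restore_path best_model.pth.tar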

annaklyueva commented 2 years ago

Thank you! @Ca-ressemble-a-du-fake

annaklyueva commented 2 years ago

Hello! @Ca-ressemble-a-du-fake I have some problems with the dataset for fine-tuning: AssertionError: [!] You do not have enough samples to train. You need at least 100 samples.

However, I have many more than 100 samples. This is how I filled my config file (the "datasets" part):

"datasets": [ { "name": "vctk", "path": "datasets/first_voice/wavs/", "meta_file_train": "datasets/first_voice/metadata.txt", "ununsed_speakers": [ "first" ], "language": "en", "meta_file_val": null, "meta_file_attn_mask": "" } ]

Maybe something is wrong with it? Could you please help?

Edresson commented 2 years ago

@annaklyueva You can work around it by setting "meta_file_val" to "datasets/first_voice/metadata.txt". In this way you will have the same data in training and validation; that is not recommended in most cases, but when you do not have many samples it can be the solution. It is weird that you have more than 100 samples and still receive this error. Please apply this workaround and check your training logs (especially the part with the dataset info).
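To see why the loader ends up with fewer than 100 samples, it can help to count the usable rows yourself. A minimal sketch, assuming an LJSpeech-style metadata.txt with pipe-separated "file_id|text" rows (adjust the separator and columns to your actual format):

import os

meta_file = "datasets/first_voice/metadata.txt"
wav_dir = "datasets/first_voice/wavs/"

usable = 0
with open(meta_file, "r", encoding="utf-8") as f:
    for line in f:
        file_id = line.split("|")[0].strip()
        wav_path = os.path.join(wav_dir, file_id + ".wav")
        if os.path.isfile(wav_path):
            usable += 1
        else:
            print("missing:", wav_path)

print("usable samples:", usable)

If this prints well over 100 while training still aborts, the formatter selected by "name" in the datasets config may be parsing the metadata file differently than you expect.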

annaklyueva commented 2 years ago

Good day!

I have several questions, could you please help?

  1. Do I understand correctly that if I want to train the model on another language it is better to fine-tune this model (YourTTS-EN(VCTK+LibriTTS)-PT-FR SCL): https://drive.google.com/drive/folders/15G-QS5tYQPkqiXfAdialJjmuqZV0azQV

Or is it better to use other checkpoints?

  2. How many hours of audio are needed to get reasonable quality?

  3. I planned to use the Common Voice Corpus to fine-tune the model on a new language; however, the audio format is mp3, not wav. Do I need to convert all the audio files, or can I use the mp3 format? If I need to convert, how?

Thank you in advance for your time!

annaklyueva commented 2 years ago

Good day, @Edresson !

Do I understand correctly that I don't need to change the config_se.json file? Or do I need to set its "output_path" and "audio" parameters?

I ask about the "audio" parameters because they differ from the ones in the config.json file. And could you advise whether I can change some audio parameters to resample the audio files, enable do_sound_norm and so on?

stalevna commented 2 years ago

Hi! Could you kindly say what the appropriate loss_gen values are when training the model for experiments 1-4?

VoxFurem commented 2 years ago

Hi, I tried to fine-tune and to restart the training from zero, but in both cases I am not able to replace your files in the YourTTS-zeroshot-VC-demo example with my own. I get the following error at the last cell: RuntimeError: Expected tensor for argument #1 'indices' to have one of the following scalar types: Long, Int; but got torch.cuda.FloatTensor instead (while checking arguments for embedding). Do you have any idea how to help me, please?

davidmartinrius commented 2 years ago

  3. I planned to use the Common Voice Corpus to fine-tune the model on a new language; however, the audio format is mp3, not wav. Do I need to convert all the audio files, or can I use the mp3 format? If I need to convert, how?

Hello!

You have two options:

1. You can use TTS/bin/resample.py to convert mp3 to wav. It works with multiple threads (as many as your CPU has, or fewer).

2. I am sharing some simple code to do it below. It is not the cleanest code, but it works. It also downmixes the wav to one channel (mono). You need to set the path to the tsv file and the path of the clips folder. It runs in one thread, but if you want you can adapt it to use multiple threads (see resample.py).

On the other hand, you could save the items list to a file and create your own method in TTS/tts/datasets/formatters.py to load that file with the json structure already processed. This way, when training a model there is no need to process the csv/tsv again (a sketch of such a formatter follows the script below).

import os
from os.path import exists

from pydub import AudioSegment

lang = "en"
root_path = f"/language_path/{lang}/"
sample_rate = 22050
clips_path = root_path + "clips"

txt_file = root_path + "validated.tsv"
items = []
with open(txt_file, "r", encoding="utf-8") as ttf:
    next(ttf)  # skip the tsv header row
    for line in ttf:
        cols = line.split("\t")
        speaker_name = cols[0]
        mp3_file = os.path.join(clips_path, cols[1])
        text = cols[2]

        if exists(mp3_file):
            wav_file = os.path.join(clips_path, cols[1].replace(".mp3", ".wav"))

            if not exists(wav_file):
                print("generating wav file " + wav_file)
                # Decode the mp3, resample it, downmix to mono, then save as wav.
                sound = AudioSegment.from_mp3(mp3_file)
                sound = sound.set_frame_rate(sample_rate)
                sound = sound.set_channels(1)
                sound.export(wav_file, format="wav")

            items.append({"text": text, "audio_file": wav_file, "speaker_name": speaker_name})

print()
print("TOTAL ITEMS", len(items))
print()
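Following up on the formatter idea above: if the script also dumps the items with json.dump at the end, a custom loader in TTS/tts/datasets/formatters.py can read them back directly. The function name here is illustrative; Coqui formatters receive the dataset path and meta file from config.json and return the item dicts:

import json
import os

def common_voice_preprocessed(root_path, meta_file, **kwargs):
    # Load the {"text", "audio_file", "speaker_name"} items saved by the
    # conversion script, so the tsv is not reprocessed on every run.
    with open(os.path.join(root_path, meta_file), "r", encoding="utf-8") as f:
        return json.load(f)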

TejaswiniiB commented 2 years ago

Good day, @Edresson!

Do I understand correctly that I don't need to change the config_se.json file? Or do I need to set its "output_path" and "audio" parameters?

I ask about the "audio" parameters because they differ from the ones in the config.json file. And could you advise whether I can change some audio parameters to resample the audio files, enable do_sound_norm and so on?

Hi @annaklyueva, what is the answer to your question? Did you figure it out? What do we need to change in config_se.json?

chigkim commented 2 years ago

I would love to see documentation/colab on training/fine-tuning for TTS and voice conversion!

davidmartinrius commented 2 years ago

I would love to see documentation/colab on training/fine-tuning for TTS and voice conversion!

Everything is already documented in the main project, Coqui TTS.

So you just need to go there and 🔥🔊

You have multiple ways to train your model, and you can train it with Glow TTS, VITS, etc.

https://tts.readthedocs.io/en/latest/training_a_model.html#

chigkim commented 2 years ago

Thanks, but the instructions on Coqui TTS are for TTS, not voice conversion, right?

If so, is there any colab/documentation on fine-tuning voice conversion?

Edresson commented 1 year ago

Thanks, but the instructions on Coqui TTS are for TTS, not voice conversion, right? If so, is there any colab/documentation on fine-tuning voice conversion?

Voice conversion and TTS instructions are the same. YourTTS is a TTS model that is able to do voice conversion.
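For example, with a recent Coqui TTS release the released YourTTS model handles both jobs; a minimal sketch using the high-level API (the file paths are placeholders):

from TTS.api import TTS

# Load the released multilingual YourTTS model.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Voice conversion: make source.wav sound like the speaker in target.wav.
tts.voice_conversion_to_file(
    source_wav="source.wav",
    target_wav="target.wav",
    file_path="converted.wav",
)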

chigkim commented 1 year ago

Ah, ok. Am I understanding correctly that I need to train a TTS model for the target speaker first, then do the conversion using that model? What about the driving speaker? Do I need to train a model for it? Or do I just need to provide a wav file, not a model?

WeberJulian commented 1 year ago

It's a zero-shot voice cloning model, so you don't need to train on either the target or the driving speaker (but that helps).
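Concretely, "zero-shot" means a few seconds of reference audio stand in for any speaker-specific training; a sketch with the same high-level Coqui API (paths are placeholders):

from TTS.api import TTS

tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Zero-shot TTS: clone the voice from reference.wav without any training.
tts.tts_to_file(
    text="This sentence is spoken in the cloned voice.",
    speaker_wav="reference.wav",
    language="en",
    file_path="cloned.wav",
)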