CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

How to train new model from Mozilla Common Voice? #677

Closed PaYo90 closed 3 years ago

PaYo90 commented 3 years ago

I want to train new models on Mozilla Common Voice, and I chose Polish. I have already changed the .tsv files to match the LibriSpeech style, "name_of_file.mp3 DESCRIPTION" (can someone tell me whether upper/lower case matters?). But is this for the synthesizer? The vocoder? I don't even know what these are, or why I need three of them. How can I train three models if Mozilla gives me only one dataset? Can someone tell me what I should do and how I should train this?

ghost commented 3 years ago

We do not support Mozilla Common Voice directly. Here is my advice.

  1. Process the dataset to look like LibriTTS: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-666099538 Then you can use our preprocessing scripts.
  2. Don't forget to edit symbols.py to include all the letters of the Polish alphabet.
  3. Try training the synthesizer only for now. Our pretrained English encoder and vocoder models should work well enough.
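Step 1 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the repo's own tooling: it assumes the Common Voice .tsv has `client_id`, `path` and `sentence` columns, and that the target layout is `LibriTTS/<subset>/<speaker>/<book>/<utterance>` with a same-named `.txt` transcript next to each audio file. The function name `convert_common_voice` and the folder names `speaker-0001` and `book-0001` are hypothetical. Audio is copied unchanged; whether the clips need converting to another format first depends on the preprocessing scripts in use.

```python
import csv
import shutil
from pathlib import Path


def convert_common_voice(tsv_path, clips_dir, out_root, subset="train-clean-100"):
    """Copy clips into <out_root>/LibriTTS/<subset>/<speaker>/<book>/ with .txt transcripts.

    Assumes the .tsv has 'client_id', 'path' and 'sentence' columns.
    Returns the number of utterances written.
    """
    tsv_path, clips_dir = Path(tsv_path), Path(clips_dir)
    subset_dir = Path(out_root) / "LibriTTS" / subset
    speaker_ids = {}  # Common Voice client_id -> short speaker folder name
    written = 0

    with open(tsv_path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quotation marks inside sentences as literal text.
        for row in csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE):
            text = row["sentence"].strip()
            src = clips_dir / row["path"]
            if not text or not src.is_file():
                continue  # skip rows with no transcript or no audio file

            # One folder per speaker; a single placeholder "book" per speaker.
            speaker = speaker_ids.setdefault(
                row["client_id"], f"speaker-{len(speaker_ids) + 1:04d}"
            )
            utt_dir = subset_dir / speaker / "book-0001"
            utt_dir.mkdir(parents=True, exist_ok=True)

            dst = utt_dir / src.name
            shutil.copy2(src, dst)
            # Transcript sits next to the audio file with the same stem.
            dst.with_suffix(".txt").write_text(text + "\n", encoding="utf-8")
            written += 1
    return written
```

A call such as `convert_common_voice("train.tsv", "clips", "datasets_root")` would then fill `datasets_root/LibriTTS/train-clean-100/`; the paths here are placeholders.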

This is all the help I can provide. Good luck.
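For the symbols.py step, a quick check of which transcript characters are not yet covered can save a failed training run. The sketch below is an assumption-heavy illustration: `BASE_CHARACTERS` is a guess at what an English-only symbol set looks like, the `"_"` and `"~"` padding and end-of-sequence markers are placeholders, and `build_symbols` and `missing_characters` are hypothetical helpers, so compare everything against the actual symbols.py before editing it.

```python
# Assumed English-only character set; compare with the real symbols.py.
BASE_CHARACTERS = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890!'\"(),-.:;? "
)
# The nine Polish letters with diacritics, upper and lower case.
POLISH_LETTERS = "ĄĆĘŁŃÓŚŹŻąćęłńóśźż"


def build_symbols(characters, extra=""):
    """Return placeholder pad/end markers plus each character once, in order."""
    seen = []
    for ch in characters + extra:
        if ch not in seen:
            seen.append(ch)
    return ["_", "~"] + seen


def missing_characters(transcripts, symbols):
    """Return the sorted characters used in the transcripts but absent from symbols."""
    known = set(symbols)
    return sorted({ch for text in transcripts for ch in text if ch not in known})
```

Running `missing_characters` over all transcripts with the current symbol list shows exactly which letters still need to be added; an empty result means every character is covered.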

wm1511 commented 3 years ago

@PaYo90 I have all the datasets prepared for training Polish, and a Polish model trained to 226k steps. If you want any help, tell me.

johndpope commented 3 years ago

Can you upload it to mega.nz? They have free 12 GB uploads.

PaYo90 commented 3 years ago

> @PaYo90 I have all the datasets prepared for training Polish, and a Polish model trained to 226k steps. If you want any help, tell me.

Yeah sure, please: skorpss[AT]gmail.com : ) If you could send it to me and share a couple of tips on how to do that, I'd appreciate it. I'd like to know the difference between the synthesizer, the vocoder and the third one.

wm1511 commented 3 years ago

Uploaded. I'll pass it on to anyone who needs it. This model was trained for the previous version of the synthesizer (Tensorflow). The quality isn't very good, because I couldn't find a sufficient number of speakers for the datasets, but I will do my best with the new synthesizer. ;)

PaYo90 commented 3 years ago

You can train a new one on Mozilla Common Voice.

wm1511 commented 3 years ago

I trained on Common Voice as well, but most of the recordings were too quiet or contained stretches of silence, and without alignments (which don't exist for this dataset) the results were worse than before training on it. So out of nearly 100 000 utterances from Common Voice (the "train" set), only around 7000, made by 35 speakers, were left.

ghost commented 3 years ago

@wm1511 I describe a way to automatically remove silence in #501. It's also built into the latest version of this repo (set trim_silence = True in synthesizer/hparams.py).
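To give a feel for what automatic silence removal does, here is a minimal energy-based sketch. It is not the method from #501 or the repo's `trim_silence` option, just a simple stand-in: it assumes samples are floats in [-1.0, 1.0], and the function names, frame length and threshold are illustrative values chosen for the example.

```python
import math


def frame_rms(samples, frame_len):
    """Root-mean-square level of each consecutive, non-overlapping frame."""
    levels = []
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        levels.append(math.sqrt(sum(s * s for s in frame) / len(frame)))
    return levels


def trim_quiet_edges(samples, frame_len=400, threshold=0.01):
    """Drop leading and trailing frames whose RMS is below threshold.

    Silence between loud frames is kept. Returns [] if every frame is quiet,
    which can double as a filter for recordings that are too quiet to use.
    """
    levels = frame_rms(samples, frame_len)
    loud = [i for i, level in enumerate(levels) if level >= threshold]
    if not loud:
        return []
    start = loud[0] * frame_len
    end = min((loud[-1] + 1) * frame_len, len(samples))
    return samples[start:end]
```

An empty result flags a clip as unusable, while a shorter non-empty result is the same clip with the quiet head and tail removed.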

wm1511 commented 3 years ago

@blue-fish great, thanks for the tip. I'll update the synthesizer and train a new model with this option enabled. I hope it will be significantly better quality.