CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Help making Italian Vocoder/Synthesizer #697

Closed xzVice closed 3 years ago

xzVice commented 3 years ago

Let's suppose I got the Italian dataset from here: http://www.openslr.org/94/ (the ASR one, in FLAC format). How am I supposed to create all the pretrained models from it (the .pt files for the vocoder, synthesizer and encoder)?

ghost commented 3 years ago

Please start by reading my advice on training. This contains the link to training documentation: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/431#issuecomment-673555684

If I were doing this, I would reuse the encoder and vocoder models. For the synthesizer, you have the option of training from scratch or finetuning the English model. Training from scratch should give better pronunciation and prosody. Finetuning will reduce training time and possibly have better voice similarity. If you finetune, modify the text cleaner to remove diacritics from vowels (change à to a, è and é to e, etc.). This is necessary since the English synthesizer does not include these characters in symbols.py.
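If it helps, here is a minimal sketch (standard-library only, not the repo's actual cleaners.py) of the kind of diacritic stripping described above, so the Italian text only uses characters already present in the English symbols.py:

```python
# Minimal sketch of a diacritic-stripping cleaner (uses Unicode decomposition;
# not the project's actual cleaner code).
import unicodedata

def strip_diacritics(text: str) -> str:
    # Decompose accented characters (e.g. "à" -> "a" + combining grave accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_diacritics("perché la città è più bella"))  # -> "perche la citta e piu bella"
```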

xzVice commented 3 years ago

Please start by reading my advice on training. This contains the link to training documentation: #431 (comment)

If I were doing this, I would reuse the encoder and vocoder models. For the synthesizer, you have the option of training from scratch or finetuning the English model. Training from scratch should give better pronunciation and prosody. Finetuning will reduce training time and possibly have better voice similarity. If you finetune, modify the text cleaner to remove diacritics from vowels (change à to a, è and é to e, etc.). This is necessary since the English synthesizer does not include these characters in symbols.py.

So, I tried doing what you told me to do and everything was going well until the synthesizer_train.py command... Here is the execution of all the commands from https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Training (up to the train one, of course, which threw the error). Any idea? 🤔

I also noticed some weird symbols inside the SV2TTS/synthesizer/train.txt file (screenshot attached). Is that normal? I tried editing the symbols.py/cleaners.py files, but that didn't fix it... anyway, this is probably not what's causing the crash of the train command...

C:\Users\Workspace\Desktop\Real-Time-Voice-Cloning>py -3.6 synthesizer_preprocess_audio.py datasets_root --datasets_name LibriTTS --subfolders testing --no_alignments
Arguments:
    datasets_root:   datasets_root
    out_dir:         datasets_root\SV2TTS\synthesizer
    n_processes:     None
    skip_existing:   False
    hparams:
    no_alignments:   True
    datasets_name:   LibriTTS
    subfolders:      testing

Using data from:
    datasets_root\LibriTTS\testing
LibriTTS: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:09<00:00,  9.52s/speakers]
The dataset consists of 9 utterances, 7450 mel frames, 1488960 audio timesteps (0.03 hours).
Max input length (text chars): 140
Max mel frames length: 889
Max audio timesteps length: 177600

C:\Users\Workspace\Desktop\Real-Time-Voice-Cloning>python synthesizer_preprocess_embeds.py datasets_root/SV2TTS/synthesizer
Arguments:
    synthesizer_root:      datasets_root\SV2TTS\synthesizer
    encoder_model_fpath:   encoder\saved_models\pretrained.pt
    n_processes:           4

Embedding:   0%|                                                                         | 0/9 [00:00<?, ?utterances/s]Loaded encoder "pretrained.pt" trained to step 1564501
Loaded encoder "pretrained.pt" trained to step 1564501
Loaded encoder "pretrained.pt" trained to step 1564501
Loaded encoder "pretrained.pt" trained to step 1564501
Embedding: 100%|█████████████████████████████████████████████████████████████████| 9/9 [00:05<00:00,  1.73utterances/s]

C:\Users\Workspace\Desktop\Real-Time-Voice-Cloning>python synthesizer_train.py testing datasets_root/SV2TTS/synthesizer
Arguments:
    run_id:          testing
    syn_dir:         datasets_root/SV2TTS/synthesizer
    models_dir:      synthesizer/saved_models/
    save_every:      1000
    backup_every:    25000
    force_restart:   False
    hparams:

Checkpoint path: synthesizer\saved_models\testing\testing.pt
Loading training data from: datasets_root\SV2TTS\synthesizer\train.txt
Using model: Tacotron
Using device: cpu

Initialising Tacotron Model...

Trainable Parameters: 30.876M

Starting the training of Tacotron from scratch

Using inputs from:
        datasets_root\SV2TTS\synthesizer\train.txt
        datasets_root\SV2TTS\synthesizer\mels
        datasets_root\SV2TTS\synthesizer\embeds
Found 9 samples
+----------------+------------+---------------+------------------+
| Steps with r=2 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
|   20k Steps    |     12     |     0.001     |        2         |
+----------------+------------+---------------+------------------+

Traceback (most recent call last):
  File "synthesizer_train.py", line 35, in <module>
    train(**vars(args))
  File "C:\Users\Workspace\Desktop\Real-Time-Voice-Cloning\synthesizer\train.py", line 158, in train
    for i, (texts, mels, embeds, idx) in enumerate(data_loader, 1):
  File "C:\Users\Workspace\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 355, in __iter__
    return self._get_iterator()
  File "C:\Users\Workspace\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 301, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "C:\Users\Workspace\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\utils\data\dataloader.py", line 914, in __init__
    w.start()
  File "C:\Users\Workspace\AppData\Local\Programs\Python\Python36\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "C:\Users\Workspace\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\Workspace\AppData\Local\Programs\Python\Python36\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\Users\Workspace\AppData\Local\Programs\Python\Python36\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\Workspace\AppData\Local\Programs\Python\Python36\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'train.<locals>.<lambda>'

C:\Users\Workspace\Desktop\Real-Time-Voice-Cloning>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\Workspace\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\Users\Workspace\AppData\Local\Programs\Python\Python36\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

ghost commented 3 years ago

I don't have time to fully troubleshoot issues, but this may help. If not, you'll need to figure it out yourself.

Weird characters in train.txt

Problem may be coming from this line, which reads the transcripts: https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/b5ba6d0371882dbab595c48deb2ff17896547de7/synthesizer/preprocess.py#L77

Try adding utf-8 file encoding.

with text_fpath.open("r", encoding="utf-8") as text_file:
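In case it helps explain the weird symbols in train.txt: a quick, self-contained illustration (assuming the transcript files are UTF-8 and Windows falls back to cp1252 when no encoding is given) of how accented Italian characters turn into mojibake:

```python
# UTF-8 text decoded with the Windows default codec (cp1252) produces exactly
# this kind of garbling; passing encoding="utf-8" when opening the file avoids it.
text = "perché città più"
garbled = text.encode("utf-8").decode("cp1252")
print(garbled)                                    # e.g. "perchÃ© cittÃ\xa0 piÃ¹"
print(garbled.encode("cp1252").decode("utf-8"))   # recovers "perché città più"
```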

Error running synthesizer_train.py

For a solution to:

AttributeError: Can't pickle local object 'train.<locals>.<lambda>'
EOFError: Ran out of input

Please see https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/669#issuecomment-781130738 for a workaround. We set num_workers=0 on Windows.
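For context, a minimal self-contained example (not the repo's code) of why the workaround works: a lambda collate_fn cannot be pickled when the DataLoader spawns worker processes on Windows, but with num_workers=0 everything stays in the main process and nothing needs to be pickled:

```python
# Toy illustration of the DataLoader pickling issue and the num_workers=0 workaround.
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __len__(self):
        return 4
    def __getitem__(self, i):
        return torch.tensor([i])

if __name__ == "__main__":
    # With num_workers > 0 on Windows, the DataLoader is pickled for each
    # spawned worker, and the lambda collate_fn triggers the same class of
    # "Can't pickle" error as in the traceback above.
    loader = DataLoader(ToyDataset(),
                        batch_size=2,
                        collate_fn=lambda batch: torch.stack(batch),
                        num_workers=0)  # main-process loading: works everywhere
    for batch in loader:
        print(batch.shape)  # torch.Size([2, 1])
```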

xzVice commented 3 years ago

Thanks! Both errors are solved now... but it's really slow (the 20,000-step train command)... also, I don't know why it says Using device: cpu even though I installed the latest CUDA toolkit and I have a GTX 1050 Ti...

xzVice commented 3 years ago

Never mind, I had the CPU version of PyTorch installed...
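For anyone hitting the same thing, a quick check (plain PyTorch, nothing repo-specific) shows whether the installed build can actually see the GPU:

```python
# A CPU-only PyTorch wheel reports a "+cpu" version suffix and no CUDA support.
import torch
print(torch.__version__)           # e.g. "1.x.x+cpu" means the CPU-only build is installed
print(torch.cuda.is_available())   # must be True for training to use the GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. "GeForce GTX 1050 Ti"
```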

AVTV64 commented 3 years ago

Let's suppose I got the Italian dataset from here: http://www.openslr.org/94/ (the ASR one, in FLAC format). How am I supposed to create all the pretrained models from it (the .pt files for the vocoder, synthesizer and encoder)?

Hi, can you release the Italian models you trained? How do I set them up? I want to clone voices in this language.

frossi65 commented 3 years ago

@arianaglande Hello, I am looking for Italian models. Let me know if I can help train the model. I have an RTX 2070 GPU.

FedericoFedeFede commented 3 years ago

@arianaglande I'm also looking for it. If you managed to do that, it would be very helpful to share it with us. Thanks

TalissaDreossi commented 3 years ago

I'm trying to do the same, and as @blue-fish said (if I understood correctly) I just need to train the synthesizer, so I have to skip the first steps in https://github.com/CorentinJ/Real-Time-Voice-Cloning/wiki/Training#datasets until I reach:
"Begin with the audios and the mel spectrograms:
python synthesizer_preprocess_audio.py".
Is that right? If so, how do I have to structure my dataset? I have downloaded the Italian one from http://www.openslr.org/94/ but I don't know whether I have to preprocess it before running the instruction above (in other words, I don't know what is expected in _). Thanks in advance
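In case it helps anyone preparing the MLS Italian data, here is a rough sketch (a hypothetical helper, not part of the repo) of one way to rearrange the download into the LibriTTS-style layout that synthesizer_preprocess_audio.py expects with --no_alignments, i.e. datasets_root/LibriTTS/<subfolder>/<speaker>/<book>/ with, per utterance, an audio file and (as far as I can tell) a matching .txt transcript next to it. It assumes the MLS archive layout of train/audio/<speaker>/<book>/*.flac plus a train/transcripts.txt with tab-separated "<utterance_id> <transcript>" lines:

```python
# Hypothetical conversion script (assumptions: MLS directory layout and a
# tab-separated transcripts.txt; adjust the paths to your own setup).
import shutil
from pathlib import Path

mls_root = Path("mls_italian/train")                     # extracted MLS split (assumption)
out_root = Path("datasets_root/LibriTTS/train-italian")  # value later passed to --subfolders

with (mls_root / "transcripts.txt").open("r", encoding="utf-8") as f:
    for line in f:
        utt_id, transcript = line.rstrip("\n").split("\t", 1)
        speaker, book, _ = utt_id.split("_", 2)          # ids look like <speaker>_<book>_<utt>
        src = mls_root / "audio" / speaker / book / f"{utt_id}.flac"
        dst_dir = out_root / speaker / book
        dst_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy(src, dst_dir / src.name)
        # One transcript file next to each audio file, as the preprocess script expects.
        (dst_dir / f"{utt_id}.txt").write_text(transcript, encoding="utf-8")
```

After that, a command along the lines of the one used earlier in this thread should pick it up: python synthesizer_preprocess_audio.py datasets_root --datasets_name LibriTTS --subfolders train-italian --no_alignments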

alessandrolamberti commented 2 years ago

@arianaglande Hi, how did you manage to preprocess the Italian dataset into the format the scripts accept?

Alex2610 commented 1 year ago

Can someone please upload the pretrained models?