CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Training a new encoder model #458

Closed ghost closed 3 years ago

ghost commented 4 years ago

In #126 it is mentioned that most of the ability to clone voices lies in the encoder. @mbdash is contributing a GPU to help train a better encoder model.

Instructions

  1. Download the LibriSpeech train-other-500 dataset and the VoxCeleb1 and VoxCeleb2 datasets. Extract them into your <datasets_root> folder as follows:
    • LibriSpeech: train-other-500 (extract as LibriSpeech/train-other-500)
    • VoxCeleb1: Dev A - D as well as the metadata file (extract as VoxCeleb1/wav and VoxCeleb1/vox1_meta.csv)
    • VoxCeleb2: Dev A - H (extract as VoxCeleb2/dev)
  2. Change model_hidden_size to 768 in encoder/params_model.py
  3. python encoder_preprocess.py <datasets_root>
  4. Open a separate terminal and start visdom
  5. python encoder_train.py new_model_name <datasets_root>/SV2TTS/encoder
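
Before running step 3, it may help to confirm that the extracted folders match the layout listed in step 1. The following is a minimal sketch, not part of the repository: the function name `missing_dataset_paths` and the constant `EXPECTED_PATHS` are hypothetical, and the relative paths are copied from the instructions above.

```python
from pathlib import Path

# Relative paths copied from the extraction instructions in step 1.
EXPECTED_PATHS = [
    "LibriSpeech/train-other-500",
    "VoxCeleb1/wav",
    "VoxCeleb1/vox1_meta.csv",
    "VoxCeleb2/dev",
]


def missing_dataset_paths(datasets_root):
    """Return the expected relative paths that do not exist under datasets_root.

    Hypothetical helper for illustration only; it checks existence,
    not the contents of each dataset.
    """
    root = Path(datasets_root)
    return [rel for rel in EXPECTED_PATHS if not (root / rel).exists()]
```

Calling `missing_dataset_paths("<datasets_root>")` with your own path returns an empty list when all four locations are present; anything it returns still needs to be downloaded or extracted.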

CorentinJ commented 4 years ago

Hey guys, just went through this thread quickly.

I indeed removed the ReLU layer in the voice encoder we use at Resemble.AI. I think the model on Resemblyzer still has it. I planned to release a new one which, among other things, wouldn't be trained on data that would have silences at the start and end of each clip.

I don't think I'll update the code in this repo, but I should update the code on Resemblyzer when the new model's released.
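
To make the point about silences at the start and end of each clip concrete, here is a rough sketch of edge trimming with a plain amplitude threshold. This is an assumption for illustration only: it is not the method used at Resemble.AI or in this repo, and the function name `trim_edge_silence` and the default threshold of 0.01 are made up.

```python
def trim_edge_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose absolute amplitude is below threshold.

    samples: a list of floats, assumed to be normalized to the range [-1, 1].
    Quiet samples in the middle of the clip are left untouched.
    """
    start = 0
    end = len(samples)
    # Advance past quiet samples at the start of the clip.
    while start < end and abs(samples[start]) < threshold:
        start += 1
    # Step back over quiet samples at the end of the clip.
    while end > start and abs(samples[end - 1]) < threshold:
        end -= 1
    return samples[start:end]
```

A real pipeline would more likely work on frames or use a voice activity detector rather than single samples, so treat the threshold approach as the simplest possible stand-in.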

mbdash commented 4 years ago

Hi, we are making a group effort to build a new dataset that is curated and cleaned: quality over quantity.

VoxCeleb is rejected due to its horrible quality. VCTK might eventually have some bits in it. CommonVoice will be part of it. LibriTTS 100 / 360 / 500 will mostly be the base. (1st iteration)

Join the Slack for more info.

lnguyen commented 4 years ago

@mbdash what slack?

ghost commented 4 years ago

Anyone who wants to contribute in some way to the RTVC project is welcome to join the Slack. Leave a comment in #474 and we will provide an invite link.

CorentinJ commented 4 years ago

"VoxCeleb is rejected due to its horrible quality."

The idea is to have a dataset with low quality though

ghost commented 4 years ago

Mozilla TTS is also developing a speaker encoder in https://github.com/mozilla/TTS/issues/512. I am inviting Mozilla TTS contributors to this discussion to see if we can decide on a common model structure. We could also share thoughts on datasets and preprocessing techniques. In the best case, we could even share the model.

mbdash commented 3 years ago

A small update for anyone watching this thread. @steven and @blue-fish are doing some experimentation with training. I am currently cleaning up datasets to remove noise and artifacts from the source data used to train the models.

I saw @CorentinJ's comment:

"VoxCeleb is rejected due to its horrible quality."

"The idea is to have a dataset with low quality though"

We are just playing around, pooling our resources and experimenting to improve audio output quality as well as adding some punctuation support.

I am done with LibriTTS60/train-clean-100 and am progressing through 360.

sberryman commented 3 years ago

@mbdash I believe Corentin was referring to low-quality audio (background noise, static, etc.) being important while training the encoder. Clean audio is important for synthesis.

ghost commented 3 years ago

In the SV2TTS paper it is stated that "the audio quality [for encoder training] can be lower than for TTS training", but between this and the GE2E paper I have not seen a statement that it should be of lower quality. The noise might train the network to distinguish speakers based on features that humans can perceive, as opposed to subtler differences that can be found in clean audio. But I think it's worth running the experiment to see if this is truly the case.

sberryman commented 3 years ago

@blue-fish I completely agree, running the experiment is the best option.

ghost commented 3 years ago

If anyone is wondering, training is still paused while @mbdash is denoising datasets and @steven850 is doing trial runs to determine the best hparams for the encoder. It's a slow process. We plan to swap out the ReLU for a Tanh activation, but will try to match the model structure of the updated Resemblyzer encoder if it is released.
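
For anyone curious what the activation swap changes, here is a small pure-Python sketch of one property, under the assumption that the final activation is followed by L2 normalization and that embeddings are compared with cosine similarity. With ReLU every embedding component is non-negative, so the similarity between two embeddings can never go below 0; with Tanh the full range down to -1 is available. The thread does not state that this is the motivation, and all function names below are hypothetical, not taken from the repo.

```python
import math


def relu(values):
    """Element-wise ReLU: negative components become 0."""
    return [max(0.0, v) for v in values]


def tanh(values):
    """Element-wise Tanh: components are squashed into (-1, 1), sign preserved."""
    return [math.tanh(v) for v in values]


def l2_normalize(values):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values] if norm > 0 else list(values)


def embed(pre_activation, activation):
    """Apply the final activation, then L2-normalize, as assumed above."""
    return l2_normalize(activation(pre_activation))


def cosine_similarity(a, b):
    """Dot product of two unit-length vectors."""
    return sum(p * q for p, q in zip(a, b))


# Two made-up pre-activation vectors that point in opposite directions.
x = [1.0, -2.0, 0.5]
y = [-1.0, 2.0, -0.5]

sim_relu = cosine_similarity(embed(x, relu), embed(y, relu))
sim_tanh = cosine_similarity(embed(x, tanh), embed(y, tanh))
```

In this toy case `sim_relu` cannot be negative, while `sim_tanh` comes out at (approximately) -1 because Tanh keeps the sign of every component.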

ghost commented 3 years ago

It's going to be a while before we get back to encoder training. I'm going to close this issue for now. Will reopen when we restart.

webbrows commented 3 years ago

  • VoxCeleb1: Dev A - D as well as the metadata file (extract as VoxCeleb1/wav and VoxCeleb1/vox1_meta.csv)
  • VoxCeleb2: Dev A - H (extract as VoxCeleb2/dev)

Hmm, which of these files do I need to download? The site has both VoxCeleb metadata and audio files.