CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Training for a Single Voice after the Update #1041

Open baljeetrathi opened 2 years ago

baljeetrathi commented 2 years ago

Hi,

I want to use the trainer for cloning only my voice. The language would still be English but a different accent than the pre-trained models. Will the instructions mentioned here still work to get good results: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437?

I have a few more questions to get started.

  1. How much data do I need to train for English with a different accent?
  2. The guide here (https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437) mentions a zip file on Dropbox. However, that link no longer works: https://www.dropbox.com/s/bf4ti3i1iczolq5/logs-singlespeaker.zip?dl=0. Is there a new link somewhere else?
  3. In another issue (https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/832), @ghost mentions that:

As you found, the instructions in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437 are for use with the old tensorflow synthesizer that was in use at the time. The new code does not take an explicit command line argument to evaluate every X steps. Instead that is set with hparams.tts_eval_interval, which can be overridden at the command line.

Are the instructions mentioned in 437 no longer valid?

  4. I have the following specs:

     - 16 GB RAM
     - Intel Core i5 9400F
     - Nvidia 1050 Ti

Is that sufficient for training a single voice model?

Thanks. :)
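For what it's worth, the `hparams.tts_eval_interval` override quoted from #832 would look something like this at the command line. The run name and dataset path below are placeholders, and exact flag handling can differ between versions of the repo:

```shell
# Hypothetical invocation: override tts_eval_interval when training.
# --hparams takes comma-separated key=value pairs in recent versions;
# check `python synthesizer_train.py --help` for your checkout.
python synthesizer_train.py my_run datasets/SV2TTS/synthesizer \
    --hparams "tts_eval_interval=100"
```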

baljeetrathi commented 2 years ago

Hi @sveneschlbeck Would it be possible for you to guide me here?

Thanks. :)

baljeetrathi commented 2 years ago

Hi @raccoonML and @ireneb612, you also seem to know how to train a model yourselves. Could you help me here?

Thanks. :)

ireneb612 commented 2 years ago

What I would do is use the pretrained models that work well for English, and then fine-tune on 12 minutes of your voice! You just have to put the data in the right format and run synthesizer_preprocess_audio, synthesizer_preprocess_embeds, and synthesizer_train.

I personally used the repository with the older directory setup for the saved models, but it's not a big difference; the paths to the saved models are now all in the same directory.
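The three steps above can be sketched as shell commands. The `datasets_root` path and the `my_voice` run name are placeholders, and flag names may vary by commit, so check each script's `--help`:

```shell
# Hedged sketch of the fine-tuning pipeline described above.
# 1) Extract audio and mel spectrograms from your dataset
python synthesizer_preprocess_audio.py datasets_root
# 2) Generate a speaker embedding for each utterance
python synthesizer_preprocess_embeds.py datasets_root/SV2TTS/synthesizer
# 3) Fine-tune the synthesizer on the preprocessed data
python synthesizer_train.py my_voice datasets_root/SV2TTS/synthesizer
```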

baljeetrathi commented 2 years ago

Thank you very much @ireneb612 . :)

I cloned the repo and my current directory structure is like this:

encoder
samples
synthesizer
toolbox
vocoder
synthesizer_preprocess_audio.py
synthesizer_preprocess_embeds.py
synthesizer_train.py
etc.

Issue 437 mentions the following directions for training:

Here is a [preprocessed p240 dataset](https://www.dropbox.com/s/qskoopjcdjdwuvw/dataset_p240.zip?dl=0) if you would like to repeat this experiment. The embeds for utterances 002-380 are overwritten with the one for 001, as the hardcoding makes for a more consistent result. Use the audio file p240_001.flac to generate embeddings for inference. The audios are not included to keep the file size down, so if you care to do vocoder training you will need to get and preprocess VCTK.

Directions:

    Copy the folder synthesizer/saved_models/logs-pretrained to logs-vctkp240 in the same location. This will make a copy of your pretrained model to be finetuned.
    Unzip the dataset files to datasets_p240 in your Real-Time-Voice-Cloning folder (or somewhere else if you desire)
    Train the model: python synthesizer_train.py vctkp240 dataset_p240/SV2TTS/synthesizer --checkpoint_interval 100
    Let it run for 200 to 400 iterations, then stop the program.
        This should complete in a reasonable amount of time even on CPU.
        You can safely stop and resume training at any time though you will lose all progress since the last checkpoint
    Test the finetuned model in the toolbox using dataset_p240/p240_001.flac to generate the embedding

but the link no longer works so I couldn't figure out the proper format for the files. Could you please help me with that?
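For reference, the preprocessed folder that `synthesizer_train.py` is pointed at is normally produced by the two preprocessing scripts rather than assembled by hand. A sketch of the layout, based on the SV2TTS convention this repo follows (exact file naming may differ by version):

```shell
# Hedged sketch of the directory structure the synthesizer training
# step expects; the preprocessing scripts create and populate these.
mkdir -p dataset_p240/SV2TTS/synthesizer/audio   # processed audio clips
mkdir -p dataset_p240/SV2TTS/synthesizer/mels    # mel spectrograms
mkdir -p dataset_p240/SV2TTS/synthesizer/embeds  # speaker embeddings
touch dataset_p240/SV2TTS/synthesizer/train.txt  # index: one utterance per line
```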

Thanks again. :)

ireneb612 commented 2 years ago

https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/819#issue-970736011

I used this issue to preprocess the Mozilla Common Voice dataset!

samoliverschumacher commented 1 year ago

I've made public a repo with a workflow for creating a dataset to perform synthesizer fine-tuning.

Not sure if this is the best place to let people know, but hopefully it helps someone.