CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Training a new encoder model #458

Closed. ghost closed this issue 3 years ago

ghost commented 4 years ago

In #126 it is mentioned that most of the ability to clone voices lies in the encoder. @mbdash is contributing a GPU to help train a better encoder model.

Instructions

  1. Download the LibriSpeech train-other-500 and VoxCeleb 1/2 datasets. Extract them to your <datasets_root> folder as follows:
    • LibriSpeech: train-other-500 (extract as LibriSpeech/train-other-500)
    • VoxCeleb1: Dev A - D as well as the metadata file (extract as VoxCeleb1/wav and VoxCeleb1/vox1_meta.csv)
    • VoxCeleb2: Dev A - H (extract as VoxCeleb2/dev)
  2. Change model_hidden_size to 768 in encoder/params_model.py (see the snippet after this list)
  3. python encoder_preprocess.py <datasets_root>
  4. Open a separate terminal and start visdom
  5. python encoder_train.py new_model_name <datasets_root>/SV2TTS/encoder
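
For reference, the relevant block of encoder/params_model.py should look roughly like this after step 2 (only model_hidden_size changes; the other values are shown for context and are assumed unchanged):

## Model parameters
model_hidden_size = 768     # increased from 256 for this experiment
model_embedding_size = 256  # the final embedding size stays at 256
model_num_layers = 3
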
ghost commented 4 years ago

@mbdash I will be unavailable this weekend. Hopefully the commands will just work. You can reach out to the community for help, in particular @sberryman who has gone through this in #126 .

There are two things that I hope to learn from training this new encoder model.

  1. Does the voice cloning improve with the new model? (i.e. does increasing the hidden layer size from 256 to 768 make a difference when the output is still projected down to 256 at the end)
  2. Will the new encoder model be compatible with the existing synthesizer?

My hypothesis is that for 1, we will not see a difference unless we also retrain the synth with many more voices. And for 2, that it should be compatible since the dimensions, input data and loss function are not changing. I may very well be wrong on that since I have not studied the encoder in detail.

mbdash commented 4 years ago

update: still preparing the data. I might start the training tomorrow.

mbdash commented 4 years ago

Preprocessing started at 10h00 EST on 2020-08-03.

Question: why do I need to start visdom?

ghost commented 4 years ago

It's optional, but starting a visdom server allows you to visualize the training results by navigating to http://localhost:8097

The UMAP projections will let us know whether the encoder has learned to distinguish between the voices in the training set. This in turn helps us decide when to stop training.
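
If you want a quick sanity check before launching training, something like this will confirm the training script can reach visdom (a minimal sketch; it assumes a server was already started separately, e.g. with python -m visdom.server, on the default port):

import visdom

# Connect to an already-running visdom server on the default port.
vis = visdom.Visdom(server="http://localhost", port=8097)
print("visdom reachable:", vis.check_connection())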

mbdash commented 4 years ago

Update: I got a crash. I have to figure out what happened.

The dataset resides on an NFS share on my Unraid host. Many TB available, so it is not a lack of space for the dataset. I will force chown -R user and chmod -R 766 on the whole dataset and try again.

[screenshot of the error]

ghost commented 4 years ago

By any chance did you have the text file (<datasets_root>/LibriSpeech/_sources.txt) open in a viewer?

mbdash commented 4 years ago

Nope. It might have been a hiccup due to using an NFS share.

I noticed these file permissions: -rw-r--r-- 1 99 users Log_LibriSpeech_train-other-500.txt

Rights inheritance might have caused some issues since my user would fall under group:users. Changing the dataset root folder owner recursively instead of relying on group membership should fix the issue.

I'll keep you posted on updates.

mbdash commented 4 years ago

I fixed the previous error (see bottom of comment) but I got another crash in VoxCeleb2:

[screenshot of the error]

Here is my current pysoundfile version: [screenshot]

Here are the last files processed:

drwxr-xr-x 1    99 users    8676 Aug  3 21:08 VoxCeleb1_wav_id11249
drwxr-xr-x 1    99 users    3594 Aug  3 20:48 VoxCeleb1_wav_id11250
drwxr-xr-x 1    99 users    2586 Aug  3 20:56 VoxCeleb1_wav_id11251
drwxr-xr-x 1    99 users      24 Aug  3 21:34 VoxCeleb2_dev_aac_id00517
drwxr-xr-x 1    99 users     948 Aug  3 21:34 VoxCeleb2_dev_aac_id00906
drwxr-xr-x 1    99 users     948 Aug  3 21:34 VoxCeleb2_dev_aac_id00924
drwxr-xr-x 1    99 users     864 Aug  3 21:34 VoxCeleb2_dev_aac_id01184
drwxr-xr-x 1    99 users     192 Aug  3 21:34 VoxCeleb2_dev_aac_id02074
drwxr-xr-x 1    99 users     570 Aug  3 21:34 VoxCeleb2_dev_aac_id02477
drwxr-xr-x 1    99 users    1074 Aug  3 21:34 VoxCeleb2_dev_aac_id03184
drwxr-xr-x 1    99 users     948 Aug  3 21:34 VoxCeleb2_dev_aac_id03701
drwxr-xr-x 1    99 users    1074 Aug  3 21:34 VoxCeleb2_dev_aac_id04961
drwxr-xr-x 1    99 users     948 Aug  3 21:34 VoxCeleb2_dev_aac_id06261
drwxr-xr-x 1    99 users     318 Aug  3 21:34 VoxCeleb2_dev_aac_id07417
drwxr-xr-x 1    99 users     108 Aug  3 21:34 VoxCeleb2_dev_aac_id07531

For the previous crash, my best guess is that in encoder/preprocess.py the log file handle is kept open for too long (1h30min+) and ends up in a bad state. So I made some local mods to only open the log file for writing during the init, and then reopen it in append mode for each write and for finalizing.

# Imports used by this snippet (already at the top of encoder/preprocess.py):
from datetime import datetime
from pathlib import Path
import numpy as np

class DatasetLog:
    def __init__(self, root, name):
        self.fpath = Path(root, "Log_%s.txt" % name.replace("/", "_"))
        self.sample_data = dict()
        start_time = str(datetime.now().strftime("%A %d %B %Y at %H:%M"))
        with open(self.fpath, "w") as f:
            self.write_line("Creating dataset %s on %s" % (name, start_time), file_handle=f)
            self.write_line("-----", file_handle=f)
            self._log_params(file_handle=f)

    def _log_params(self, file_handle):
        from encoder import params_data
        self.write_line("Parameter values:", file_handle=file_handle)
        for param_name in (p for p in dir(params_data) if not p.startswith("__")):
            value = getattr(params_data, param_name)
            self.write_line("\t%s: %s" % (param_name, value), file_handle=file_handle)
        self.write_line("-----", file_handle=file_handle)

    def write_line(self, line, file_handle=None):
        if file_handle:
            file_handle.write("%s\n" % line)
        else:
            with open(self.fpath, "a") as f:
                f.write("%s\n" % line)

    def add_sample(self, **kwargs):
        for param_name, value in kwargs.items():
            if not param_name in self.sample_data:
                self.sample_data[param_name] = []
            self.sample_data[param_name].append(value)

    def finalize(self):
        with open(self.fpath, "a") as f:
            self.write_line("Statistics:", file_handle=f)
            for param_name, values in self.sample_data.items():
                self.write_line("\t%s:" % param_name, file_handle=f)
                self.write_line("\t\tmin %.3f, max %.3f" % (np.min(values), np.max(values)), file_handle=f)
                self.write_line("\t\tmean %.3f, median %.3f" % (np.mean(values), np.median(values)), file_handle=f)
            self.write_line("-----", file_handle=f)
            end_time = str(datetime.now().strftime("%A %d %B %Y at %H:%M"))
            self.write_line("Finished on %s" % end_time, file_handle=f)
ghost commented 4 years ago

@mbdash Searching on the error message, I came across the suggestion to convert the m4a files to wav, which should fix the problem for VoxCeleb2: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/76#issuecomment-529013562

mbdash commented 4 years ago

@blue-fish I will try to convert them when I have some time. I'll keep you posted.

mbdash commented 4 years ago

@blue-fish m4a to wav conversion in progress (I was out of commission for a few days). 24k files done as I write this.

ghost commented 4 years ago

@mbdash Although it is preferable to change just one variable with our training experiment, we know that the encoder gets better with more voices so I would like to suggest including the Mozilla CommonVoice dataset, which has over 60k unique English speakers: https://voice.mozilla.org/en

Let's try to incorporate this one if you have the time and patience to preprocess it. @sberryman has written a snippet of code for just that purpose: https://github.com/sberryman/Real-Time-Voice-Cloning/blob/d6ba3e1ec0f950636e9cac3656c0be5c331821cc/encoder/preprocess.py#L224-L244

ghost commented 4 years ago

Some thoughts on the encoder model

Maybe a better encoder is not needed after all, depending on the objective. Although the SV2TTS paper demonstrated the possibility of high-quality zero-shot cloning, I think what most people are after is a high-quality single-speaker TTS. If that is the objective, we have demonstrated in #437 that a decent single-speaker model can be finetuned from the pretrained models with significantly less effort than traditional TTS models. The required dataset goes from 10+ hours to about 10 minutes, a reduction of nearly 2 orders of magnitude.

For this purpose, the speaker encoder acts as a starting point for the finetuning task and the quality of encoding mainly determines how much finetuning is needed. The best case is that no additional training is needed, i.e. high-quality zero shot voice cloning per the SV2TTS paper. The worst case is bounded by the 10+ hours needed to train a single-speaker TTS.

With a better encoder and synthesizer, the required dataset for finetuning can realistically only go down by 1 order of magnitude: just 1 minute of audio. A reduction of 2 orders of magnitude (a dataset of 10 seconds) is equivalent to zero-shot in terms of performance.

While the idea of making a voice with just 1 minute of training data is more appealing than the current 10 minutes, is it an order of magnitude improvement from the perspective of the end user? Or in other words, how much effort is appropriate for the encoder given the potential improvement to be had? Arguably, the encoder is already good enough and our limited resources are better spent on the synthesizer which has a lot of known issues.

mbdash commented 4 years ago

I have begun downloading the Mozilla CommonVoice dataset and will add it to the encoder pretraining.

I am adding the preprocessing function to my version of encoder/preprocess.py (note that I hardcoded a default fallback value with lang = lang or 'en'):

def preprocess_commonvoice(datasets_root: Path, out_dir: Path, lang=None, skip_existing=False):
    lang = lang or 'en'    
    # simple dataset path
    dataset_name = "CommonVoice/{0}/speakers".format(lang)

    # Initialize the preprocessing
    dataset_root, logger = _init_preprocess_dataset(dataset_name, datasets_root, out_dir)
    if not dataset_root:
        return

    # Preprocess all speakers
    speaker_dirs = sorted(list(dataset_root.glob("*")))

    # speaker_dirs = speaker_dirs[0:4000] (complete)
    # speaker_dirs = speaker_dirs[4000:5000] (complete)
    # speaker_dirs = speaker_dirs[5000:7000] (complete)
    # speaker_dirs = speaker_dirs[7000:8000] (complete)
    # speaker_dirs = speaker_dirs[8000:9000] (in-progress)
    # speaker_dirs = speaker_dirs[9000:] (in-progress)

    _preprocess_speaker_dirs(speaker_dirs, dataset_name, datasets_root, out_dir, "wav",
                             skip_existing, logger)

I also updated my encoder_preprocess.py accordingly.

I will keep you guys updated (and will attempt to push my changes when done).
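
For anyone following along, the change to encoder_preprocess.py is essentially one more entry in its dataset dispatch. A rough sketch (the dict and dataset names are illustrative of the existing pattern, with the new function from the snippet above):

from encoder.preprocess import (preprocess_librispeech, preprocess_voxceleb1,
                                preprocess_voxceleb2, preprocess_commonvoice)

# Map the dataset names accepted on the command line to their preprocessing functions.
preprocess_func = {
    "librispeech_other": preprocess_librispeech,
    "voxceleb1": preprocess_voxceleb1,
    "voxceleb2": preprocess_voxceleb2,
    "commonvoice": preprocess_commonvoice,  # new entry for the snippet above
}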

ghost commented 4 years ago

Following up on my earlier comments, this table is from 1806.04558 (the SV2TTS paper):

[screenshot: table from the SV2TTS paper]

"Finally, we note that the proposed model, which uses a speaker encoder trained separately on a corpus of 18K speakers, significantly outperforms all baselines."

@mbdash It is still worth an attempt to add the 60k speakers from CommonVoice to the encoder and increase the hidden layer size, to see if we can achieve open-source zero-shot voice cloning that is as good as the results they published. While you're preparing that dataset I will also read the GE2E paper to see if anything else should be changed for this experiment.

Edit: If you think VoxCeleb is too noisy of a dataset we can also try LibriSpeech + CommonVoice, or just CommonVoice alone.

mbdash commented 4 years ago

I am about to begin preprocessing CommonVoice. If there are any modifications you want to make before I start training, please let me know.

Update: mp3 returns the same errors as m4a... I guess I will have to convert to wav...

[screenshot of the error]

Update 2h later: I think I have converted 50%+ to wav.
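
For reference, the per-file conversion involved here boils down to something like this (a minimal sketch: the clips path is illustrative, ffmpeg is assumed to be on the PATH, and the 16 kHz target should match sampling_rate in encoder/params_data.py):

import subprocess
from pathlib import Path

def mp3_to_wav(mp3_path: Path, sampling_rate: int = 16000):
    # Convert a single mp3 to a mono wav at the encoder's sample rate, next to the original.
    wav_path = mp3_path.with_suffix(".wav")
    subprocess.run(["ffmpeg", "-y", "-i", str(mp3_path), "-ar", str(sampling_rate),
                    "-ac", "1", str(wav_path)], check=True)

for mp3 in Path("CommonVoice/en/clips").glob("*.mp3"):
    mp3_to_wav(mp3)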

ghost commented 4 years ago

If there are any modifications you want to make before I start training, please let me know.

Just one mod to make: model_hidden_size = 768 in encoder/params_model.py.

ghost commented 4 years ago

@mbdash One more request to make if training has not started yet. I just read https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/364#issuecomment-660069631 and would like to do the encoder training in 2 phases.

  1. Start training on LibriSpeech + CommonVoice only; it should converge relatively fast. Save off the model.
  2. Resume training on the model after adding VoxCeleb 1+2 to the training set

The question is, does the phase 2 model perform better than the model from phase 1? I forget how the SV2TTS folder is structured and if this can be easily implemented (maybe you could make a separate datasets_root/SV2TTS/encoder folder and use symbolic links to present only the selected datasets to the encoder). It would be a very interesting data point that would help those who want to do encoder training in the future.

Edit: Given that VCTK is a popular dataset it would be nice to include it either in phase 1 or 1.5 to ensure the resulting model performs well with it. But it is only 110 voices so just a drop in the bucket compared to the others, and not worth holding up the training for it.
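
On the symbolic link idea, a minimal sketch of what that could look like (paths are placeholders; it relies on the preprocessed speaker folders being prefixed with the dataset name, as in the directory listing earlier in this thread):

from pathlib import Path

full_dir = Path("<datasets_root>/SV2TTS/encoder")           # all preprocessed speakers
phase1_dir = Path("<datasets_root>/SV2TTS/encoder_phase1")  # LibriSpeech + CommonVoice only
phase1_dir.mkdir(exist_ok=True)

# Link only the non-VoxCeleb speaker folders into the phase 1 training directory.
for speaker_dir in full_dir.glob("*"):
    if not speaker_dir.name.startswith("VoxCeleb"):
        (phase1_dir / speaker_dir.name).symlink_to(speaker_dir, target_is_directory=True)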

mbdash commented 4 years ago

@blue-fish I am still converting CommonVoice mp3 to wav. It has been doing it for hours.

(I began downloading VCTK)

Can I simply move all the Celeb 1&2 folders out of the /SV2TTS/encoder folder and move them back in phase 2? Or should I restart the preprocessing, selecting only LibriSpeech + CommonVoice?

I did not look deep enough into the code to see if the preprocessing does anything other than populating the /SV2TTS/encoder folder.

Here is a sample of what the /SV2TTS/encoder folder looks like: [screenshot]

ghost commented 4 years ago

Can I simply move all the Celeb 1&2 folders out of the /SV2TTS/encoder folder and move them back in phase 2?

Good idea @mbdash, I looked at the code and I think that will work.

mbdash commented 4 years ago

Please note that with the CommonVoice files separated into subfolders, the preprocessing is interpreting each folder as a different speaker.

I do not know if the training quality will be affected by this, since random files from different speakers are mixed within those folders.

Also note this error: [screenshot]

The preprocessing seems to keep going even with the warning.

ghost commented 4 years ago

I do not know if the training quality will be affected by this, since random files from different speakers are mixed within those folders.

Thanks for bringing this up @mbdash . The whole point of the speaker encoder is to learn to distinguish voices from different speakers. If the folder name is used to uniquely ID the speaker then mixing will be disastrous. Is there any metadata in CommonVoice that can help sort things out before you preprocess? @sberryman can you share how you preprocessed CommonVoice for encoder training?

Edit: Would this issue still exist if you treat each CommonVoice subfolder as its own dataset?

sberryman commented 4 years ago

@blue-fish and @mbdash

You should cancel your current pre-processing. You need a unique folder per speaker in the pre-processed folder for the encoder. I wrote a little script to pre-process common voice dataset(s) for each language. It was run against a release of CV from at least a year ago. I doubt the format of validated.tsv has changed but just keep that in mind.

Script: https://github.com/sberryman/Real-Time-Voice-Cloning/blob/wip/scripts/cv_2_speakers.py

You'll need to adjust line 26 for the base directory of common voice. One of the arguments to the script is --lang which is just the subfolder for the language. Fairly useless if you plan to hardcode the path on line 26.

The other arguments are for the min and max number of audio segments per speaker. Feel free to adjust them based on your needs; I found that a minimum of 5 worked well for me.

So this loops over every speaker id in the validated.tsv file and groups the audio clips per speaker into a dictionary. Then it processes each speaker, grabs the first 20 chars of the speaker id and uses that for the path name in the pre-processed directory. Finally it uses ffmpeg to convert the mp3s to wav and downsamples to 16000 Hz. The sample rate is hardcoded, so if you want to adjust it, change line 93.

It takes a while but works great; I did the entire CV dataset for my encoder (all languages).

Edit: Also be very careful about lines 60-61. It will rmtree the output path! Edit 2: When this step has finished you can run the encoder pre-processing script against the {base_path}/{lang}/speakers directory.
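
For readers who don't want to dig through the linked script, the logic described above is roughly this (a condensed sketch, not the actual script; it assumes the standard validated.tsv columns client_id and path, and ffmpeg on the PATH):

import csv
import subprocess
from collections import defaultdict
from pathlib import Path

base_dir = Path("cv-corpus-5.1-2020-06-22/en")  # illustrative CommonVoice language folder
min_clips = 5                                   # skip speakers with fewer than 5 clips

# Group clip filenames by speaker id from validated.tsv.
clips_per_speaker = defaultdict(list)
with open(base_dir / "validated.tsv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        clips_per_speaker[row["client_id"]].append(row["path"])

# One folder per speaker (first 20 chars of the id); mp3s converted to 16 kHz mono wav.
for speaker_id, clips in clips_per_speaker.items():
    if len(clips) < min_clips:
        continue
    speaker_dir = base_dir / "speakers" / speaker_id[:20]
    speaker_dir.mkdir(parents=True, exist_ok=True)
    for clip in clips:
        wav_path = speaker_dir / (Path(clip).stem + ".wav")
        subprocess.run(["ffmpeg", "-y", "-i", str(base_dir / "clips" / clip),
                        "-ar", "16000", "-ac", "1", str(wav_path)], check=True)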

mbdash commented 4 years ago

@sberryman where can I find the validated.tsv?

sberryman commented 4 years ago

@mbdash There is a validated.tsv included in every language download from Common Voice.

For example:

  1. Download Greek dataset
  2. Extract el.tar.gz
  3. base_dir="./cv-corpus-5.1-2020-06-22/{lang}/"
  4. You will see several tsv files, validated.tsv being one of them.

The clips folder is where all the audio clips are stored and what I'm assuming you are running encoder pre-processing against.

My script will create a new folder called speakers in the base_dir and will then create a new sub folder for each speaker which will include the language (el) and the first 20 chars of the speaker_id provided by CV.

Once that script finishes you'll be able to run the encoder pre-processing against the {base_dir}/speakers directory.

ghost commented 4 years ago

@sberryman Thank you so much for the prompt and helpful replies.

mbdash commented 4 years ago

@sberryman yes, thank you for your quick response. I guess I might have "misplaced" a folder :-s

I will delete my extracted files, untar again and be more careful.

thank you

sberryman commented 4 years ago

@mbdash no need to apologize, none of this is really documented. Requires reading through tons of comments on issues, some of which are closed I'm sure.

It is neat to see such strong demand for this project.

I had an idea a while ago to create a platform for cloning facial images and speech and standardize the training process a bit by making it easy to swap out backbone architectures, etc. Then ideally people could join in the project in various capacities. Some might help label new data, add new datasets, contribute their GPU(s) to a training pool, etc. Then we could build a web based UI to interact and run inference on pre-trained models. It is clear to me that the UI aspect of this project made it very approachable to everyone. But fine tuning or changing out datasets is confusing. (I've been working on the visual side recently)

sberryman commented 4 years ago

I should also clarify that the script I linked to does NOT add the language to the speaker-specific folders. It just grabs the first 20 chars of the speaker_id provided by CV.

Ensure you are using the modifications I made to encoder/preprocess.py for preprocessing common voice. https://github.com/sberryman/Real-Time-Voice-Cloning/blob/wip/encoder/preprocess.py#L224-L244

mbdash commented 4 years ago

I can confirm I previously deleted the tsv files during the process... I can see them now that I am re-extracting the archive. I will also re-process using your script, since you specify a bitrate for the wav conversion, which I didn't do when I did it last weekend. I will keep you guys posted.

sberryman commented 4 years ago

@mbdash To be clear, I'm not specifying a bitrate but a sample rate. That sample rate needs to match what you have defined in encoder/params_data.py: https://github.com/sberryman/Real-Time-Voice-Cloning/blob/master/encoder/params_data.py#L9
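
A quick way to double-check the two match (a small sketch; it assumes you run it from the repo root so the encoder package is importable):

from encoder import params_data

# The conversion target used when building the wavs must equal this value.
print("encoder sampling rate:", params_data.sampling_rate)
assert params_data.sampling_rate == 16000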

mbdash commented 4 years ago

@sberryman thx for all the details and instructions, I just launched the script.

We will lose a lot of speakers: [screenshot]

mbdash commented 4 years ago

Now that this one is in progress, any advice for VCTK?

sberryman commented 4 years ago

@mbdash reducing the pool to 5 was a bare minimum in my opinion. You want the same speaker speaking multiple phrases, ideally with different environments and microphones. Considering Common Voice is free, it isn't surprising that 21,494 people haven't provided 5 or more samples.

I did way more than Common Voice and Librispeech. If you look through my WIP branch you'll see quite a few scripts to help process other datasets. Most of the other datasets were extremely easy to pre-process though.

VCTK: https://github.com/sberryman/Real-Time-Voice-Cloning/blob/wip/encoder/preprocess.py#L198-L210

sberryman commented 4 years ago

My entire goal of working with this project was the embedder (edit: encoder). If you want to learn a lot more about the embeddings you can read my conversation with Corentin on this issue: https://github.com/resemble-ai/Resemblyzer/issues/13

ghost commented 4 years ago

@sberryman I had no idea you already trained an English model with (768 hidden size / 256 output size) like we propose to do here. Are you able to share your model with us? We could resume the training where it left off.

sberryman commented 4 years ago

@blue-fish, I trained a model with 768 hidden and 768 output. I didn't drop the output back down to 256d (I had started training before Corentin replied.) Or maybe I did and completely forgot about it as I trained so many variations.

I know you read through my issue #126 as you commented on it. So remember, I only trained the encoder model. I didn't have any luck training the synthesizer or vocoder. I honestly don't remember what languages the uploaded model was trained with, but it did perform better than Corentin's pretrained model for my task of speaker identification.

Model: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/126#issuecomment-531350334

You can also try this, I'm assuming it is the English only model 😁 https://www.dropbox.com/s/3xvjobg130x0tr9/english_run.pt?dl=0

ghost commented 4 years ago

@sberryman I am inquiring about the 768/256 English model that you started training here: https://github.com/resemble-ai/Resemblyzer/issues/13#issuecomment-551269980 Right before you closed the issue it reached 922,500 steps. The dropbox model you linked is trained to 2,143,500 steps, not the right one (it looks like 768/768).

It must be nice to be so productive that you don't fully remember the details of everything you've worked on! 😄

sberryman commented 4 years ago

@blue-fish I just went through all the files I saved from training and I can't find the English model with 768/256.

ghost commented 4 years ago

@sberryman Sorry! I just noticed in the screenshots that the visdom environment says "mixed", so it looks like the 768/256 model is a mixed encoder. It was wishful thinking on my part to think it was English, but we will take it and finetune from there if you are willing to dig it up.

sberryman commented 4 years ago

@blue-fish I don't think an encoder trained on multiple languages would impact the synthesizer or vocoder. But that is just a hunch, I haven't run an experiment to test that theory.

ghost commented 4 years ago

@sberryman I share the same hunch and will run the experiment if you are able to locate the file. I even think it might work with Corentin's synth and vocoder.

Apologies for sending you on a wild goose chase earlier, but we still don't have the mixed encoder with the (768 hidden / 256 model) size that you developed in Nov-Dec 2019. If you can find the files from that experiment, it will save us several weeks of compute time.

In #126 you shared:

mbdash commented 4 years ago

I am re-downloading VCTK, maybe I got the wrong version, since the dataset I have is all .flac and not wav.

ghost commented 4 years ago

@mbdash That is not a mistake. VCTK comes in .flac. It will need to be converted. I suggest exclusively using mic1, since there were a few speakers for whom mic2 failed.

mbdash commented 4 years ago

Alright, I just launched the conversion from flac to wav and am only keeping mic1 files.

Update: beginning pretraining of CommonVoice, will do VCTK afterwards (doing it separately just in case...).
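
For reference, that kind of conversion is roughly the following (a sketch only: the directory layout and the _mic1 suffix are assumptions based on the VCTK 0.92 release, and ffmpeg is assumed to be on the PATH):

import subprocess
from pathlib import Path

vctk_root = Path("VCTK-Corpus-0.92/wav48_silence_trimmed")  # illustrative input path
out_root = Path("VCTK-Corpus-0.92/wav")                     # illustrative output path

# Convert only the mic1 recordings to 16 kHz mono wav, keeping one folder per speaker.
for flac in vctk_root.rglob("*_mic1.flac"):
    out_dir = out_root / flac.parent.name
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(["ffmpeg", "-y", "-i", str(flac), "-ar", "16000", "-ac", "1",
                    str(out_dir / (flac.stem + ".wav"))], check=True)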

mbdash commented 4 years ago

Hi, I tried to start the training

Here is the error I got; I will look at the code when I have some time. [screenshot]

Update: OK, it doesn't seem connected to the error above, but I can already see that the log files from preprocessing are used, and I got some invalid/empty ones.

ghost commented 4 years ago

Can you try making a separate test folder with a small subset of speakers and see if the training script will work on that? I recall having to experiment a bit to get it to work. (I have never trained an encoder, but I have tested the scripts to make sure training will work on CPU)

mbdash commented 4 years ago

I think I will preprocess again all the datasets for which the logs failed... and then try again one dataset at a time like you proposed.

ghost commented 4 years ago

@sberryman Pinging you again just in case you did not see https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-672039095

Specifically, if you can locate the 768/256 (hidden/model) mixed encoder from Nov-Dec 2019, that would be very much appreciated. All that was previously shared in #126 for mixed is 256/768.

ghost commented 4 years ago

Just had an idea @mbdash. We can take the English 768/768 model in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/458#issuecomment-671675613 and copy the weights for the LSTM layers into our new model. What needs to match is the input size (mel_n_channels=40) and the hidden layer size, which we're changing to 768.

Then we will fix those weights and only train the nn.Linear at the end, which takes the last hidden state (size 768) and projects it down to model_embedding_size=256. Once the loss comes down to an acceptable level, we will test the performance of the resulting model and resume training of the full model (allowing the LSTM weights to change) if not satisfied.

There are only 196,864 parameters to train (= 768*256 + 256 for the nn.Linear bias). It should train quickly.
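
A rough sketch of that warm start (untested; it assumes the lstm/linear attribute names from encoder/model.py, the loss_device convention from encoder/train.py, and this repo's {"model_state": ...} checkpoint format; the checkpoint path is illustrative):

import torch
from encoder.model import SpeakerEncoder

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
loss_device = torch.device("cpu")
model = SpeakerEncoder(device, loss_device)  # built with model_hidden_size=768, model_embedding_size=256

# Copy only the LSTM weights from the 768/768 donor checkpoint; its linear layer has a different shape.
checkpoint = torch.load("english_run.pt", map_location=device)
donor_state = checkpoint["model_state"]
lstm_state = {k: v for k, v in donor_state.items() if k.startswith("lstm.")}
model.load_state_dict(lstm_state, strict=False)

# Freeze the LSTM and train only the final projection (768*256 + 256 = 196,864 parameters).
for param in model.lstm.parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(model.linear.parameters(), lr=1e-4)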