CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Training from scratch #126

Closed sberryman closed 3 years ago

sberryman commented 5 years ago

Thanks for publishing the code and basic training instructions!

Environment

Datasets: (9,063 speakers)

I'm working on adding TEDLIUM_release-3, which would add 1,925 new speakers, and potentially SLR68, which would add 1,017 Chinese speakers but would require some cleanup, as there is a lot of silence in the audio files.

Hyperparameters: left all parameters untouched.

Encoder training:

39,300 steps: [screenshot]

115,900 steps (almost exactly 24 hours of training): [screenshot]

Typical step

Step 115950   Loss: 0.9941   EER: 0.0717   Step time:  mean:   889ms  std:  1320ms

Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:  449ms   std: 1317ms
  Data to cuda (10/10):                            mean:    3ms   std:    0ms
  Forward pass (10/10):                            mean:    8ms   std:    2ms
  Loss (10/10):                                    mean:   67ms   std:    7ms
  Backward pass (10/10):                           mean:  237ms   std:   26ms
  Parameter update (10/10):                        mean:  118ms   std:    3ms
  Extras (visualizations, saving) (10/10):         mean:    6ms   std:   18ms

Questions

  1. Will adding an additional ~2,900 speakers make much of a difference for the encoder?
    1. Will adding the remaining LibriTTS datasets (train-clean-100, train-clean-360, dev-clean, dev-other) with 1,221 speakers have any adverse effects on training the synthesizer and vocoder?
  2. Does using different languages in the encoder help or hurt?
  3. Does my encoder training thus far look okay? It appears it will take me roughly 7 days to train the encoder up to 846,000 steps.
  4. Can I train the encoder using 16,000Hz while training the synthesizer and vocoder using 24,000Hz? Or do I need to restart and train the encoder on 24,000Hz mel spectrograms?
  5. I've downloaded the source videos for TEDLIUM-3 so I can extract audio at up to 44,100Hz allowing me to expand the synthesizer and vocoder training dataset to TEDLIUM + LibriTTS at 24,000Hz.
  6. Based on other issues I've read, it appears you would like to use fatchord's Taco1 implementation. Would you advise I go that route vs. NVIDIA's Taco2 PyTorch implementation?
CorentinJ commented 5 years ago

Great work and great questions! I'll pin this issue for others in need of help.

Firstly, one thing I notice from your profiler output is that you would benefit from a 2x speedup by putting your data on a faster disk (or maybe by increasing the number of threads in the DataLoader if you set it too low).
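
As a rough illustration of the DataLoader point, here is a minimal sketch of raising num_workers on the repo's data loader, which is usually the first thing to try when "Blocking, waiting for batch (threaded)" dominates the profiler output. The dataset path is an assumption; point it at your preprocessed SV2TTS/encoder directory.

from pathlib import Path
from encoder.data_objects import SpeakerVerificationDataLoader, SpeakerVerificationDataset

dataset = SpeakerVerificationDataset(Path("E:/Datasets/SV2TTS/encoder"))  # hypothetical path
loader = SpeakerVerificationDataLoader(dataset,
                                       speakers_per_batch=64,
                                       utterances_per_speaker=10,
                                       num_workers=8)  # tune to your CPU and disk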

  1. Yes, adding more speakers is always good. Not including the entire LibriSpeech dataset was, I believe, a deliberate choice of the SV2TTS authors to highlight the transfer learning aspect of their framework, i.e. that a speaker encoder trained on some data will perform well on entirely new data (and for a different purpose too).
  2. That's a difficult question. Ideally you would have English-only speakers with a wide range of accents. I can't say that I have a definitive answer; however, if you were to include a wide variety of languages, I would recommend moving the speaker embedding size from 256 to 768 (as is done in SV2TTS). You could also do that for English-only speakers; it's just that I have found 256 to work well so far. A formal evaluation would require computing the EER, and that is still a grey area for me (see the end of section 3.3.3 of my thesis).
  3. Yes, your training looks like mine. You will see the clusters get tighter over time and the loss will continue decreasing steadily. If you have time, you can train for longer than I did (as I did not converge to 100%)
  4. You're technically perfectly fine with different sample rates. Any 24kHz audio you load/generate can simply be resampled to 16kHz for the encoder using librosa's resample function (see the short sketch after this list). I haven't tested the repo with different sampling rates, but I think I tried to make it possible to have different ones. I know there's an issue with that in the toolbox (https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/toolbox/__init__.py#L215); we can try to fix it when you need it.
  5. Ok, I didn't know about that dataset but it seems promising.
  6. I would greatly appreciate it if someone were to entirely replace the synthesizer with a PyTorch one. Both fatchord's and NVIDIA's would be fine.
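
A minimal resampling sketch for point 4, assuming librosa is installed; the file path is hypothetical.

import librosa

wav_24k, _ = librosa.load("some_utterance_24khz.wav", sr=24000)       # hypothetical path
wav_16k = librosa.resample(wav_24k, orig_sr=24000, target_sr=16000)   # feed this to the encoder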
sberryman commented 5 years ago

Thanks for the quick reply!

I also noticed the blocking operation taking a long time, which I found very strange, as the mel spectrograms are stored on a Samsung 960 EVO 1TB NVMe drive and SpeakerVerificationDataLoader has num_workers=16. CPU bounces around from about 50-80% utilization and the disk shows 4-18% busy. nvidia-smi is showing low utilization. Maybe I completely glossed over some code where you are reading from the wav audio files during training? That would explain it, as the wavs are sitting on a slow spinning disk.

  1. Thanks, I'll work on adding in TEDLIUM-3 into the encoder training set.
  2. I'll restart training with an embedding size of 768 by adjusting model_embedding_size = 768 in https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/encoder/params_model.py#L4. Would you adjust the model_hidden_size or any other parameters?
  3. I have noticed them continue to tighten up even with multiple (very diverse) languages and an embedding size of 256.
  4. Good to know on different sampling rates. Do you think I would be better off up-sampling the 16kHz to 24kHz for the embedding and down-sampling the remaining to 24kHz? VoxCeleb(1/2) and VCTK are in 16 kHz while the remaining speakers are in 24 kHz or ~44kHz.
  5. It is a great dataset with a wide range of accents, they only provide the data in 16kHz but it is easy to find the source videos and extract 44kHz audio that aligns perfectly.
  6. Once I get to synthesizer training I'll replace your code with fatchord's or nvidia's.

Edit: The other thing I thought about for speeding up IO would be stacking the numpy files for each speaker into a single file, as sequential reading is much faster. I would only have to open 10 files per step vs 100. I have plenty of memory in the computer I'm using for training, so maybe that won't be an optimization many others could benefit from?

Edit 2: I've gone through all the numpy files for each speaker, saved them into a combined file using np.savez, and adjusted the code in encoder/data_objects/speaker.py and encoder/data_objects/utterance.py accordingly (a rough sketch of the bundling step is shown right below). I'm now getting a much more consistent and lower load time for the data. Obviously, increasing the embedding size from 256 to 768 has almost tripled the backward pass duration. Funnily enough, my overall step time has remained about the same even though the embedding size tripled, so I consider that a win!
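
For reference, a rough sketch of that bundling step; the directory layout and key names here are assumptions, not the actual code from my fork.

from pathlib import Path
import numpy as np

speaker_dir = Path("SV2TTS/encoder/speaker_0001")          # hypothetical speaker folder
utterances = {npy.stem: np.load(npy) for npy in sorted(speaker_dir.glob("*.npy"))}
np.savez(speaker_dir / "utterances.npz", **utterances)     # one archive per speaker

# At train time the archive is opened once and utterances are read by key.
with np.load(speaker_dir / "utterances.npz") as archive:
    first_utterance_frames = archive[archive.files[0]]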

Step   1030   Loss: 3.2002   EER: 0.2662   Step time:  mean:   871ms  std:    58ms

Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:  103ms   std:   26ms
  Data to cuda (10/10):                            mean:    3ms   std:    0ms
  Forward pass (10/10):                            mean:    7ms   std:    1ms
  Loss (10/10):                                    mean:   73ms   std:    3ms
  Backward pass (10/10):                           mean:  569ms   std:   67ms
  Parameter update (10/10):                        mean:  116ms   std:    3ms
  Extras (visualizations, saving) (10/10):         mean:    1ms   std:    4ms

Edit 3: I wasn't happy with the backward pass duration, so I made the backward pass run on the GPU. This is what I'm looking at now...

Step    310   Loss: 3.6576   EER: 0.3275   Step time:  mean:   425ms  std:   233ms

Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:  104ms   std:  122ms
  Data to cuda (10/10):                            mean:    3ms   std:    0ms
  Forward pass (10/10):                            mean:   39ms   std:    1ms
  Loss (10/10):                                    mean:   23ms   std:    1ms
  Backward pass (10/10):                           mean:   80ms   std:    5ms
  Parameter update (10/10):                        mean:  121ms   std:    2ms
  Extras (visualizations, saving) (10/10):         mean:    1ms   std:    3ms

..........
Step    320   Loss: 3.6723   EER: 0.3339   Step time:  mean:   322ms  std:    98ms

Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:   60ms   std:   97ms
  Data to cuda (10/10):                            mean:    3ms   std:    0ms
  Forward pass (10/10):                            mean:   39ms   std:    0ms
  Loss (10/10):                                    mean:   22ms   std:    1ms
  Backward pass (10/10):                           mean:   77ms   std:    4ms
  Parameter update (10/10):                        mean:  121ms   std:    2ms
  Extras (visualizations, saving) (10/10):         mean:    2ms   std:    4ms

..........
Step    330   Loss: 3.6419   EER: 0.3309   Step time:  mean:   362ms  std:   140ms

Average execution time over 10 steps:
  Blocking, waiting for batch (threaded) (10/10):  mean:   97ms   std:  139ms
  Data to cuda (10/10):                            mean:    3ms   std:    0ms
  Forward pass (10/10):                            mean:   39ms   std:    1ms
  Loss (10/10):                                    mean:   24ms   std:    3ms
  Backward pass (10/10):                           mean:   78ms   std:    4ms
  Parameter update (10/10):                        mean:  121ms   std:    1ms
  Extras (visualizations, saving) (10/10):         mean:    1ms   std:    3ms
csu-xiao-an commented 5 years ago

thank you

CorentinJ commented 5 years ago
  1. Yes, sorry, you should adjust the hidden layer size as well. The way it is done in the GE2E paper is that all recurrent layers have an output of 768 but are projected down to 256 dimensions before being fed to the next (a sketch of this idea follows this list). If you want to implement that you'll have to change the network architecture; but if it trains fast enough with 768 as the hidden size, then you're fine.
  2. Oh it's definitely going to work fine on different languages. The question is whether you'll manage to achieve an EER as low as on a single language dataset, and by extension a voice transfer that is just as good.
  3. Hmm, you can give that a shot. You should listen to the quality of downsampled/upsampled audios to see what gives (you can do that in a REPL prompt with sounddevice)
  4. I disagree. A whole lot of the source videos were removed from youtube. I know because I tried to guess the source language from the source videos.
  5. Great. I personally recommend fatchord's (I have played around with and analyzed both repos already). If you feel like Tacotron 1 might be a downgrade from Tacotron 2, know that it isn't - Tacotron 1 is still used more often than Tacotron 2 in the literature. Fatchord's samples are also great. Know that if you reimplement the synthesizer, you will probably have to change some things so that the data format on the vocoder side is right. We can talk about that again then.
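
A hedged sketch of the projection idea from point 1, using torch.nn.LSTM's proj_size argument (available in PyTorch 1.8+); this illustrates the GE2E architecture, not what the repo's encoder currently does.

import torch
from torch import nn

# Three recurrent layers, 768 hidden units each, projected to 256 before the next layer.
lstm = nn.LSTM(input_size=40, hidden_size=768, num_layers=3,
               proj_size=256, batch_first=True)

mels = torch.randn(4, 160, 40)   # (batch, partials_n_frames, mel_n_channels)
out, _ = lstm(mels)              # out.shape == (4, 160, 256)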

There are quite a few ways to gain disk reading speedups for the encoder, but don't forget that you still need variety in the samples/batches. Another bottleneck is the GPU VRAM not being entirely used. Since the complexity of the forward/backward pass is cubic w.r.t. the batch size, you would need to put multiple batches in parallel on the same GPU rather than using a larger batch size. It's something worth looking into.

I had no idea you could specify running the backward pass on the GPU; how did you do that?

sberryman commented 5 years ago

Thanks for the continuous feedback.

  1. Unfortunately I didn't have the patience to wait for your response, so it has been training with the model and data parameters shown below.
## Model parameters:
learning_rate_init: 0.0001
model_embedding_size: 768
model_hidden_size: 256
model_num_layers: 3
speakers_per_batch: 64
utterances_per_speaker: 10

## Data parameters:
audio_norm_target_dBFS: -30
inference_n_frames: 80
mel_n_channels: 40
mel_window_length: 25
mel_window_step: 10
partials_n_frames: 160
sampling_rate: 16000
vad_max_silence_length: 6
vad_moving_average_width: 8
vad_window_length: 30
  1. I trained with ~9,000 speakers (mixed languages but mostly English) through step 352,600 and included the UMAP projections for that below. I then remembered the Common Voice project from Mozilla and downloaded the entire thing. I placed all the individual speakers into unique folders and pruned all the speakers that didn't have 10 or more utterances (a sketch of that pruning step follows this list). I then resumed training with the combined datasets, bringing the total to 25,668 speakers. [UMAP projections: stack_run_umap_358500, stack_run_umap_358600]

  2. Thanks, but I'll hold off on changing the sample rate for now; I'm already adjusting a lot.

  3. I didn't download them from YouTube; they are available for download from TED.com at https://www.ted.com/talks/quick-list?page=1 and the alignments match TEDLIUM-3. The transcripts available from TED are of higher quality than the ones in the TEDLIUM-3 dataset, but their alignments don't match due to the TED splash screen/banner that plays at the beginning.

  4. Sounds good, Fatchord's version it is! Perfect timing, as another person using this repository (@TheButlah) has just made a lot of improvements and included multi-GPU training.
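
A rough sketch of the Common Voice pruning step mentioned in item 1; the paths, file extension, and threshold are assumptions rather than my actual script.

from pathlib import Path
import shutil

dataset_root = Path("datasets/common_voice_by_speaker")    # hypothetical layout: one folder per speaker
min_utterances = 10

for speaker_dir in [d for d in dataset_root.iterdir() if d.is_dir()]:
    if len(list(speaker_dir.glob("*.mp3"))) < min_utterances:
        shutil.rmtree(speaker_dir)                         # too few utterances to sample from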

The combined npz files have been working great for me; the loader reads all the utterances for a speaker and still uses your same sampling code to grab a random sample per speaker. The only thing I removed is loading from individual npy files.

I'm not sure "moving the backward pass to the GPU" is the right description; either way, GPU utilization is much higher and the profiler is showing significantly lower mean durations for "Backward pass". What I changed is loss_device, so the loss runs on the GPU.

Then on https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/encoder/model.py#L27-L28

self.similarity_weight = nn.Parameter(torch.tensor([10.]).to(loss_device))
self.similarity_bias = nn.Parameter(torch.tensor([-5.]).to(loss_device))

I simply moved the tensor (not the parameter) to the GPU, and changed the GPU sync in train.py to:

def sync(device: torch.device):
    # FIXME
    # return
    # For correct profiling (cuda operations are async)
    if device.type == "cuda":
        # torch.cuda.synchronize(device)
        torch.cuda.synchronize()

I'm now up to step 447,200 and have included the loss and UMAP plots to show progress. I also changed the UMAP visualization to show 30 speakers by adding more colors to the color map.

[Loss and EER training curves]

[UMAP projections: cv_run_umap_447200, cv_run_umap_447300]

New color map

import numpy as np

colormap = np.array([
    [32, 25, 35],
    [255, 255, 255],
    [252, 255, 93],
    [125, 252, 0],
    [14, 196, 52],
    [34, 140, 104],
    [138, 216, 232],
    [35, 91, 84],
    [41, 189, 171],
    [57, 152, 245],
    [55, 41, 79],
    [39, 125, 167],
    [55, 80, 219],
    [242, 32, 32],
    [153, 25, 25],
    [255, 203, 165],
    [230, 143, 102],
    [197, 97, 51],
    [150, 52, 28],
    [99, 40, 25],
    [255, 196, 19],
    [244, 122, 34],
    [47, 42, 160],
    [183, 50, 204],
    [119, 43, 157],
    [240, 124, 171],
    [211, 11, 148],
    [237, 239, 243],
    [195, 165, 180],
    [148, 106, 162],
    [93, 76, 134],
    [0, 0, 0],
    [183, 183, 183],
], dtype=float) / 255  # note: np.float is deprecated in recent NumPy releases; the builtin float behaves the same here
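
A hedged usage sketch, reusing numpy and the colormap above to color UMAP projections of utterance embeddings; the embeddings and labels are random stand-ins, and umap-learn is assumed to be installed.

import matplotlib.pyplot as plt
import umap

embeds = np.random.rand(300, 768).astype(np.float32)   # stand-in utterance embeddings
speaker_ids = np.repeat(np.arange(30), 10)              # 30 speakers x 10 utterances each

projections = umap.UMAP().fit_transform(embeds)
plt.scatter(projections[:, 0], projections[:, 1], c=colormap[speaker_ids], s=8)
plt.gca().set_aspect("equal")
plt.show()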
CorentinJ commented 5 years ago

Ah, I had put a warning not to compute the loss on the GPU because for some reason it wasn't working (either it was some intricacy of torch or I forgot to enable grad on some tensor) and would return None. If that works, then I should update the repo to make it the default and have only one device for the encoder.

sberryman commented 5 years ago

You are correct, it was not working until I changed the two lines to move the tensor (not the parameter) to the GPU. That was all I had to change, I believe; if not, I can dig through all my changes and help you isolate the fix. Technically I changed loss_device to loss_device = device just so I didn't miss anything in train.py. Either way, only one GPU is exposed to the docker container I use for training.

Also, in the sync function, I had to remove the device parameter and simply use torch.cuda.synchronize().

Clusters are getting tighter, but I plan on training until at least 700-900k steps. I'm also tempted to train an English-only model to compare.

TheButlah commented 5 years ago

@sberryman will you be submitting a pull request? I'd be very interested to see the results of using more data for the speaker encoder - the GE2E paper demonstrated that having more data for the encoder is critical to getting the similarity of the cloned speaker close to the original.

Also, in my own experience, the compatibility of Fatchord's Taco1 with WaveRNN makes it a great candidate, and the codebase is easy to work with. I still believe that Taco2 would be an upgrade in terms of the quality of the speaker's inflection, but the out-of-the-box compatibility of Fatchord's synthesizer with the vocoder makes it a natural choice.

Do note that Fatchord's synthesizer does not support multiple speakers, so you would need to add that capability yourself (and a PR on Fatchord's repo adding that capability would be especially appreciated :) )
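
For reference, a hedged sketch of what the SV2TTS-style multi-speaker conditioning boils down to: broadcast the speaker embedding over time and concatenate it to the text-encoder outputs before attention. Shapes and names are illustrative, not Fatchord's actual code.

import torch

encoder_outputs = torch.randn(8, 120, 256)   # (batch, text_steps, encoder_dim)
speaker_embeds = torch.randn(8, 256)         # one utterance embedding per item

speaker_embeds = speaker_embeds.unsqueeze(1).expand(-1, encoder_outputs.size(1), -1)
conditioned = torch.cat((encoder_outputs, speaker_embeds), dim=-1)   # (8, 120, 512)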

ViktorAlm commented 5 years ago

I'm also very interested in the results. I'm currently training the encoder on about 2k speakers in Swedish and about 4k mixed, mainly English. I would really like to see examples from your encoder model on multiple languages, to see whether it's worth crawling radio and TV shows with Resemblyzer's diarization to create a fully Swedish dataset, or whether 6k speakers with 1/3 being Swedish can compare to 25k mixed, mainly English, for Swedish voice cloning. My hunch is: more data.

sberryman commented 5 years ago

Current:

I'm at ~700k steps and there are still quite a few tight clusters. I'm not sure if this is due to the fact that I trained for 350k steps on 9,000 speakers before adding 16,668 more (which also introduced quite a few more languages). I'm going to continue training for another 200k steps, which will be done by this time tomorrow morning.

[UMAP projection screenshots]

To-Do:

  1. I'm going to start a new training run for English only (there are a few non-English speakers) with ~17,680 speakers using a 768/768 (hidden/embedding) size.
  2. Once the mixed set reaches ~900k steps I will stop it and start over from scratch with 768/768, as it is currently training with 256/768 (256 hidden and 768 embedding size); I wasn't aware I had to bump both to 768.

Comments

@TheButlah

First, thanks for the massive PR that landed on Fatchord's WaveRNN 4 days ago; I'm really excited you added multi-GPU training and mels in numpy format! To your question on a PR, I can certainly submit PRs to this repo and WaveRNN. The code to utilize most of the datasets from OpenSLR and Common Voice is a bit of a hack, but if people want it I'm open to working on a PR for that as well.

Thanks for the feedback on Taco1 and WaveRNN from Fatchord's repo; that is the route I will go. I will most likely run into issues adding multi-speaker support, but I will open an issue in that repo when I get there.

@ViktorAlm

Great to hear about someone else testing multiple languages! Have you changed any of the data or model parameters? Funny you mentioned using Resemblyzer's diarization, as I've had a tab open to that code for a few days and planned on using it against 7,000 hours of local (English) news video I have, once I finish training a new model.

As far as sharing the models I'm training, I'm open to it. Here is the model trained to 697,500 steps (768 embedding size, 256 hidden size): https://www.dropbox.com/s/2b5g2rt4vypx9qq/cv_run_bak_697500.pt?dl=0

Would be interested to know how it performs against your Swedish data @ViktorAlm.
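
A hedged sketch of loading that checkpoint with this repo's SpeakerEncoder; note that encoder/params_model.py has to match the checkpoint's sizes (model_embedding_size = 768, model_hidden_size = 256) or load_state_dict will fail.

from pathlib import Path
import torch
from encoder.model import SpeakerEncoder

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SpeakerEncoder(device, device)                     # run the model and loss on the same device
checkpoint = torch.load(Path("cv_run_bak_697500.pt"), map_location=device)
model.load_state_dict(checkpoint["model_state"])
model.eval()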

ViktorAlm commented 5 years ago

Thanks! I have not changed any params. I was at step 150k with my data, trying to do a real run with all the models. I did one run where I only did 100k steps on each model, with about 900 Swedish speakers and about 90 GB of data in total. It did not clone the voice, but it produced good audio quality, and at least a male voice came out when I ran my own voice through it. I paused it and did a quick test with yours, and the encoding result is way better than the small test run I did.

Swedish and Norwegian are pretty similar. I didn't see any specific Swedish/Norwegian cluster grouping, but I only did two tests, and I guess UMAP might remove any visible difference.

Here's a converter if you wish to add Norwegian, Danish, and Swedish data to your mix: https://github.com/ViktorAlm/Nasjonalbank-converter

I also added some results from your encoder in /Results.

When I've played around a bit more I might make a script that evaluates different languages better.

sberryman commented 5 years ago

@ViktorAlm Thanks for sharing!

Is your Swedish and Norwegian dataset private? I'm up for including those speakers in the next training run, where I'll use 768 for the hidden/embedding size, if you can share. There are only 20 Swedish voices among the 25,668 speakers I am training on, and zero Norwegian. Common Voice had 44 Swedish speakers, but I filtered those down to 20 as I set a floor of 12 unique utterances per speaker.

Other updates

  1. I started the English-only 768/768 training, which takes significantly longer per step (about 4x), so don't expect those results for a while. Progress looks good so far though, and it is only on step 8,100! [UMAP projection screenshots]
  2. I've reduced the learning rate from 1e-4 to 1e-5 on the mixed dataset, which seems to help (a rough sketch of this kind of change follows below). I'll probably drop it down to 1e-6 around step 800-850k.

If anyone else is aware of other datasets I can include please let me know!
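
A minimal sketch of the learning-rate change mentioned in item 2 above; the encoder has no scheduler, so one way (an assumption, not the repo's code) is to override the optimizer's param groups after resuming from a checkpoint.

import torch

model = torch.nn.Linear(40, 256)                           # stand-in for the encoder
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# ...load the checkpoint's model/optimizer state here, then lower the rate:
for param_group in optimizer.param_groups:
    param_group["lr"] = 1e-5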

ViktorAlm commented 5 years ago

Nice!

I edited my old comment because I did not want to clutter your thread with my bad screenshots. I added my converter with links to the datasets. It's very hacky, and if you want to add them I really should clean up the code some. I think a simple merge of the folders, then looping through to get the SPLs (files with info on location, etc.) and loading the files, would be better than my weird way of scanning the folders. I was testing on just one of the extracted folders, and the speech folders did not contain the wavs specified in the SPL file; everything went weird from there.

https://github.com/ViktorAlm/Nasjonalbank-converter

CorentinJ commented 5 years ago

Just in case this wasn't clear, Resemblyzer is also my project and is merely an interface to the speaker encoder of this repo. You can replace the pretrained model in the package and put yours instead. I could also distribute models that you provide me for other languages.

CorentinJ commented 5 years ago

I also would like to leave my script for evaluating the EER over the test set. It's not clean and I'm not sure if it's correct either (given that the right procedure for evaluating the EER over a dataset isn't documented anywhere). You should use this if you want to formally evaluate the performance of the speaker encoder.

If someone manages to make it better then I would gladly include it in the repo

from encoder.data_objects import SpeakerVerificationDataLoader, SpeakerVerificationDataset
from encoder.model import SpeakerEncoder
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import torch

# This is my script for computing the test EER.
dataset_root = r"E:\Datasets\SV2TTS\encoder_test"

if __name__ == '__main__':
    speakers_per_batch = 32
    steps = 100

    dataset = SpeakerVerificationDataset(Path(dataset_root))

    model = SpeakerEncoder(torch.device("cuda"), torch.device("cpu"))
    checkpoint = torch.load("saved_models/pretrained.pt")
    model.load_state_dict(checkpoint["model_state"])
    model.eval()

    results = []
    utterance_counts = range(6, 8)  # enrollment utterance counts to evaluate
    for utterances_per_speaker in utterance_counts:
        loader = SpeakerVerificationDataLoader(
            dataset,
            speakers_per_batch=speakers_per_batch,
            utterances_per_speaker=utterances_per_speaker,
            num_workers=8,
        )
        with torch.no_grad():
            eers = []
            for step, speaker_batch in zip(range(1, steps + 1), loader):
                inputs = torch.from_numpy(speaker_batch.data).cuda()
                embeds = model(inputs)
                embeds_loss = embeds.view((speakers_per_batch, utterances_per_speaker, -1)).cpu()
                _, eer = model.loss(embeds_loss)

                eers.append(eer)
                print("Step %d    EER: %.3f" % (step, np.mean(eers)))
        results.append(np.mean(eers))

    plt.plot(utterance_counts, results)
    plt.xlabel("Enrollment utterances")
    plt.ylabel("Equal Error Rate")
    plt.show()
CorentinJ commented 5 years ago

Also I don't know about that:

I've reduced the learning rate from 1e-4 to 1e-5 on the mixed dataset which seems to help. I'll probably drop it down to 1e-6 around step 800-850k.

  1. I've left my lr at 1e-4 all along; I think you should be fine with that same value as well.

  2. Don't forget that I never managed to fully train my speaker encoder. I trained it for 1M steps, but the authors of SV2TTS trained theirs for 50M steps. You should aim for more if you can.

sberryman commented 5 years ago

Thanks @CorentinJ

I'm well aware Resemblyzer is your project; that is how I ended up finding this one. Thanks for open-sourcing that project as well. Looking forward to seeing what your next project is!

Thanks for the test script. I was thinking about how I was going to evaluate the models I'm training, and it would be great to compare them to your public model. Originally I was just going to plot a random 5-10 utterances for every single speaker to get an idea of the overall distribution.

Interesting point on not adjusting the learning rate; I'm more accustomed to training image classification models, where reducing/decaying the learning rate is almost a requirement. I will not adjust the learning rate any further, then.

I was not aware the SV2TTS authors trained for 50M steps; obviously it is time for me to read their paper.

Also, this is turning into more of a discussion than an "issue". I'm happy to move it to another location or to continue using GitHub issues; completely up to you.

Thanks again!

CorentinJ commented 5 years ago

Nah, it's common for issues to serve a broader purpose than just solving bugs. I don't decay the learning rate simply because it's not a necessity with Adam. The original authors did not use Adam, and they did decay the learning rate, by the way. Also, you will have to read GE2E to know more about the speaker encoder, because there isn't much info in SV2TTS about how they train or evaluate it.

slavaGanzin commented 5 years ago

@sberryman Shaun, it would be awesome if you created a PR. If you don't feel it's polished enough, just mark it WIP. Then it won't be merged, but it can serve as inspiration for others :)

sberryman commented 5 years ago

@slavaGanzin I have pushed my work in progress to my own fork. There are hard-coded paths and changes related to grouping all the .npy files into a single .npz for each speaker. I also use docker and volume mappings, so I left the basic Dockerfile in there. I don't plan on ever submitting a PR for that branch, as I'm still experimenting quite heavily. Basically, feel free to use any of the scripts as a starting point, but don't count on them working out of the box.

https://github.com/sberryman/Real-Time-Voice-Cloning/tree/wip

Other updates

  1. The mixed model with 256 hidden and 768 embedding size has finally hit 1,000,000 steps. Based on feedback from @CorentinJ I'm going to let that continue training for a while longer. [loss curve]

The model trained to 1,005,000 steps is available on my dropbox account now: https://www.dropbox.com/s/69wv21ajt6l2pag/cv_run_bak_1005000.pt?dl=0

  2. The English-only model is progressing VERY slowly! [loss curve]
Jessicamat777 commented 5 years ago

Hi sberryman, may I ask which languages the trained model on your Dropbox supports?

Jessicamat777 commented 5 years ago

I need a Chinese pretrained model for a project in grad school. Can you guide me on that?

sberryman commented 5 years ago

@Jessicamat777 the models I have uploaded to Dropbox are all for experimentation, and I have NOT trained the synthesizer or vocoder to go with them yet. So they will be of little value unless you want to use them with CorentinJ's Resemblyzer.

That being said, the models on Dropbox were trained on the following datasets.

  1. LibriTTS (train-other-500)
  2. VoxCeleb1
  3. VoxCeleb2
  4. OpenSLR (42-44, 61-66, 69-80)
  5. VCTK
  6. Common Voice

The vast majority of the speakers are English. Based on a very small sampling of languages it has NOT been trained on, it doesn't appear the foreign speakers make much of a difference. That is most likely due to the unbalanced training set and the extremely small number of speakers per additional language; I just wanted to see if including foreign languages while training made a difference. The clusters for foreign languages are okay, but nowhere near as well defined as those for English speakers.

Look at this issue where I show how my model(s) perform against the one trained by CorentinJ on Swedish and Norwegian. https://github.com/resemble-ai/Resemblyzer/issues/9

I haven't made an effort to train on Chinese but it shouldn't be difficult if you have enough data. CorentinJ has done a great job of documenting the training process and answering questions on what size dataset you would need to train from scratch.

Jessicamat777 commented 4 years ago

Thanks for the reply.

Can I use multiple GPUs to train the encoder and combine the results into one model at the end, so as to save training time?

Please let me know.


sberryman commented 4 years ago

@Jessicamat777 multi-GPU training is NOT implemented. If you do implement it, can you please submit a pull request to this repository so others can benefit?
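
Not this repo's code, but for context: the usual PyTorch starting point is nn.DataParallel, sketched below on a stand-in module. Be aware that the GE2E loss compares every utterance in a batch against every speaker centroid, so naively splitting batches across GPUs changes what the loss sees; treat this as a starting point only.

import torch
from torch import nn

model = nn.LSTM(input_size=40, hidden_size=256, num_layers=3, batch_first=True)  # stand-in for the encoder
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # replicates the module and splits each batch across GPUs
model = model.to("cuda")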

sberryman commented 4 years ago

Training is still progressing on the mixed and English models. This is just an update for anyone following this issue.

Mixed

[training curve screenshot]

English

[training curve screenshot]

shawwn commented 4 years ago

Training the encoder is interesting, but I'm not entirely convinced that the problem is the encoder. (Where "problem" is defined as "the current model has a lot of trouble reproducing female voices accurately.")

Are we certain that for every possible human voice, there exists an embedding which allows tacotron2 to produce spectrograms indistinguishable from that voice?

If not, then it seems beneficial if tacotron2 were trained on the new diverse speech dataset in addition to the encoder.

For example, in my experiments it has seemed impossible to generate spectrograms with cartoon-style inflections: lots of expressive vocalizations, rapid pitch changes, and so on.

If that's how a speaker sounds normally, then it seems like it's impossible for the encoder to generate any latent vector that would cause tacotron2 to produce spectrograms that sound anything like the speaker.

Perhaps I am confused, but just to confirm: there are three separate things that need to be trained, right? The encoder, the synthesizer (text to spectrogram), and the vocoder (spectrogram to wav). This training process is focusing entirely on the encoder. How is the loss being calculated? If the loss is calculated in terms of "tacotron2 is able to generate spectrograms that sound more like this speaker," then the training here will not have a huge impact on overall quality or diversity. The training would need to be done on the synth, then the encoder.

Do I have this backwards? Is it true that the encoder's final quality is bounded by the expressiveness of the synth? If that's correct, then the synth is what would benefit from the larger dataset.

CorentinJ commented 4 years ago

Training the encoder is interesting, but I'm not entirely convinced that the problem is the encoder. (Where "problem" is defined as "the current model has a lot of trouble reproducing female voices accurately.")

It's not intuitive, I agree. However, this is clearly the conclusion the authors of the SV2TTS paper reached. They argue that most of the ability to clone voices lies in the training of the encoder. They also clearly show that the framework has limitations (which we observe in this repo as well):

An additional limitation lies in the model’s inability to transfer accents. Given sufficient training data, this could be addressed by conditioning the synthesizer on independent speaker and accent embeddings. Finally, we note that the model is also not able to completely isolate the speaker voice from the prosody of the reference audio, ...

If you give a listen to their librispeech samples, you will notice that as well.

sberryman commented 4 years ago

Training updates

Encoder

I've stopped training both the mixed and English encoders; the mixed encoder reached just over 2.1 million steps with 27,432 speakers.

Synthesizer

Since I'm using LibriTTS I had to make some changes to the code base. First I used the Montreal Forced Aligner to come up with the alignments. Then I realized Google had already normalized the audio and removed the leading and trailing silence, so at that point I just skipped the alignment portion of preprocessing and used the original transcript (as opposed to the normalized one, which is also provided) with all punctuation and capitalization left in place. I know the English cleaner converts everything to lowercase though.

I started training last night across two GTX 1080 Tis, and GPU utilization bounces between 20% and 93%.

Overridden hparams:

Training progress

TensorBoard

[TensorBoard screenshots]

Stdout

Step   27753 [1.664 sec/step, loss=0.68117, avg_loss=0.67622]
Step   27754 [1.690 sec/step, loss=0.64809, avg_loss=0.67585]
Step   27755 [1.687 sec/step, loss=0.68754, avg_loss=0.67603]
Step   27756 [1.686 sec/step, loss=0.67575, avg_loss=0.67593]
Step   27757 [1.675 sec/step, loss=0.65758, avg_loss=0.67573]
Step   27758 [1.684 sec/step, loss=0.66391, avg_loss=0.67550]
Step   27759 [1.687 sec/step, loss=0.66689, avg_loss=0.67528]
Step   27760 [1.710 sec/step, loss=0.66279, avg_loss=0.67525]
Step   27761 [1.681 sec/step, loss=0.69119, avg_loss=0.67565]
Step   27762 [1.679 sec/step, loss=0.67129, avg_loss=0.67552]
Step   27763 [1.677 sec/step, loss=0.69174, avg_loss=0.67563]
Step   27764 [1.693 sec/step, loss=0.65657, avg_loss=0.67544]
Step   27765 [1.692 sec/step, loss=0.66381, avg_loss=0.67518]
Step   27766 [1.672 sec/step, loss=0.70290, avg_loss=0.67546]

Plots

[Alignment and mel-spectrogram plots at steps 22k, 24k, and 26k]

WAVs

wavs.zip

Questions:

  1. Is it normal for the max_gradient_norm, stop_token_loss and regularization_loss to be increasing? Basically, do the tensorboard plots look okay?
  2. How many steps did you train the synthesizer?
  3. How many steps did you train the vocoder?
CorentinJ commented 4 years ago

I don't know about TensorBoard; I didn't use it back then. As for the number of steps, you can check the pretrained models page.

sberryman commented 4 years ago

Thanks @CorentinJ, somehow I had missed the pretrained models wiki page. FYI, I still plan on figuring out fatchord/WaveRNN, but I wanted a baseline version using your codebase.

This is a fun exercise, thanks for your patience!

sberryman commented 4 years ago

Synthesizer training is ongoing, but I'm running into the same issues @CorentinJ ran into with LibriTTS, where it fails to align. Since I skipped the silence-splitting and noise-reduction code, I guess I'm not too surprised. What I am wondering is: what is the impact of failing to align? The spectrograms and the wav files generated while training are easily distinguishable.

Edit: Since it is failing to align, is it worth training the vocoder, or would you suggest I continue training the synthesizer for a few more days/weeks to see if it improves?

Mixed language encoder

[Alignment and mel-spectrogram plots at steps 192k, 194k, and 196k]

English only encoder

[Alignment and mel-spectrogram plots at steps 190k, 192k, and 194k]

sberryman commented 4 years ago

Training update

Synthesizer

I've stopped training synthesizers for both the English and mixed datasets.

Vocoder

Started training a vocoder for each of the synthesizer models using the default hyperparameters with the following overrides:

Mixed

mixed.zip

Stdout:

{| Epoch: 1 (1158/1158) | Loss: 4.6526 | 1.4 steps/s | Step: 1k | }
{| Epoch: 2 (1158/1158) | Loss: 4.1365 | 1.4 steps/s | Step: 2k | }
{| Epoch: 3 (1158/1158) | Loss: 4.0376 | 1.4 steps/s | Step: 3k | }
...
{| Epoch: 75 (1158/1158) | Loss: 3.6903 | 1.4 steps/s | Step: 86k | }
{| Epoch: 76 (1158/1158) | Loss: 3.6877 | 1.4 steps/s | Step: 88k | }
{| Epoch: 77 (1158/1158) | Loss: 3.6839 | 1.4 steps/s | Step: 89k | }

Included files:

English

english.zip

Stdout:

{| Epoch: 1 (1808/1808) | Loss: 4.5359 | 1.4 steps/s | Step: 1k | }
{| Epoch: 2 (1808/1808) | Loss: 4.0721 | 1.4 steps/s | Step: 3k | }
{| Epoch: 3 (1808/1808) | Loss: 3.9830 | 1.4 steps/s | Step: 5k | }
...
{| Epoch: 30 (1808/1808) | Loss: 3.7225 | 1.4 steps/s | Step: 54k | }
{| Epoch: 31 (1808/1808) | Loss: 3.7228 | 1.4 steps/s | Step: 56k | }
{| Epoch: 32 (1808/1808) | Loss: 3.7173 | 1.4 steps/s | Step: 57k | }

Included files:

Overall I would say the vocoders are starting to sound okay, and it appears they are working even though the synthesizer never aligned. According to the pretrained models page, you trained the vocoder for 428k steps; I'll let these two models train to a similar number of steps.

sberryman commented 4 years ago

Loss is still high at 3.6495 on the English model and 3.6416 on the mixed one. However, the quality is improving quite a bit. There are a few examples of generated audio that sound just as good as, if not better than, the original.

Based on generated examples while training, both models (from my perspective) do a better job on male speakers than on female speakers. While some of the generated audio sounds excellent, quite a few clips have artifacts (pops, high-pitched tones, static, etc.).

If anyone wants to listen to more generated examples, I will be happy to share them.

frossi65 commented 4 years ago

@sberryman is the Italian language included in this pretrained model?

sberryman commented 4 years ago

@frossi65 The Italian language is only used as part of the encoder training. I did NOT use Italian as part of the synthesizer or vocoder training.

frossi65 commented 4 years ago

@sberryman thanks for your quick answer.

shawwn commented 4 years ago

@sberryman I'd be interested in hearing more samples.

In my experience, target=16000 overlap=800 produces high-quality, pop-free audio. I used it to make Dr. Kleiner sing: https://www.reddit.com/r/HalfLife/comments/d2rzf0/deepfaked_dr_kleiner_sings_i_am_the_very_model_of/
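
For anyone who wants to try those settings with this repo's vocoder wrapper, here is a hedged sketch; the checkpoint path and mel file are hypothetical, and I'm assuming infer_waveform exposes target/overlap keyword arguments as in the version of vocoder/inference.py I've seen (check your checkout).

import numpy as np
from vocoder import inference as vocoder

vocoder.load_model("vocoder/saved_models/pretrained/pretrained.pt")   # hypothetical checkpoint path
mel = np.load("example_synthesizer_output.npy")                       # hypothetical mel spectrogram
wav = vocoder.infer_waveform(mel, target=16000, overlap=800)          # larger target = fewer, longer batch folds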

sberryman commented 4 years ago

@shawwn I've attached the mixed and English results. Personally I think the mixed one sounds better, but I'm not convinced this is a very good model, as the loss is very high. Not sure if @CorentinJ has an opinion on the loss; maybe that is expected? I'm assuming the loss is quite high because the synthesizer never managed to align.

Mixed

mixed_330_331k.zip

English

english_300_301k.zip

CorentinJ commented 4 years ago

Sorry, but I can't quite remember what the loss was like when I trained the models. You could try to continue the training from my model and see what gives. The raw value of the loss itself doesn't hold much meaning until you manage to compare it to a baseline.

shawwn commented 4 years ago

@sberryman Would you be willing to upload your current encoder, synth, and vocoder models? Even if they're not finished training yet, I'd like to experiment with them.

Bonus points if you upload the tensorboard logs too :)

The samples sound promising!

sberryman commented 4 years ago

@shawwn I've uploaded the models to my dropbox. The vocoder is still training and will be for another 24-48 hours. Please share whatever you end up making with them!

Encoder

https://www.dropbox.com/s/xl2wr13nza10850/encoder.zip?dl=0

Synthesizer (Tacotron)

https://www.dropbox.com/s/t7qk0aecpps7842/tacotron.zip?dl=0

Vocoder

https://www.dropbox.com/s/bgzeaid0nuh7val/vocoder.zip?dl=0

sberryman commented 4 years ago

@shawwn - Have you tried the models yet? I was just doing some testing and every voice I tried to clone sounded the same. Wondering if you experienced the same? (They all sounded robotic and female)

My assumption is the synthesizer and vocoder didn't train properly as I'm able to cluster voices using the encoder.

ViktorAlm commented 4 years ago

I've had similar problems with the voice when I've used the wrong encoder for the synthesizer. My test run, where I only did a few steps on each model, was at least able to produce different voices in the direction of the encoded voice. I don't have physical or remote access to my machine at the moment, so I can't see exactly what you mean.

I did some more data preprocessing and my Swedish model seems to work a lot better on male voices now. I have not trained the vocoder on this run yet; I just tried the synthesizer with Griffin-Lim.

sberryman commented 4 years ago

Thanks for the feedback @ViktorAlm! I went back and double-checked that I was using the correct encoder, synthesizer, and vocoder for each path I'm training, and they all sounded the same. It was only a quick test using demo_cli.py.

What preprocessing did you do to help train the synthesizer?

Tiege95 commented 4 years ago

@sberryman Hey, I have an idea that might make it possible to add more voices for training. A possibly large, untapped "dataset" is voice files ripped from video games. The Sounds Resource is probably one of the largest repositories of video game sound effects. You can specifically look up files for character dialogue, most of which is clean audio recorded in a studio (this mainly applies to video games made within the last 20 years).

The only limitation is that these voice clips would only be useful for the encoder since they unfortunately have no alignments. The upside is that there's a large variety of speakers, accents, and even options for Japanese dialogue if the game is from Japan. Most games made in the English-speaking world probably fall under EFIGS (English, French, Italian, German, Spanish) if they have been localized in Europe, so there might be options for those languages as well.

AAA games have the largest amount of voice actors, so it may be of interest to look into games like Skyrim, Fallout 4, GTA V, etc. since there's a large amount of NPC character dialogue.

Also, here are some links that may be helpful for finding new datasets: https://www.cmswire.com/digital-asset-management/9-voice-datasets-you-should-know-about/ https://towardsdatascience.com/a-data-lakes-worth-of-audio-datasets-b45b88cd4ad https://lionbridge.ai/datasets/12-best-audio-datasets-for-machine-learning/ https://skymind.ai/wiki/open-datasets https://voices18.github.io/

An interesting dataset that I found recently is The Spoken Wikipedia Corpora.

sberryman commented 4 years ago

@Tiege95 Thanks so much for sharing; I'll be checking these sources out this evening! Have you attempted to train a model? I would really like to hear about others' experiences with what worked or didn't work.

Tiege95 commented 4 years ago

@sberryman I'm currently unable to experiment with this program since I don't have a computer with the proper specs to run it, but I love reading up on this kind of stuff. I figured that The Sounds Resource, while technically not a dataset made for machine learning applications, is a huge resource of voice recordings. The PC/Computer section alone has ~1000 games to download sounds from (Overwatch, Dragon Ball Xenoverse, Half-Life, etc.). Voice files usually just have the corresponding character's name or are listed under something like "Cutscene Voices".

sberryman commented 4 years ago

@Tiege95 sorry for the 2+ week delay; somehow I missed your message. Any chance you've written a script to download all the voice/speech files from The Sounds Resource? I was looking through it today and there is definitely a lot of clean audio from game characters.

On a side note, I got the flu and decided to let the English model keep training while I was stuck in bed. That model is now up to over 1.5 million steps (768/768 embedding/hidden size and 17,688 speakers) and has been training for almost 28 days.

Then, for fun, I decided to start training a 1024/1024 model with the same 17,688 English speakers plus the remaining 9,744 speakers from a mix of other languages. With a single 1080 Ti training this large embedding model, it is taking quite a long time: it is up to 379k steps over ~7 days of training. The graph isn't complete due to a 12+ hour power outage.

English

[training curve screenshot]

Mixed

[training curve screenshot]

Tiege95 commented 4 years ago

@sberryman Sorry, I don't have a script for that site.