Closed sberryman closed 4 years ago
Great work and great questions! I'll pin this issue for others in need of help.
Firstly, one thing I notice from your profiler output is that you would benefit from a 2x speedup by putting your data on a faster disk (or maybe increasing the number of threads in the DataLoader if you set them too low)
Thanks for the quick reply!
I also noticed the blocking operation taking a long time and found it very strange, as the mel spectrograms are stored on a Samsung 960 EVO 1TB NVMe drive and SpeakerVerificationDataLoader has num_workers=16.
CPU bounces around from about 50-80% utilization and disk is showing 4-18% busy. nvidia-smi is showing low utilization. Maybe I completely glossed over the code where you are reading from the WAV audio files during training? That would explain it, as the WAVs are sitting on a slow spinning disk.
model_embedding_size = 768 in https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/encoder/params_model.py#L4. Would you adjust the model_hidden_size or any other parameters?
Edit: The other thing I thought about for speeding up IO would be stacking the numpy files for each speaker into a single file, as sequential reading is much faster. I would only have to open 10 files per step vs 100. I have plenty of memory in the computer I'm using for training, so maybe that won't be an optimization many others could benefit from?
Edit 2:
I've gone through all the numpy files for each speaker and saved them into a combined file using np.savez, and adjusted the code in encoder/data_objects/speaker.py and encoder/data_objects/utterance.py.
I'm now getting a much more consistent and lower load time for the data. Obviously increasing the embedding size from 256 to 768 has almost tripled the backward pass duration. Funny enough my overall step time has remained about the same but the embedding size tripled. So I consider that a win!
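The per-speaker stacking described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the helper names `combine_speaker_utterances` and `random_partial` are hypothetical (they are not functions from the repository), and each .npy file is assumed to hold one mel spectrogram of shape (frames, mel_channels) with at least `n_frames` frames.

```python
from pathlib import Path

import numpy as np


def combine_speaker_utterances(speaker_dir: Path, out_file: Path) -> int:
    """Pack every .npy file in speaker_dir into a single .npz archive.

    Hypothetical helper: the archive keys are the .npy file stems.
    Returns the number of utterances written.
    """
    arrays = {p.stem: np.load(p) for p in sorted(speaker_dir.glob("*.npy"))}
    np.savez(out_file, **arrays)
    return len(arrays)


def random_partial(npz_file: Path, n_frames: int, rng: np.random.Generator) -> np.ndarray:
    """Pick one random utterance from the archive and crop n_frames from it.

    Assumes every stored utterance has at least n_frames frames.
    """
    with np.load(npz_file) as archive:
        name = str(rng.choice(archive.files))
        frames = archive[name]
    start = int(rng.integers(0, frames.shape[0] - n_frames + 1))
    return frames[start:start + n_frames]
```

With this layout a training step opens one file per speaker instead of one per utterance, while the random choice of utterance and crop keeps variety in the batches.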
Step 1030 Loss: 3.2002 EER: 0.2662 Step time: mean: 871ms std: 58ms
Average execution time over 10 steps:
Blocking, waiting for batch (threaded) (10/10): mean: 103ms std: 26ms
Data to cuda (10/10): mean: 3ms std: 0ms
Forward pass (10/10): mean: 7ms std: 1ms
Loss (10/10): mean: 73ms std: 3ms
Backward pass (10/10): mean: 569ms std: 67ms
Parameter update (10/10): mean: 116ms std: 3ms
Extras (visualizations, saving) (10/10): mean: 1ms std: 4ms
Edit 3: I wasn't happy with the backward pass duration so I made the backwards pass run on the GPU. This is what I'm looking at now...
Step 310 Loss: 3.6576 EER: 0.3275 Step time: mean: 425ms std: 233ms
Average execution time over 10 steps:
Blocking, waiting for batch (threaded) (10/10): mean: 104ms std: 122ms
Data to cuda (10/10): mean: 3ms std: 0ms
Forward pass (10/10): mean: 39ms std: 1ms
Loss (10/10): mean: 23ms std: 1ms
Backward pass (10/10): mean: 80ms std: 5ms
Parameter update (10/10): mean: 121ms std: 2ms
Extras (visualizations, saving) (10/10): mean: 1ms std: 3ms
..........
Step 320 Loss: 3.6723 EER: 0.3339 Step time: mean: 322ms std: 98ms
Average execution time over 10 steps:
Blocking, waiting for batch (threaded) (10/10): mean: 60ms std: 97ms
Data to cuda (10/10): mean: 3ms std: 0ms
Forward pass (10/10): mean: 39ms std: 0ms
Loss (10/10): mean: 22ms std: 1ms
Backward pass (10/10): mean: 77ms std: 4ms
Parameter update (10/10): mean: 121ms std: 2ms
Extras (visualizations, saving) (10/10): mean: 2ms std: 4ms
..........
Step 330 Loss: 3.6419 EER: 0.3309 Step time: mean: 362ms std: 140ms
Average execution time over 10 steps:
Blocking, waiting for batch (threaded) (10/10): mean: 97ms std: 139ms
Data to cuda (10/10): mean: 3ms std: 0ms
Forward pass (10/10): mean: 39ms std: 1ms
Loss (10/10): mean: 24ms std: 3ms
Backward pass (10/10): mean: 78ms std: 4ms
Parameter update (10/10): mean: 121ms std: 1ms
Extras (visualizations, saving) (10/10): mean: 1ms std: 3ms
thank you
There are quite a few ways to gain disk reading speedups for the encoder, but don't forget that you still need variety in the samples/batches. Another bottleneck is the GPU VRAM not being entirely used. Since the complexity of the forward/backward pass is cubic w.r.t the batch size, you would need to put multiple batches in parallel on the same GPU rather than putting a larger batch size. It's something worth looking into.
I had no idea you could specify to run the backward pass on the gpu, how did you do that?
Thanks for the continuous feedback.
## Model parameters:
learning_rate_init: 0.0001
model_embedding_size: 768
model_hidden_size: 256
model_num_layers: 3
speakers_per_batch: 64
utterances_per_speaker: 10
## Data parameters:
audio_norm_target_dBFS: -30
inference_n_frames: 80
mel_n_channels: 40
mel_window_length: 25
mel_window_step: 10
partials_n_frames: 160
sampling_rate: 16000
vad_max_silence_length: 6
vad_moving_average_width: 8
vad_window_length: 30
I trained with ~9,000 speakers (mixed languages but mostly English) through step 352,600 and included the UMAP projections for that below. I then remembered the Common Voice project from Mozilla and downloaded the entire thing. Then I placed all the individual speakers into unique folders and pruned all the speakers that didn't have 10 or more utterances. I then resumed training with the combined datasets bringing the total speakers to 25,668.
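The pruning step mentioned above (dropping speakers with fewer than 10 utterances) can be sketched as follows. This assumes one folder per speaker under a dataset root; the function name, the default threshold argument, and the list of audio extensions are illustrative assumptions, not code from the thread.

```python
from pathlib import Path


def speakers_with_enough_utterances(dataset_root: Path,
                                    min_utterances: int = 10,
                                    extensions=(".wav", ".flac", ".mp3")) -> list:
    """Return the names of speaker folders holding at least min_utterances audio files.

    Hypothetical helper: assumes dataset_root/<speaker>/... with audio files
    anywhere below each speaker folder.
    """
    kept = []
    for speaker_dir in sorted(p for p in dataset_root.iterdir() if p.is_dir()):
        count = sum(1 for f in speaker_dir.rglob("*")
                    if f.is_file() and f.suffix.lower() in extensions)
        if count >= min_utterances:
            kept.append(speaker_dir.name)
    return kept
```

The function only reports which speakers pass the floor; deleting or moving the rejected folders is left to the caller so nothing is removed by accident.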
Thanks but I'll hold off on changing sample rate for now, already adjusting a lot.
I didn't download them from YouTube, they are available for download from TED.com at https://www.ted.com/talks/quick-list?page=1 and the alignments match TEDLIUM-3. The transcripts available from TED are of higher quality than the ones in TEDLIUM-3 dataset but alignments don't match due to the TED splash screen/banner that plays in the beginning.
Sounds good, Fatchord's version it is! Perfect timing as another person using this repository (@TheButlah) has just made a lot of improvements and included multi-gpu training.
The combined npz files have been working great for me, it will load all the utterances per speaker and still uses your same sampling code to grab a random sample per speaker. The only thing I removed is loading from individual npy files.
I assume I changed the backwards pass to GPU; either way the GPU utilization is much higher and the profiler is showing significantly lower mean durations for "Backward pass". I changed the loss_device to run on the GPU.
Then on https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/master/encoder/model.py#L27-L28
self.similarity_weight = nn.Parameter(torch.tensor([10.]).to(loss_device))
self.similarity_bias = nn.Parameter(torch.tensor([-5.]).to(loss_device))
Simply moved the tensor, not the parameter, to the GPU and changed the GPU sync in train.py to:
def sync(device: torch.device):
    # FIXME
    # return
    # For correct profiling (cuda operations are async)
    if device.type == "cuda":
        # torch.cuda.synchronize(device)
        torch.cuda.synchronize()
I'm now up to step 447,200 and included the loss and UMAP to show progress. I also changed the UMAP visualization to show 30 speakers by adding more colors to the color map.
colormap = np.array([
[32, 25, 35],
[255, 255, 255],
[252, 255, 93],
[125, 252, 0],
[14, 196, 52],
[34, 140, 104],
[138, 216, 232],
[35, 91, 84],
[41, 189, 171],
[57, 152, 245],
[55, 41, 79],
[39, 125, 167],
[55, 80, 219],
[242, 32, 32],
[153, 25, 25],
[255, 203, 165],
[230, 143, 102],
[197, 97, 51],
[150, 52, 28],
[99, 40, 25],
[255, 196, 19],
[244, 122, 34],
[47, 42, 160],
[183, 50, 204],
[119, 43, 157],
[240, 124, 171],
[211, 11, 148],
[237, 239, 243],
[195, 165, 180],
[148, 106, 162],
[93, 76, 134],
[0, 0, 0],
[183, 183, 183],
], dtype=float) / 255
Ah, I had put a warning not to compute the loss on GPU because for some reason it wasn't working (either it was some intricacies with torch or I forgot to enable grad on some tensor) and would return None. If that works, then I should update the repo to make it the default and have only 1 device for the encoder.
You are correct, it was not working until I changed the two lines to move the tensor to the GPU, not the parameter. That was all I had to change (I believe; if not, I can dig through all my changes and help you isolate that fix). Technically I changed loss_device to loss_device = device just so I didn't miss anything in train.py. Either way, only one GPU is exposed to my docker container used for training.
Also in the sync function, I had to remove the device parameter and simply use torch.cuda.synchronize()
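A likely explanation for the difference (an assumption on my part, not something confirmed in this thread): calling .to() on an nn.Parameter, when it actually has to convert something, returns a new plain tensor that is no longer a leaf, so its .grad is never populated and it is not registered as a module parameter. Calling .to() on the raw tensor first and wrapping the result keeps a proper leaf Parameter. The sketch below uses a dtype cast as a stand-in for the CPU-to-GPU move so it runs without a GPU.

```python
import torch
from torch import nn

# Conversion applied to the tensor first: the result is a leaf nn.Parameter.
good = nn.Parameter(torch.tensor([10.0]).to(torch.float64))

# Conversion applied to the Parameter: .to() returns a new, non-leaf plain tensor.
bad = nn.Parameter(torch.tensor([10.0])).to(torch.float64)

(good * 2).sum().backward()
(bad * 2).sum().backward()

print(type(good).__name__, good.is_leaf, good.grad is not None)
print(type(bad).__name__, bad.is_leaf)
```

Note that .to() returns the same object when no conversion is needed, which is why the second form can appear to work as long as everything stays on one device.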
Clusters are getting tighter but I plan on training until at least 700-900k steps. I'm also tempted to train an English only model to compare.
@sberryman will you be submitting a pull request? I'd be very interested to see the results using more data for the speaker encoder - the GE2E paper demonstrated that having more data for the encoder is critical to getting the similarity of the cloned speaker close to the original.
Also in my own experience, the compatibility of Fatchord's Taco1 with WaveRNN makes it a great candidate, and the codebase is easy to work with. I still believe that Taco2 would be an upgrade in terms of quality of the inflection of the speaker, but the out-of-the-box compatibility of Fatchord's synthesizer with the vocoder makes it a natural choice.
Do note that Fatchord's synthesizer does not support multiple speakers, so you would need to add that capability yourself (and a PR on Fatchord's repo would be especially appreciated for adding that capability :) )
I'm also very interested in the results. I'm currently training the encoder on about 2k speakers in Swedish and about 4k mixed, mainly English. I would really like to see examples from your encoder model on multiple languages to see if it's worth crawling radio and TV shows with Resemblyzer's diarization to create a fully Swedish dataset, or if 6k with 1/3 being Swedish can compare to 25k mixed, mainly English, for Swedish voice cloning. My hunch is m0ar data
I'm at ~700k steps and still quite a few tight clusters, not sure if this is due to the fact that I trained for 350k steps on 9,000 speakers prior to adding 16,668 more speakers (which also introduced quite a few more languages) I'm going to continue training for another 200k steps which will be done this time tomorrow morning.
First, thanks for the massive PR that landed on Fatchord's WaveRNN 4 days ago, really excited you added multi-gpu training and mels in numpy format! To your question on a PR, I can certainly submit PRs to this repo and WaveRNN. The code to utilize most of the datasets from OpenSLR and Common Voice is a bit of a hack, but if people want it I'm open to working on a PR for that as well.
Thanks for the feedback on Taco1 and WaveRNN from Fatchords repo, that will be the route I will go. I will most likely run into issues adding multi-speaker but I will start an issue in that repo when I get there.
Great to hear about someone else testing multiple languages! Have you changed any of the data or model parameters? Funny you mentioned using Resemble's diarization as I've had a tab open to that code for a few days and planned on using it against 7,000 hours of local (English) news video I have. That is once I finished training a new model.
As far as sharing the models I'm training, I'm open to it. Here is the model trained to 697,500 steps (768 model embedding size and 256 hidden layers.) https://www.dropbox.com/s/2b5g2rt4vypx9qq/cv_run_bak_697500.pt?dl=0
Would be interested to know how it performs against your Swedish data @ViktorAlm.
Thanks! I have not changed any params. I was on step 150k with my data, trying to do a real run with all the models. Before that I did one where I only did 100k steps on each model, with about 900 Swedish speakers and about 90 GB of data in total. It did not clone the voice but produced good audio quality, and at least a male voice came out when I ran my own voice. I paused it and did a quick test with yours, and the encoding result is way better than the small test run I did.
Swedish and Norwegian are pretty similar. I didn't see any specific Swedish/Norwegian cluster gathering, but I only did two tests and UMAP might remove any visible difference, I guess.
Here's a converter if you wish to add Norwegian, Danish and Swedish data to your mix: https://github.com/ViktorAlm/Nasjonalbank-converter
I also added some results from your encoder in /Results.
When I've played around a bit more I might make a script that evaluates different languages better.
@ViktorAlm Thanks for sharing!
Is your Swedish and Norwegian dataset private? I'm up for including those speakers in the next training run where I use 768 for hidden/embedding size if you can share. There are only 20 Swedish voices in the 25,668 speakers I am training on and zero Norwegian. Common voice had 44 speakers for Swedish but I filtered those down to 20 as I had a floor of 12 unique utterances per speaker.
If anyone else is aware of other datasets I can include please let me know!
Nice!
I edited my old comment because I did not want to clutter your thread with my bad screenshots. I added my converter with links to the datasets. It's very hacky, and if you want to add them I really should clean up the code some. I think a simple merge of the folders, then looping through to get the spls (files with info on location etc.) and loading the files, would be the best way instead of my weird way of scanning the folders. I was testing on just one of the extracted folders, and the speech folders did not contain the wavs that were specified in the spl file. Then everything went weird from there.
Just in case this wasn't clear, Resemblyzer is also my project and is merely an interface to the speaker encoder of this repo. You can replace the pretrained model in the package and put yours instead. I could also distribute models that you provide me for other languages.
I also would like to leave my script for evaluating the EER over the test set. It's not clean and I'm not sure if it's correct either (given that you won't find anywhere the right procedure to evaluate the EER over a dataset). You should use this if you want to formally evaluate the performance of the speaker encoder.
If someone manages to make it better then I would gladly include it in the repo
from encoder.data_objects import SpeakerVerificationDataLoader, SpeakerVerificationDataset
from encoder.model import SpeakerEncoder
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
import torch

# This is my script for computing the test EER.
dataset_root = r"E:\Datasets\SV2TTS\encoder_test"

if __name__ == '__main__':
    speakers_per_batch = 32
    steps = 100
    dataset = SpeakerVerificationDataset(Path(dataset_root))
    model = SpeakerEncoder(torch.device("cuda"), torch.device("cpu"))
    checkpoint = torch.load("saved_models/pretrained.pt")
    model.load_state_dict(checkpoint["model_state"])
    model.eval()

    # Numbers of utterances per speaker to evaluate. The plot reuses this range
    # so that the x axis always has one point per entry in results.
    utterance_counts = range(6, 8)
    results = []
    for utterances_per_speaker in utterance_counts:
        loader = SpeakerVerificationDataLoader(
            dataset,
            speakers_per_batch=speakers_per_batch,
            utterances_per_speaker=utterances_per_speaker,
            num_workers=8,
        )
        with torch.no_grad():
            eers = []
            for step, speaker_batch in zip(range(1, steps + 1), loader):
                inputs = torch.from_numpy(speaker_batch.data).cuda()
                embeds = model(inputs)
                # The loss (and EER) is computed on the CPU, so move the embeddings there.
                embeds_loss = embeds.view((speakers_per_batch, utterances_per_speaker, -1)).cpu()
                _, eer = model.loss(embeds_loss)
                eers.append(eer)
                print("Step %d EER: %.3f" % (step, np.mean(eers)))
            results.append(np.mean(eers))

    plt.plot(utterance_counts, results)
    plt.xlabel("Enrollment utterances")
    plt.ylabel("Equal Error Rate")
    plt.show()
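For readers who want to see what an equal error rate is in isolation, here is a minimal pure-Python sketch. It is not the procedure used by model.loss in the script above and makes no claim to be the "right" one: it simply sweeps every observed score as a threshold and returns the point where the false acceptance and false rejection rates are closest. The function name and the convention that label 1 means "same speaker" are assumptions for illustration.

```python
def equal_error_rate(scores, labels):
    """Approximate the EER by sweeping each observed score as a threshold.

    scores: similarity scores, higher means "more likely the same speaker".
    labels: 1 for same-speaker pairs, 0 for different-speaker pairs.
    """
    genuine = [s for s, l in zip(scores, labels) if l == 1]
    impostor = [s for s, l in zip(scores, labels) if l == 0]
    best_gap, eer = None, None
    for thr in sorted(set(scores)):
        # False acceptance: impostor pairs scoring at or above the threshold.
        far = sum(s >= thr for s in impostor) / len(impostor)
        # False rejection: genuine pairs scoring below the threshold.
        frr = sum(s < thr for s in genuine) / len(genuine)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer


print(equal_error_rate([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 0.0, fully separated
print(equal_error_rate([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.5, one pair swapped
```

A threshold sweep like this is coarse on small sets; interpolating the ROC curve gives a smoother estimate when many scores are available.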
Also I don't know about that:
I've reduced the learning rate from 1e-4 to 1e-5 on the mixed dataset which seems to help. I'll probably drop it down to 1e-6 around step 800-850k.
I've left my lr to 1e-4 all along, I think you should be fine with that same value as well
Don't forget that I never managed to fully train my speaker encoder. I trained it for 1M steps but the authors of sv2tts trained it for 50M steps. You should aim for more if you can.
Thanks @CorentinJ
Well aware Resemblyzer is your project, that is how I ended up finding it. Thanks for open sourcing that project as well. Looking forward to seeing what your next project is!
Thanks for the test script, I was thinking about how I was going to evaluate the models I'm training and would be great to compare these to your public model. Originally I was just going to plot a random 5-10 utterances for every single speaker to get an idea of the overall distribution.
Interesting on not adjusting the learning rate; I'm more accustomed to training image classification models where reducing/decaying the learning rate is almost a requirement. I will not adjust the learning rate any further then.
I was not aware the SV2TTS authors trained for 50M steps, obviously it is time for me to read their paper.
Also, this is turning into more of a discussion than an "issue". I'm happy to move it to another location or can continue using GitHub issues; completely up to you.
Thanks again!
Nah it's common for issues to serve a broader purpose than just solving bugs. I don't decay the learning rate simply because it's not a necessity with Adam. The original authors did not use Adam and they did decay the learning rate by the way. Also, you will have to read GE2E to know more about the speaker encoder, because there isn't much info in SV2TTS about how they train or evaluate it.
@sberryman Shaun, would be awesome if you'll create PR. If you don't feel it's polished enough, just mark it WIP. So it wouldn't be merged, but will be just an inspiration for others :)
@slavaGanzin I have pushed my work in progress to my own fork. There are hard coded paths and changes related to grouping all the .npy files into a single .npz for each speaker. I also use docker and volume mappings so I left the basic Dockerfile in there. I don't plan on ever submitting a PR for that branch as I'm still experimenting quite heavily. Basically, feel free to use any of the scripts as a starting point but don't count on them working out of the box.
https://github.com/sberryman/Real-Time-Voice-Cloning/tree/wip
Model trained to 1,005,000 is available on my dropbox account now. https://www.dropbox.com/s/69wv21ajt6l2pag/cv_run_bak_1005000.pt?dl=0
Hi sberryman, may I ask which languages the trained model on your Dropbox supports?
I need Chinese pretrained models for a project in grad school. Can you guide me on that?
@Jessicamat777 the models I have uploaded to drop box are all for experimentation and I have NOT trained the synthesizer or vocoder on them yet. So they will be of little value unless you wanted to use them with CorentinJ's Resemblyzer.
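That being said, the models on dropbox are from the following datasets.
- LibriTTS https://ai.google/tools/datasets/libri-tts/ (train-other-500)
- VoxCeleb1 http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
- VoxCeleb2 http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html
- OpenSLR http://www.openslr.org/resources.php (42-44, 61-66, 69-80)
- VCTK https://datashare.is.ed.ac.uk/handle/10283/2651
- Common Voice https://voice.mozilla.org/en/datasets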
A vast majority of the speakers are English. Based on a very tiny sampling against languages it has NOT been trained on, it doesn't appear the foreign speakers make much of a difference. That is most likely due to the unbalanced training set and extremely small number of speakers per additional language. I just wanted to see if it made a difference including foreign languages while training. Meaning the clusters for foreign languages are okay but nowhere near as well defined as English speakers.
Look at this issue where I show how my model(s) perform against the one trained by CorentinJ on Swedish and Norwegian. https://github.com/resemble-ai/Resemblyzer/issues/9
I haven't made an effort to train on Chinese but it shouldn't be difficult if you have enough data. CorentinJ has done a great job of documenting the training process and answering questions on what size dataset you would need to train from scratch.
Thanks for replying.
Can I use multiple GPUs to train the encoder and then combine the results into one model at the end? I would like to save training time that way.
Please let me know.
On Mon, 16 Sep 2019, 21:08 Shaun Berryman, notifications@github.com wrote:
@Jessicamat777 https://github.com/Jessicamat777 the models I have uploaded to drop box are all for experimentation and I have NOT trained the synthesizer or vocoder on them yet. So they will be on little value unless you wanted to use them with CorentinJ's Resemblyzer.
That being said, the models on dropbox are from the following datasets.
- LibriTTS https://ai.google/tools/datasets/libri-tts/ (train-other-500)
- VoxCeleb1 http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
- VoxCeleb2 http://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox2.html
- OpenSLR http://www.openslr.org/resources.php (42-44, 61-66, 69-80)
- VCTK https://datashare.is.ed.ac.uk/handle/10283/2651
- Common Voice https://voice.mozilla.org/en/datasets
A vast majority of the speakers are English. Based on a very tiny sampling against languages it has NOT been trained on, it doesn't appear the foreign speakers make much of a difference. That is most likely due to the unbalanced training set, I didn't make any effort to balance. Just wanted to see if it made a difference including foreign languages while training. Meaning the clusters for foreign languages are okay but nowhere near as well defined as English speakers.
Look at this issue where I show how my model(s) perform against the one trained by CorentinJ on Swedish and Norwegian. resemble-ai/Resemblyzer#9 https://github.com/resemble-ai/Resemblyzer/issues/9
I haven't made an effort to train on Chinese but it shouldn't be difficult if you have enough data. CorentinJ has done a great job of documenting the training process and answering questions on what size dataset you would need to train from scratch.
@Jessicamat777 multi-gpu training is NOT implemented. If you do implement it, can you please submit a pull-request to this repository so others can benefit?
Training is still progressing on the mixed and english models. This is just to update anyone if they are following this issue.
Training the encoder is interesting, but I'm not entirely convinced that the problem is the encoder. (Where "problem" is defined as "the current model has a lot of trouble reproducing female voices accurately.")
Are we certain that for every possible human voice, there exists an embedding which allows tacotron2 to produce spectrograms indistinguishable from that voice?
If not, then it seems beneficial if tacotron2 were trained on the new diverse speech dataset in addition to the encoder.
For example, in my experiments it has seemed impossible to generate spectrograms with cartoon-style inflections: lots of expressive vocalizations, rapid pitch changes, and so on.
If that's how a speaker sounds normally, then it seems like it's impossible for the encoder to generate any latent vector that would cause tacotron2 to produce spectrograms that sound anything like the speaker.
Perhaps I am confused, but just to confirm: there are three separate things that need to be trained, right? The encoder, the synthesizer (text to spectrogram), and the vocoder (spectrogram to wav). This training process is focusing entirely on the encoder. How is the loss being calculated? If the loss is calculated in terms of "tacotron2 is able to generate spectrograms that sound more like this speaker," then the training here will not have a huge impact on overall quality or diversity. The training would need to be done on the synth, then the encoder.
Do I have this backwards? Is it true that the encoder's final quality is bounded by the expressiveness of the synth? If that's correct, then the synth is what would benefit from the larger dataset.
Training the encoder is interesting, but I'm not entirely convinced that the problem is the encoder. (Where "problem" is defined as "the current model has a lot of trouble reproducing female voices accurately.")
It's not intuitive, I agree. However, this is clearly the conclusion the authors of the sv2tts paper reached. They argue that most of the ability to clone voices lies in the training of the encoder. They also clearly show that the framework has limitations (which we observe in this repo as well):
An additional limitation lies in the model’s inability to transfer accents. Given sufficient training data, this could be addressed by conditioning the synthesizer on independent speaker and accent embeddings. Finally, we note that the model is also not able to completely isolate the speaker voice from the prosody of the reference audio, ...
If you give a listen to their librispeech samples, you will notice that as well.
Training updates
I've stopped training both the mixed and English encoders, the mixed encoder reached just over 2.1 million steps with 27,432 speakers.
Since I'm using LibriTTS I had to make some changes to the code base. First I used Montreal forced aligner to come up with the alignments. Then I realized google already normalized the audio and removed the leading and trailing silence. So at this point I just skipped the alignment portion of preprocessing and use the original transcript (as opposed to the normalized which is also provided) with all punctuation and capitalization left in place. I know the English cleaner converts everything to lowercase though.
I started training last night across two GTX 1080 Ti's and GPU utilization bounces between 20% and 93%.
Step 27753 [1.664 sec/step, loss=0.68117, avg_loss=0.67622]
Step 27754 [1.690 sec/step, loss=0.64809, avg_loss=0.67585]
Step 27755 [1.687 sec/step, loss=0.68754, avg_loss=0.67603]
Step 27756 [1.686 sec/step, loss=0.67575, avg_loss=0.67593]
Step 27757 [1.675 sec/step, loss=0.65758, avg_loss=0.67573]
Step 27758 [1.684 sec/step, loss=0.66391, avg_loss=0.67550]
Step 27759 [1.687 sec/step, loss=0.66689, avg_loss=0.67528]
Step 27760 [1.710 sec/step, loss=0.66279, avg_loss=0.67525]
Step 27761 [1.681 sec/step, loss=0.69119, avg_loss=0.67565]
Step 27762 [1.679 sec/step, loss=0.67129, avg_loss=0.67552]
Step 27763 [1.677 sec/step, loss=0.69174, avg_loss=0.67563]
Step 27764 [1.693 sec/step, loss=0.65657, avg_loss=0.67544]
Step 27765 [1.692 sec/step, loss=0.66381, avg_loss=0.67518]
Step 27766 [1.672 sec/step, loss=0.70290, avg_loss=0.67546]
max_gradient_norm, stop_token_loss and regularization_loss to be increasing? Basically, do the tensorboard plots look okay?
I don't know about tensorboard, I didn't use it back then. As for the number of steps, you can check the pretrained models page.
Thanks @CorentinJ, somehow I've missed the pretrained Wiki page. FYI, I still plan on figuring out fatchord/WaveRNN but I wanted a baseline version using your codebase.
This is a fun exercise, thanks for your patience!
Synthesizer training is ongoing but I'm running into the same issues @CorentinJ ran into with LibriTTS where it fails to align. Since I skipped splitting on silence and noise reduction code I guess I'm not too surprised. What I am wondering is what is the impact of failing to align? The spectrograms and the wav files generated while training are easily distinguishable.
Edit: Since it is failing to align, is it worth training the vocoder or would you suggest I continue training the synthesizer for a few more days/week to see if it improves?
Training update
I've stopped training synthesizers for both the English and mixed datasets.
Started training a vocoder for each of the synthesizer models using the default hyper parameters with the following overrides:
Stdout:
{| Epoch: 1 (1158/1158) | Loss: 4.6526 | 1.4 steps/s | Step: 1k | }
{| Epoch: 2 (1158/1158) | Loss: 4.1365 | 1.4 steps/s | Step: 2k | }
{| Epoch: 3 (1158/1158) | Loss: 4.0376 | 1.4 steps/s | Step: 3k | }
...
{| Epoch: 75 (1158/1158) | Loss: 3.6903 | 1.4 steps/s | Step: 86k | }
{| Epoch: 76 (1158/1158) | Loss: 3.6877 | 1.4 steps/s | Step: 88k | }
{| Epoch: 77 (1158/1158) | Loss: 3.6839 | 1.4 steps/s | Step: 89k | }
Included files:
Stdout:
{| Epoch: 1 (1808/1808) | Loss: 4.5359 | 1.4 steps/s | Step: 1k | }
{| Epoch: 2 (1808/1808) | Loss: 4.0721 | 1.4 steps/s | Step: 3k | }
{| Epoch: 3 (1808/1808) | Loss: 3.9830 | 1.4 steps/s | Step: 5k | }
...
{| Epoch: 30 (1808/1808) | Loss: 3.7225 | 1.4 steps/s | Step: 54k | }
{| Epoch: 31 (1808/1808) | Loss: 3.7228 | 1.4 steps/s | Step: 56k | }
{| Epoch: 32 (1808/1808) | Loss: 3.7173 | 1.4 steps/s | Step: 57k | }
Included files:
Overall I would say the vocoders are starting to sound okay and it appears they are working without the synthesizer aligning. According to the Pretrained models you trained the vocoder for 428k steps. I'll let these two models train until a similar target number of steps.
Loss is still high at 3.6495 on the english model and 3.6416 on mixed. However, the quality is improving quite a bit. There are a few examples of generated audio that sounds just as good if not better than the original.
Based on generated examples while training, both models (from my perspective) do a better job on male than female speakers. While some of the generated audio sounds excellent, there are quite a few that have artifacts (pops, high-pitched, static, etc).
If anyone wants to listen to more generated examples, I will be happy to share them.
@sberryman is the italian language included in this pretrained model?
@frossi65 The Italian language is only used as part of the encoder training. I did NOT use Italian as part of the synthesizer or vocoder training.
@sberryman thanks for your quick answer.
@sberryman I'd be interested in hearing more samples.
In my experience, target=16000 overlap=800 produces high quality pop-free audio. I used it to make Dr. Kleiner sing: https://www.reddit.com/r/HalfLife/comments/d2rzf0/deepfaked_dr_kleiner_sings_i_am_the_very_model_of/
@shawwn I've attached the mixed and English results. Personally the mixed sounds better but I'm not convinced this is a very good model as the loss is very high. Not sure if @CorentinJ has an opinion on the loss, maybe that is expected? I'm assuming the loss is quite high as the synthesizer never managed to align.
Sorry but I can't quite remember what the loss was like when I trained the models. You could try to continue the training with my model and see what gives. The raw value of the loss itself doesn't hold much meaning until you manage to compare it to a baseline.
@sberryman Would you be willing to upload your current encoder, synth, and vocoder models? Even if it's not finished training yet, I'd like to experiment with them.
Bonus points if you upload the tensorboard logs too :)
The samples sound promising!
@shawwn I've uploaded the models to my dropbox. The vocoder is still training and will be for another 24-48 hours. Please share whatever you end up making with them!
https://www.dropbox.com/s/xl2wr13nza10850/encoder.zip?dl=0
https://www.dropbox.com/s/t7qk0aecpps7842/tacotron.zip?dl=0
@shawwn - Have you tried the models yet? I was just doing some testing and every voice I tried to clone sounded the same. Wondering if you experienced the same? (They all sounded robotic and female)
My assumption is the synthesizer and vocoder didn't train properly as I'm able to cluster voices using the encoder.
I've had similar problems with the voice when I've used the wrong encoder for the synthesizer. My test run, where I only did a few steps on each model, was able to produce different voices, at least in the direction of the encoded voice. I don't have physical or remote access to my machine at the moment, so I can't see exactly what you mean.
I did some more data preprocessing and my Swedish model seems to work a lot better on male voices now. I have not trained the vocoder yet on this run, just tried the synthesizer with Griffin-Lim.
Thanks for the feedback @ViktorAlm! I went back and double checked that I was using the correct encoder, synthesizer and vocoder for each path I'm training and they all sounded the same. It was only a quick test using demo_cli.py
What preprocessing did you do to help train the synthesizer?
@sberryman Hey, I have an idea that might make it possible to add more voices for training. A possibly large, untapped "dataset" is voice files ripped from video games. The Sounds Resource is probably one of the largest repositories of video game sound effects. You can just specially look up files for character dialogue, most of which is clean audio recorded in a studio (this mainly applies to video games made within the last 20 years).
The only limitation is that these voice clips would only be useful for the encoder since they unfortunately have no alignments. The upside is that there's a large variety of speakers, accents, and even options for Japanese dialogue if the game is from Japan. Most games made in the English-speaking world probably fall under EFIGS (English, French, Italian, German, Spanish) if they have been localized in Europe, so there might be options for those languages as well.
AAA games have the largest amount of voice actors, so it may be of interest to look into games like Skyrim, Fallout 4, GTA V, etc. since there's a large amount of NPC character dialogue.
Also, here are some links that may be helpful for finding new datasets: https://www.cmswire.com/digital-asset-management/9-voice-datasets-you-should-know-about/ https://towardsdatascience.com/a-data-lakes-worth-of-audio-datasets-b45b88cd4ad https://lionbridge.ai/datasets/12-best-audio-datasets-for-machine-learning/ https://skymind.ai/wiki/open-datasets https://voices18.github.io/
An interesting dataset that I found recently is The Spoken Wikipedia Corpora.
@Tiege95 Thanks so much for sharing; I'll be checking these sources out this evening! Have you attempted to train a model? Would really like to hear others experience on what worked or didn't work.
@sberryman I'm currently unable to experiment with this program since I don't have a computer with the proper specs to run it, but I love reading up on this kind of stuff. I figured that The Sounds Resource, while it's technically not a dataset made for machine learning applications, is a huge resource of voice recordings. The PC/Computer section alone has ~1000 games to download sounds from (Overwatch, Dragon Ball Xenoverse, Half-Life, etc.). Voice files usually just have the corresponding character's name or are listed under something like "Cutscene Voices".
@Tiege95 sorry for the 2+ week delay, somehow I missed your message. Any chance you've written a script to download all the voice/speech files from Sounds Resource? I was looking through it today and definitely a lot of clean audio from game characters.
On a side note, I got the flu and decided to let the English model keep training while stuck in bed. That model is up to over 1.5 million steps now. (768/768 embedding/hidden size and 17,688 speakers) This model has been training for almost 28 days now.
Then for fun I decided to start training a 1024/1024 model with same 17,688 English speakers and the remaining 9,744 mix of other languages. With a single 1080 TI training the large embedding model it is taking quite a long time. Up to 379k steps over ~7 days of training. The graph isn't complete due to a 12+ hour power outage.
@sberryman Sorry, I don't have a script for that site.
Thanks for publishing the code and basic training instructions!
Environment
Datasets: (9,063 speakers)
I'm working on adding TEDLIUM_release-3 which would add 1,925 new speakers and potentially SLR68 which would add 1,017 Chinese speakers but would require some clean up as there is a lot of silence in the audio files.
Hyper Parameters: Left all parameters untouched.
Encoder training:
39,300 steps:
115,900 steps: (almost exactly 24 hours of training)
Typical step
Questions