@mbdash I just noticed this. This would be a really nice contribution if you are up for it!
On the pretrained models page it says the synthesizer was trained in a week on 4 GPUs (1080ti). If you are not willing to tie up your GPU for a full month, it will still be helpful if you can get to a partially-trained model that has intelligible speech so others can continue training and finetuning.
python synthesizer_preprocess_audio.py path/to/datasets_folder --no_alignments --datasets_name LibriTTS
python synthesizer_preprocess_embeds.py path/to/datasets_folder/SV2TTS/synthesizer
python synthesizer_train.py new_model_name path/to/datasets_folder/SV2TTS/synthesizer
You can quit and resume training at any time, though you will lose all progress since the last checkpoint. It will be interesting to see how well it does with default hparams.
From what I understand, LibriTTS offers several advantages over LibriSpeech, notably 24 kHz audio and transcripts that retain punctuation.
We should consider updating the hparams so we can ultimately generate 24 kHz audio from this dataset.
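As a rough sketch of what that might involve (treat these hparam names and values as assumptions since the exact names differ between the old and new synthesizer code; the values keep the usual 12.5 ms hop and 50 ms window):

# Hypothetical 24 kHz settings -- adjust names to whatever the hparams file actually uses
sample_rate = 24000
n_fft = 2048     # next power of two above win_size
hop_size = 300   # 12.5 ms * 24000 samples/s
win_size = 1200  # 50 ms * 24000 samples/s
fmax = 12000     # upper mel band, at most the Nyquist frequency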
@CorentinJ also suggests reducing the max allowable utterance duration (these hparams are used in synthesizer/preprocess.py): https://github.com/CorentinJ/Real-Time-Voice-Cloning/blob/054f16ecc186d8d4fa280a890a67418e6b9667a8/synthesizer/hparams.py#L95-L103
I don't have any solutions for the other suggestions mentioned (switching attention paradigm, removing speakers with bad prosody): https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/364#issuecomment-653996443
Ok, I will sync LibriTTS overnight, try to set this up over the weekend and get the GPU working on it.
Update 2020-07-25 22h20 EST:
Step 5 (generate mel spectrograms for training) is currently at 25%; all CPUs available to the VM are at full load.
For posterity, note a typo in the step 5 command: the flag should be "--datasets_name" (the "s" was missing). The corrected command:
python synthesizer_preprocess_audio.py ~/rtvc_LibriTTS/datasets --no_alignments --datasets_name LibriTTS
Thanks for the update and correction.
Let's run training with the default hparams. We're already switching from LibriSpeech to LibriTTS and it's best to only change one parameter at a time.
Hi, I have an error because synthesizer_preprocess_embeds.py wants a pretrained model?
I fail to understand why we need to provide a pretrained model when trying to train from scratch, but I will stick in the latest pretrained model until told otherwise.
(rtvc_py373) username@vm:~/github/Real-Time-Voice-Cloning$ python synthesizer_preprocess_embeds.py /mnt/nfs/a_share/rtvc_LibriTTS/datasets/SV2TTS/synthesizer/
Arguments:
synthesizer_root: /mnt/nfs/a_share/rtvc_LibriTTS/datasets/SV2TTS/synthesizer
encoder_model_fpath: encoder/saved_models/pretrained.pt
n_processes: 4
Embedding: 0%| | 0/111521 [00:02<?, ?utterances/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/home/username/github/Real-Time-Voice-Cloning/synthesizer/preprocess.py", line 228, in embed_utterance
encoder.load_model(encoder_model_fpath)
File "/home/username/github/Real-Time-Voice-Cloning/encoder/inference.py", line 33, in load_model
checkpoint = torch.load(weights_fpath, _device)
File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/site-packages/torch/serialization.py", line 384, in load
f = f.open('rb')
File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/pathlib.py", line 1186, in open
opener=self._opener)
File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/pathlib.py", line 1039, in _opener
return self._accessor.open(self, flags, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "synthesizer_preprocess_embeds.py", line 25, in <module>
create_embeddings(**vars(args))
File "/home/username/github/Real-Time-Voice-Cloning/synthesizer/preprocess.py", line 254, in create_embeddings
list(tqdm(job, "Embedding", len(fpaths), unit="utterances"))
File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/site-packages/tqdm/std.py", line 1130, in __iter__
for obj in iterable:
File "/opt/miniconda3/envs/rtvc_py373/lib/python3.7/multiprocessing/pool.py", line 748, in next
raise value
FileNotFoundError: [Errno 2] No such file or directory: 'encoder/saved_models/pretrained.pt'
@mbdash Look at the middle part of the image here and hopefully it will make more sense why the pretrained encoder model is needed to generate embeddings for synthesizer training: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/30#issuecomment-508381648 Please speak up if it still doesn't make sense.
Think of the synthesizer as a black box with 2 inputs: an embedding, and text to synthesize. Different speakers sound different even when speaking the same text. The synthesizer uses the embedding to impart that voice information in the mel spectrogram it produces as output. The synthesizer gets the embedding from the encoder, which in turn can be thought of as a black box that turns a speaker's wav data into an embedding.
So you need to run the encoder model to get the embedding, and you get the error message because it can't find the model.
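For reference, this is roughly what the preprocessing script does with that encoder model under the hood (a minimal sketch using the repo's encoder inference module; the wav path is a placeholder):

from pathlib import Path
from encoder import inference as encoder

encoder.load_model(Path("encoder/saved_models/pretrained.pt"))  # the file the traceback says is missing
wav = encoder.preprocess_wav(Path("path/to/some_utterance.wav"))
embed = encoder.embed_utterance(wav)  # the speaker embedding that conditions the synthesizer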
Ok, great, if you tell me it is working as designed I will continue. It is currently at 50% embedding.
I opened the image but I need slightly more coffee to really look at it ;-)
thx for the quick response.
ok I started synthesizer_train.py and it is @ step 250 now @ 2020 07 26 12H24 EST
Wow that is fast. At that rate it will take just over 4 days to reach the 278k steps in the current model. And it will train even faster as the model gets better. Please share some griffin-lim wavs when they become intelligible.
step 2850 @ 13H15 EST so approx 2500 steps in ~1h
Generated 64 train batches of size 36 in 21.814 sec
This seems to be a bottleneck, is the data on an external drive? I'm averaging about 14 sec for batch generation on a slow CPU but the data lives on a SSD.
latest @ 14h25:
My setup is not optimal. It is currently residing on the HDD side of my array; I just added a new SSD but it is not being used atm. When I stop the training, I will move the data to a share living on the SSD or even a passthrough NVMe.
If that's a typical batch generation time now, 2.3 sec for 64 batches is just 0.036 sec per step or 1 hour over 100,000 steps. Not worth it to transfer the data over to the SSD in my opinion.
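For anyone following along, the arithmetic behind that estimate (a quick sanity check, assuming one generated batch feeds one training step):

batch_gen_time = 2.3                       # seconds to generate 64 batches
overhead_per_step = batch_gen_time / 64    # ~0.036 s of batch generation per step
print(overhead_per_step * 100_000 / 3600)  # ~1 hour of overhead over 100k steps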
step 10k reached @ 15h30 so we can estimate ~10k steps / 3h
Where are the wavs you want me to share located?
When I try to ls datasets/SV2TTS/synthesizer/audio my terminal hangs.
Where are the wavs you want me to share located?
Check out the training logs area: synthesizer/saved_models/logs-new_model_name/wavs
The files in the plots folder are also interesting and show how well the new synthesizer model is working.
rtvc_libritts_s_mdl @ 10k steps
Cheers! rtvc_libritts_s_mdl_10k.zip
Overall, the synthesizer training seems to be progressing nicely! I'll be interested to see as many plots and wavs as you care to share, but otherwise it's a lot of waiting now.
It would be nice if you can share in-work checkpoints, say starting at 100k and every 50k steps after that. Or generate some samples using the toolbox. I've never trained from the start and it would be interesting to see the progression.
rtvc_libritts_s_mdl @ 20k steps in ~6h
I used the original pretrained models (hereafter, LibriSpeech_278k) to synthesize the same utterance as the 20k example, also inverting it with Griffin-Lim. The clarity is about the same but there is less harshness with LibriSpeech_278k (not sure what the correct technical term for that is).
"When he spoke of the execution he wanted to pass over the horrible details, but Natasha insisted that he should not omit anything."
You can definitely hear more of a pause after "details" in the 20k wav so the new model is learning how to deal with punctuation!
rtvc_libritts_s_mdl @ 74k steps in ~21h
@mbdash From that batch I find the 50k sample remarkable. Your LibriTTS-based model is much closer to the ground truth, capturing the effect of the 3 commas and question mark on prosody.
For this one clip I say your model performs better than LibriSpeech_278k but it will be interesting to see how well the model generalizes to new voices (embeddings) unseen during training.
As they sat thus something brushed against peter as light as a kiss, and stayed there, as if saying timidly, "Can I be of any use?"
Yes, I keep listening to them, paying attention to details, and I can clearly hear the TTS using the punctuation.
How long does it take to run each step now? Clearly it is progressing faster than 1.3-1.4 sec/step that is in the screenshot from yesterday.
I don't think the numbers are very accurate.
I try counting Mississippis but the steps print faster than that, sometimes in quick bursts.
It is a moving average of the last 100 steps:
102k reached in approx ~30h, I think.
Can you make a backup of the 100k model checkpoint (or one that is in this range)? Just in case we want to come back to it later.
Is the average loss still coming down? Perhaps it converges much faster with LibriTTS. When I did the single-speaker finetuning on LibriSpeech p211 the synthesizer loss started at 0.70, and you are already in the 0.60-0.65 range.
Which files do you want me to back up so I don't mess this up? I don't want to lose any of that work. (117k now) I'll zip it and share.
The files are in synthesizer/saved_models/logs-new_model_name/taco_pretrained.
What we need is:
tacotron_model.ckpt-######.data-00000-of-00001
tacotron_model.ckpt-######.meta
tacotron_model.ckpt-######.index
checkpoint
Every time it reaches a new checkpoint interval it overwrites the oldest checkpoint. It's good to keep a few intermediate checkpoints in case something gets messed up along the way.
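If it helps, a small Python helper for the backup could look like this (the run name, step number and backup path are placeholders, assuming the layout above):

import shutil
from pathlib import Path

ckpt_dir = Path("synthesizer/saved_models/logs-new_model_name/taco_pretrained")
backup_dir = Path("backups/ckpt_100k")
backup_dir.mkdir(parents=True, exist_ok=True)

# Copy the three files of one checkpoint plus the "checkpoint" index file
for f in list(ckpt_dir.glob("tacotron_model.ckpt-100000.*")) + [ckpt_dir / "checkpoint"]:
    shutil.copy2(f, backup_dir / f.name)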
For the next synth model, I will update the code to include a few user-defined custom embedding parameters that are concatenated with the speaker embedding. These would all default to zero, but could be used to represent things like language or accent to facilitate fine-tuning and perhaps speed up training if the classification is known.
Currently, we cannot finetune an accent on the models in a way that generalizes to new speakers for voice cloning (see https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-664704917). My hypothesis is that the accent is attributed to the speaker embedding (of the dataset used for finetuning), so it never generalizes. This would give us a tool to help get around that limitation.
Edit: This is essentially implementing Global Style Tokens: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/230 . Will use Mozilla's repo as a guide to follow.
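A minimal sketch of the concatenation idea (the function name and the number of extra dimensions are made up; the synthesizer's conditioning dimension would need to grow accordingly):

import torch

def extend_embedding(speaker_embed, custom_dims=None, n_custom=4):
    # speaker_embed: (batch, 256) from the encoder; the extra dims default to zeros
    if custom_dims is None:
        custom_dims = torch.zeros(speaker_embed.shape[0], n_custom, device=speaker_embed.device)
    return torch.cat([speaker_embed, custom_dims], dim=-1)  # (batch, 256 + n_custom)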
The synth training reached 200k steps; I stopped it to give the server a break.
I am still downloading the datasets for the encoder; I will get started on it tomorrow.
@mbdash Are you able to share the 200k checkpoint files or vocoded samples at the very least? I'd like to see how well the 200k model performs!
Synth trained on LibriTTS for 200k steps with the old/original encoder.
https://drive.google.com/drive/folders/1ah6QNyB8jIcFuKusPOVdx0pPIZxeZeul?usp=sharing
Let me know if the link works or not, and if any files are missing.
Thanks @mbdash ! I got it to work but needed to put it in a folder structure like this:
logs-LibriTTS_200k
* taco_pretrained
* checkpoint
* tacotron_model.ckpt-200000.data-00000-of-00001
* tacotron_model.ckpt-200000.index
* tacotron_model.ckpt-200000.meta
The checkpoint file is not included but it is easy enough to make. It is a text file with a single line:
model_checkpoint_path: "tacotron_model.ckpt-200000"
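One quick way to create it (assuming the folder layout above):

from pathlib import Path

Path("logs-LibriTTS_200k/taco_pretrained/checkpoint").write_text(
    'model_checkpoint_path: "tacotron_model.ckpt-200000"\n'
)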
So far I am finding cloned voices sound nearly identical to Corentin's LibriSpeech_278k model, with better performance for very short text inputs (1-5 words). It is still liable to have gaps, but they are not multiple seconds like we have with LibriSpeech_278k. The synthesizer can fail spectacularly, but this is a rare exception and not the norm. Some punctuation has an effect (periods and commas), but I don't notice anything with question marks. I think question marks would be better handled using a global style token like we are discussing in #230.
Overall an improvement over the existing model, though a slight one. This is all we could expect.
Great to hear. I am still downloading the VoxCeleb files. Once done, I will train the encoder and we can try again training the synth from scratch.
If anyone else is silently following along I would appreciate any comments on the LibriTTS_200k model (https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/449#issuecomment-665645153) so we can use that feedback to make the next one better.
Hi @mbdash and thank you for sharing this new synth model! I have tried it and the voice seems identical to the one generated by the old synth model. To me, the old one was also fairly similar to the input voice. However, I find the new one to be noisier compared to the old one. Do you think you can achieve better performance by also training the encoder? Is the noise due to some imperfection in the embedding computation phase?
@shoegazerstella Thank you for reporting the issue, would you please share some audio samples with us that demonstrate what you are talking about?
Just to speculate, the audio preprocessing could be adding noise or other artifacts into the sound files, so it is worth doing a before-and-after comparison. LibriTTS is 24 kHz instead of the 16 kHz of LibriSpeech (used to train the original models), and since that is not an integer ratio, our training data also needs to be interpolated as it is resampled. The librosa resampling process can be found in librosa/core/audio.py (the actual resampling is done by scipy or resampy).
However I think that is unlikely. Could also be due to fewer training steps (200k vs 278k). Also LibriSpeech utterances are longer on average than LibriTTS so for a given number of steps I would expect a more refined model from LibriSpeech.
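If anyone wants to check the resampling hypothesis anyway, a quick before/after comparison could look like this (the file path is a placeholder):

import librosa

wav_24k, sr = librosa.load("path/to/libritts_sample.wav", sr=None)  # native 24 kHz
wav_16k = librosa.resample(wav_24k, orig_sr=sr, target_sr=16000)    # non-integer ratio, so interpolation
# Listen to / plot both versions to hear whether the resampling adds audible artifacts.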
Hi @blue-fish, sure, I can share some examples here:
Thank you for the explanation of the preprocessing steps! I have one question: was the model trained from scratch on LibriTTS, or did you start from a model pre-trained on LibriSpeech_278k? Do you think that approach could make sense for improving its performance?
Thanks for sharing the samples @shoegazerstella ! The increased noise on LibriTTS_200k is quite obvious. In addition to more training I think it could also benefit from a new vocoder.
LibriTTS_200k is trained from scratch. We have several problems with LibriSpeech_278k, the most annoying of which is the long gaps that appear in the middle of spectrograms (#53). The training from scratch is part of an effort to fix these issues: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/364#issuecomment-653996443 .
I think the next step is to lower max_mel_frames and find some way to clean up LibriTTS (probably by calculating the ratio of wav length to transcript length and removing outliers).
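A rough sketch of that outlier filter (assuming the train.txt column layout used by the snippet later in this thread, i.e. audio|mel|embed|timesteps|mel_frames|text):

from pathlib import Path
import statistics

with Path("train.txt").open("r") as f:
    rows = [line.strip().split("|") for line in f]

# Wav length (timesteps) per transcript character; outliers likely hide long silences or bad transcripts
ratios = [int(r[3]) / max(len(r[5]), 1) for r in rows]
mean, stdev = statistics.mean(ratios), statistics.stdev(ratios)
keep = [r for r, ratio in zip(rows, ratios) if abs(ratio - mean) <= 3 * stdev]
print(f"Keeping {len(keep)} of {len(rows)} utterances")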
Switching to a pytorch-based synthesizer in #447 may also help since the Rayhane-mamah tacotron that we currently use has some known bugs that would go away by switching to fatchord's implementation in WaveRNN.
Would anyone else like to contribute a GPU to help develop a better synthesizer model? Reply here and get started by preprocessing LibriTTS: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/449#issuecomment-663785345
Hi @blue-fish, we should be able to contribute to retraining the model too. The max we can use is V100 GPUs. I'll make some trials and see how many we can provide. How long do you think it would take? If not to fully complete, at least to reach something you can then continue and finish? I am now downloading LibriTTS and will proceed with its preprocessing following the steps you suggested in the comment above. I will let you know before starting the training so we can discuss whether any hparams need to be changed.
@shoegazerstella Thank you so much! I expect it will take 4-7 days to get a pretrained model for each config, maybe half that if we're just testing hparams and not training to perfection. As a reference point, @mbdash trained LibriTTS_200k in just over 2 days on a 2080ti. Please download the torch-based synthesizer from #472. This will be our new code base, which will eventually support global style tokens (#230).
Since putting out the request for help, I discovered that we will need a new vocoder so we should take this opportunity to increase the sample rate to 22,050 or 24,000 Hz. This will require preprocessing to be restarted, but we will get better audio quality in the end.
Do you need me to push the updated hparams to my fork, or do you prefer to figure it out yourself? Note the preprocessing scripts in #472 still reference the old synth, so you will need to modify the old synth's hparams for preprocessing.
I notice that at a low number of steps (say 25k), inference is very sensitive to trailing punctuation. For example, "Hello world" (top plot) synthesizes with a lot of trailing emptiness, while "Hello world." (bottom plot) cleanly terminates. The LibriTTS_200k model from @mbdash shows that it can be overcome with additional training, but I do not like this behavior.
Now experimenting with stripping trailing punctuation, which should make the model use the end-of-sequence symbol ("~") as the indication of when to stop, instead of the punctuation. If it works well I will add an hparam to ignore punctuation at the end of a text.
Also, now restricting the training set to 500 mel frames or less (default 900) to avoid long silences in the middle of utterances (https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/364#issuecomment-653996443). Here is the code snippet I use to post-process datasets_root/SV2TTS/synthesizer/train.txt to implement both of these changes:
from pathlib import Path
import string

with Path("train.txt").open("r") as metadata_file:
    metadata = [line.split("|") for line in metadata_file]

max_frames = 500
x0 = [x[0] for x in metadata if int(x[4]) <= max_frames]  # audio filename
x1 = [x[1] for x in metadata if int(x[4]) <= max_frames]  # mel filename
x2 = [x[2] for x in metadata if int(x[4]) <= max_frames]  # embed filename
x3 = [x[3] for x in metadata if int(x[4]) <= max_frames]  # timesteps
x4 = [x[4] for x in metadata if int(x[4]) <= max_frames]  # mel frames
x5 = [x[5] for x in metadata if int(x[4]) <= max_frames]  # text

with Path("train_edit.txt").open("w") as output_file:
    for i in range(len(x0)):
        text = x5[i].strip().strip(string.punctuation)  # first strip() removes the newline
        output_file.write("|".join([x0[i], x1[i], x2[i], x3[i], x4[i], text]) + "\n")
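Presumably train_edit.txt then replaces train.txt (after backing up the original) before synthesizer training is restarted.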
Hi @blue-fish, so I cloned your fork.
Do you need me to push the updated hparams to my fork, or do you prefer to figure it out yourself? Note the preprocessing scripts in #472 still reference the old synth, so you will need to modify the old synth's hparams for preprocessing.
For preprocessing, I am modifying the hparams here, is that correct? I will change the sample rate to 22,050 Hz. Do I also need to change the hop and win_length accordingly? How can I figure out what values to assign?
Thanks!
Hi @shoegazerstella !
Something else I discovered since then: I made a mistake in how I was passing the data to the vocoder. Once I fixed that, I found that the original vocoder (16,000 Hz) works quite well. Since I am already training a model at 16,000 Hz, why don't you use 22,050 Hz for better quality? We don't have a 22,050 Hz vocoder model, so it will be a nice contribution.
I have also had good results with changing max_mel_frames to 500. This has the following benefits:
What I am currently struggling with is punctuation. If my text has a comma, then my model introduces a 3-4 second pause. Additional training should fix it.
@shoegazerstella You might want to run synthesizer_train.py with -s 500 to save the model every 500 steps (that way you do not lose too much progress when stopping and restarting).
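For example (the run name and dataset path here are placeholders):
python synthesizer_train.py LibriTTS_22khz path/to/datasets_folder/SV2TTS/synthesizer -s 500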
Hi @blue-fish, thanks a lot for your help! Training is now in progress; the configuration follows the parameters you suggested above.
I had another little issue similar to https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/439#issuecomment-673349904, so it seems it is processing only 24353 samples. Is that correct? Thanks!
Initialising Tacotron Model...
Trainable Parameters: 24.888M
Starting the training of Tacotron from scratch
Using inputs from:
DATA/SV2TTS/synthesizer/train.txt
DATA/SV2TTS/synthesizer/mels
DATA/SV2TTS/synthesizer/embeds
Found 24353 samples
+----------------+------------+---------------+------------------+
| Steps with r=7 | Batch Size | Learning Rate | Outputs/Step (r) |
+----------------+------------+---------------+------------------+
| 10k Steps | 32 | 0.001 | 7 |
+----------------+------------+---------------+------------------+
/opt/conda/lib/python3.6/site-packages/torch/nn/modules/rnn.py:211: RuntimeWarning: RNN module weights are not part of single contiguous chunk of memory. This means they need to be compacted at every call, possibly greatly increasing memory usage. To compact weights again call flatten_parameters().
self.dropout, self.training, self.bidirectional, self.batch_first)
| Epoch: 1/14 (762/762) | Loss: 0.8026 | 1.0 steps/s | Step: 0k |
| Epoch: 2/14 (762/762) | Loss: 0.7637 | 1.0 steps/s | Step: 1k |
| Epoch: 3/14 (476/762) | Loss: 0.7511 | 1.0 steps/s | Step: 2k | Input at step 2000: my dear child, i said grandly, do you really suppose i am afraid of that poor wretch?~__________________________
| Epoch: 3/14 (762/762) | Loss: 0.7460 | 1.0 steps/s | Step: 2k |
| Epoch: 4/14 (361/762) | Loss: 0.7274 | 1.0 steps/s | Step: 2k |
I restarted the training from scratch with the correct number of samples, I am now at step 8k. I will share later some spectrogram plots + wavs.
Originally posted by @mbdash in https://github.com/CorentinJ/Real-Time-Voice-Cloning/pull/441#issuecomment-663076421