Pretrained synthesizer + 200 steps of training on VCTK p240 samples (0.34 hours of speech). Still using the original vocoder model. This is just a few minutes of CPU time for fine-tuning. It is remarkable that the synthesizer is already imparting the accent onto the result. This is good news for anyone who is fine-tuning an accent: it should not take too long, even for a multispeaker model.
I did notice a lot more gaps and sound artifacts than usual with the finetuned model (this result is cherry-picked). Is it because I did not hardcode all the samples to a single utterance embedding?
Here are some samples from the latest experiment. VCTK p240 is used to add 4.4k steps to the synthesizer, and 1.0k to the vocoder. Synthesized audios have the filename speaker_utterance_SYN_VOC.wav and use all combinations of pretrained ("pre") and finetuned ("fin") models for the synthesizer and vocoder, respectively.
Synthesized utterances using speaker p240's hardcoded embedding (derived from p240_001_mic1.flac) show the success in finetuning to match the voice, including the accent. Samples made from speaker p260's embedding demonstrate how much quality is lost when finetuning a single-speaker model.
In these samples, the synthesizer has far more impact on quality, though this result could be due to insufficient finetuning of the vocoder. Though the finetuned vocoder has only a slight advantage over the original for p240, it severely degrades voice cloning quality for p260.
Also compare to the samples for p240 and p260 in the Google SV2TTS paper: https://google.github.io/tacotron/publications/speaker_adaptation/
Here is a preprocessed p240 dataset if you would like to repeat this experiment. The embeds for utterances 002-380 are overwritten with the one for 001, as the hardcoding makes for a more consistent result. Use the audio file p240_001.flac to generate embeddings for inference. The audio files are not included to keep the file size down, so if you care to do vocoder training you will need to get and preprocess VCTK.
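The hardcoding itself is easy to reproduce on your own dataset. Here is a minimal sketch (not the exact script used for this zip); it assumes the embed-*.npy naming produced by the SV2TTS preprocessing, so adjust the paths and filenames to match yours.

```python
# Sketch: overwrite every utterance embedding with a single reference embedding.
from pathlib import Path
import shutil

embeds_dir = Path("dataset_p240/SV2TTS/synthesizer/embeds")  # assumed path
reference = embeds_dir / "embed-p240_001_mic1.npy"           # assumed filename

for embed_file in embeds_dir.glob("embed-*.npy"):
    if embed_file != reference:
        shutil.copyfile(reference, embed_file)  # replace with the reference embed
```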
Directions:
1. Copy synthesizer/saved_models/logs-pretrained to logs-vctkp240 in the same location. This will make a copy of your pretrained model to be finetuned.
2. Extract dataset_p240 into your Real-Time-Voice-Cloning folder (or somewhere else if you desire).
3. Run: python synthesizer_train.py vctkp240 dataset_p240/SV2TTS/synthesizer --checkpoint_interval 100
4. Use the audio file dataset_p240/p240_001.flac to generate the embedding (a scripted alternative is sketched just below).
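If you want to generate that embedding from a script instead of the toolbox, something along these lines should work. This is a minimal sketch based on the demo_cli.py flow; the encoder model path is the repo default and the output filename is just an example.

```python
# Sketch: compute the reference speaker embedding for p240_001.flac.
# Paths are assumptions; adjust to your checkout and dataset location.
from pathlib import Path
import numpy as np
from encoder import inference as encoder

encoder.load_model(Path("encoder/saved_models/pretrained.pt"))
wav = encoder.preprocess_wav(Path("dataset_p240/p240_001.flac"))
embed = encoder.embed_utterance(wav)
np.save("p240_001_embed.npy", embed)  # 256-dim speaker embedding
```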
Wow, that is amazing... I only asked your opinion and you actually did it!
The difference is incredible.
Now I just need to dumb down all you wrote to be able to reproduce it.
Also try your_input_text.replace('hi', 'eye'); it is a little cheat that I find gives better results currently, at least in the multi-speaker model.
> Now I just need to dumb down all you wrote to be able to reproduce it.
@mbdash In the first post I included a dropbox link that has fairly detailed instructions for the single-speaker LibriSpeech example. You can try that and ask if you have any trouble reproducing the results. If you want VCTKp240 I can make a zip file for you tomorrow.
This was much easier and faster than expected. I am sharing the results to generate interest, so we can collaborate on how much training is needed, best values of hparams, etc.
Thank you, I will look at it tomorrow morning. I am only staying up for a few more minutes; I am a bit too tired to think straight right now.
Tonight I am trying to keep it simple and see if I can jam a regular "hand modeled" 3D head mesh into VOCA (Voice Operated Character Animation) (another GitHub project).
Update: nope, it exploded.
Some general observations to share:
Also I did another experiment and trained the synthesizer for about 5,000 additional steps on the entire VCTK dataset (trying to help out on #388). The accent still does not transfer for zero-shot cloning. I suspect the synthesizer needs to be trained from scratch if that is the goal.
P.S. @mbdash I updated the VCTKp240 post with a single-speaker dataset if you would like to try that out. https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-663308789
> Also I did another experiment and trained the synthesizer for about 5,000 additional steps on the entire VCTK dataset (trying to help out on #388). The accent still does not transfer for zero-shot cloning. I suspect the synthesizer needs to be trained from scratch if that is the goal.
Changing my mind on training from scratch: I think we just need to add an extra input parameter to the synthesizer which indicates the accent, or more precisely the dataset it is trained on. A simple implementation might be a single bit representing LibriSpeech or VCTK. Next, finetune the existing models on VCTK with the added parameter. Then, for inference, specify the dataset that you want the result to sound like. I'm at a loss how to implement this with the current set of models, but I think this repo will have clues: https://github.com/Tomiinek/Multilingual_Text_to_Speech
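To make the idea concrete, here is a rough illustration of what that "dataset bit" could look like at the embedding level. This is purely hypothetical and not implemented anywhere in the repo; the synthesizer would also need its input dimension changed to accept the extra value.

```python
# Hypothetical sketch: append a one-bit dataset flag to the speaker embedding.
import numpy as np

def tag_embedding(speaker_embed: np.ndarray, dataset: str) -> np.ndarray:
    flag = 1.0 if dataset == "VCTK" else 0.0   # 0 = LibriSpeech, 1 = VCTK
    return np.concatenate([speaker_embed, [flag]]).astype(np.float32)

embed = np.random.rand(256).astype(np.float32)  # placeholder 256-dim embedding
print(tag_embedding(embed, "VCTK").shape)       # -> (257,)
```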
I'm all done with accent experiments for now but I hope this is helpful to anyone who wants to continue this work.
"I am going to write a tool that will allow users to manually select or record files to add to a custom dataset, and facilitate transcription (maybe using DeepSpeech). This tool will be hosted in a separate repository."
Thank you for all your hard work on this repo - even as an almost complete newcomer to deep learning, I've been able to decipher some things, but I'm still stymied by the inability to create custom datasets from scratch. Are you still working on this "custom dataset" tool that you mention here?
@blue-fish any reason why I'm getting the following error: "synthesizer_train.py: error: the following arguments are required: synthesizer_root"? I'm trying to run:
synthesizer_train.py H:\ttss\Real-Time-Voice-Cloning-master\dataset_p240\SV2TTS\synthesizer --checkpoint_interval 100
The second argument is the folder that contains embeds, mels, and train.txt.
Never mind, I fixed it while writing this. The argument isn't --synthesizer_root like all of the other arguments, but actually just synthesizer_root. Also, the above testing instructions are thus wrong (or at least not working for me). The command should be:
python synthesizer_train.py synthesizer_root dataset_p240/SV2TTS/synthesizer --checkpoint_interval 100
(It at least bumped me to a DLL error; still working through that one.)
> I'm still stymied by the inability to create custom datasets from scratch. Are you still working on this "custom dataset" tool that you mention here?
Hi @Adam-Mortimer. The custom dataset tool is still planned, but currently on hold as I've just started working #447 (switching out the synthesizer for fatchord's tacotron). #447 will be bigger than all of my existing pull requests combined if it ever gets finished. In other words, it's going to take quite some time.
I started writing the custom dataset tool for a voice cloning experiment. I didn't get very far with the tool before I added LibriTTS support in #441 which made it much easier to create a dataset by putting your data in this kind of directory structure:
datasets_root
* LibriTTS
  * train-clean-100
    * speaker-001
      * book-001
        * utterance-001.wav
        * utterance-001.txt
        * utterance-002.wav
        * utterance-002.txt
        * utterance-003.wav
        * utterance-003.txt
Where each utterance-###.wav is a short utterance (2-10 sec) and the utterance-###.txt contains the corresponding transcript. Then you can process this dataset using:
python synthesizer_preprocess_audio.py datasets_root --datasets_name LibriTTS --subfolders train-clean-100 --no_alignments
When this completes, your dataset is in the SV2TTS format and subsequent preprocessing commands (synthesizer_preprocess_embeds.py, vocoder_preprocess.py) will work as described on the training wiki page.
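If your recordings start out as a flat folder of paired wav/txt files, a small helper along these lines can arrange them into the layout above before running the preprocessing command. This is a hypothetical helper, not part of the repo, and the folder names are just examples.

```python
# Sketch: copy paired utterance wav/txt files into a LibriTTS-style layout.
from pathlib import Path
import shutil

flat_dir = Path("my_recordings")  # contains utterance-001.wav, utterance-001.txt, ...
out_dir = Path("datasets_root/LibriTTS/train-clean-100/speaker-001/book-001")
out_dir.mkdir(parents=True, exist_ok=True)

for wav in sorted(flat_dir.glob("*.wav")):
    txt = wav.with_suffix(".txt")
    if not txt.exists():
        print(f"Skipping {wav.name}: no matching transcript")
        continue
    shutil.copy(wav, out_dir / wav.name)
    shutil.copy(txt, out_dir / txt.name)
```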
I would still like to write the custom dataset tool but I think #447 is a more pressing matter since the toolbox is incompatible with Python 3.8 due to our reliance on Tensorflow 1.x.
@Ori-Pixel There was a problem with my command and I fixed it. If you are following everything to the letter it should be:
python synthesizer_train.py vctkp240 dataset_p240/SV2TTS/synthesizer --checkpoint_interval 100
Where the first arg vctkp240 describes the path to the model you are training (in this case, it tells Python to look for the model in synthesizer/saved_models/logs-vctkp240), and the second arg is the path to the location containing train.txt and the mels and embeds folders. Please share your results and feel free to ask for help if you get stuck.
@blue-fish Thanks. Yeah, I can see that it's saving to a new directory; I'll run it again with the correct params and post results.
Also, thanks for the preprocessing tips you gave to @Adam-Mortimer. I was not looking forward to custom labeling, but it doesn't seem that bad if I only have ~200 lines / ~34 minutes. I'm trying to make a fake (semi-Gaelic) accented video game character say some lines, so I'll probably scrape the audio files from the wiki site, slap them into a folder structure like the one above with a simple script, and then run this single-speaker fine-tuning again. And for the accent, I think I can just find a semi-close one in the VCTK dataset (although a 10 GB download will take me a few days, sadly).
@Ori-Pixel If you have a GPU you can quickly run a few experiments to see how far you can trim the dataset before the audio quality breaks down. Simply delete lines from train.txt and they won't be used.
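For example, something like this keeps only the first N utterances of a preprocessed dataset (a sketch; the path is the p240 example from this thread and the backup is just a precaution):

```python
# Sketch: trim train.txt so only the first N utterances are used for training.
from pathlib import Path

train_txt = Path("dataset_p240/SV2TTS/synthesizer/train.txt")  # example path
keep = 80  # roughly 5 minutes of speech, as in the experiment described below

lines = train_txt.read_text(encoding="utf-8").splitlines(keepends=True)
train_txt.with_name("train_full.txt").write_text("".join(lines), encoding="utf-8")  # backup
train_txt.write_text("".join(lines[:keep]), encoding="utf-8")
```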
One of my experiments involved re-recording some of the VCTK p240 utterances with a different voice. 5 minutes of mediocre data (80 utterances) still resulted in a half-decent model. If the labeling is extremely tedious you can try training a model on part of it while continuing to label.
I have preprocessed VCTK, if you can make your decision based on a single recording, request up to 3 speakers and I'll put them on dropbox for you. https://www.dropbox.com/s/6ve00tjjaab4aqj/VCTK_samples.zip?dl=0
Oh, and just to be clear, you cannot train the voice and accent independently at this time. The accent is associated with the voice via the speaker embedding. After #447, we will work on #230 to add the Mozilla TTS implementation of GSTs. That should allow us to generalize accents to new voices.
@blue-fish
> I have preprocessed VCTK, if you can make your decision based on a single recording, request up to 3 speakers and I'll put them on dropbox for you. https://www.dropbox.com/s/6ve00tjjaab4aqj/VCTK_samples.zip?dl=0
Is there a list of their speakers somewhere? I was only able to find the 10 GB file, with not even a magnet link or anything denoting samples or file structure. I mean, realistically anything Irish, Scottish, or Gaelic would work. I may also look into downloading it directly to Drive (if possible) and possibly even training there (if possible; as far as I'm aware you can mount the drive and run bash).
> Oh, and just to be clear, you cannot train the voice and accent independently at this time. The accent is associated with the voice via the speaker embedding. After #447, we will work on #230 to add the Mozilla TTS implementation of GSTs. That should allow us to generalize accents to new voices.
Yeah, I just meant using a VCTK pretrained model that wasn't horribly inconsistent with my single speaker's accent and then fine-tuning with my custom labeled lines on top.
I also have a couple of idle GPUs in my machine, but I always run into venv issues with GPU training, so I'll just use Colab if I really need a GPU. Too bad downloading from a link to...
> Is there a list of their speakers somewhere?
The zip file I uploaded includes speaker-log.txt (which is included in the full VCTK dataset), which has a list of speaker metadata such as:
ID AGE GENDER ACCENTS REGION COMMENTS
p225 23 F English Southern England
p226 22 M English Surrey
p227 38 M English Cumbria
p228 22 F English Southern England
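If it helps, candidate speakers can be pulled out of that file by accent with a few lines. This sketch assumes the whitespace-separated layout shown above, with ACCENTS as the fourth field; everything after it is treated as the region.

```python
# Sketch: list VCTK speakers whose accent matches the ones we care about.
wanted = {"Irish", "Scottish"}

with open("speaker-log.txt", encoding="utf-8") as f:
    next(f)  # skip the header line
    for line in f:
        fields = line.split()
        if len(fields) >= 4 and fields[3] in wanted:
            print(fields[0], fields[3], " ".join(fields[4:]))
```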
Ah I see. I'll give it a look tomorrow along with the results and let you know then, thanks again for being so active!
@blue-fish p261 is relatively close. If I could get that slice, that would be very helpful (my internet at my current house is sadly 1 MB/s).
I trained as per the instructions above. Sadly I didn't get to see the console output as my power went out after about an hour or so, but I did get this in the training logs, so I think this is as far as it trained.
[2020-07-30 01:36:31.676] Step 278202 [28.894 sec/step, loss=0.64379, avg_loss=0.64339]
Also, just to make sure I did the test, this is the cmd I used:
python demo_toolbox.py -d H:\ttss\Real-Time-Voice-Cloning-master\dataset_p240
Where random seed = 1, enhanced vocoder output is checked, embedding was from p240_1.flac.
resulting audio: https://raw.githubusercontent.com/Ori-Pixel/files/master/welcome_to_toolbox_fine_tuned.flac
@Ori-Pixel
Here is the dataset in the same format as p240 (embeds overwritten with the one corresponding to p261_001.flac): https://www.dropbox.com/s/o6fz2r6w56djwkf/dataset_p261.zip?dl=0
Audio files (these go in SV2TTS/synthesizer/audio): https://www.dropbox.com/s/q3bihpem7os54yi/p261_audio.zip?dl=0

> resulting audio: https://raw.githubusercontent.com/Ori-Pixel/files/master/welcome_to_toolbox_fine_tuned.flac
Your results sound American to me. Check that you are using the new synthesizer model, then try this text: Take a look at these pages for crooked creek drive.
And compare to my results for 200 steps: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-663050677
> Check that you are using the new synthesizer model
Ah, I didn't have that drop down selected. My results are then this, with the same settings:
I'm also taking your comment above and trying to train my own dataset, but at first I got a "datasets_root folder doesn't exist" error, so I made the folder and added my files. But when I go to train, I get:
Arguments:
datasets_root: datasets_root
out_dir: datasets_root\SV2TTS\synthesizer
n_processes: None
skip_existing: False
hparams:
no_alignments: False
datasets_name: LibriTTS
subfolders: train-clean-100
Using data from:
datasets_root\LibriTTS\train-clean-100
LibriTTS: 0%| | 0/1 [00:00<?, ?speakers/s]2
gpu warnings here
LibriTTS: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.20s/speakers]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
File "synthesizer_preprocess_audio.py", line 59, in <module>
preprocess_dataset(**vars(args))
File "H:\ttss\Real-Time-Voice-Cloning-master\synthesizer\preprocess.py", line 49, in preprocess_dataset
print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence
Utterances have just the text that was spoken in them, so utterance-000.txt contains Let's have some fun, shall we...
edit: I assume I will need to go through the training docs and start by training the encoder?
@Ori-Pixel You also need to add the --no_alignments option to use a non-LibriSpeech dataset that doesn't have an alignments file. I've also fixed the command in the instructions above. Sorry for leaving that out earlier.
python synthesizer_preprocess_audio.py datasets_root --datasets_name LibriTTS --subfolders train-clean-100 --no_alignments
Edit: If preprocessing completes without finding a wav file, we should remind the user to pass the --no_alignments flag. Or possibly default it to True if the datasets_name is not LibriSpeech.
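That guard might look something like this (a hypothetical sketch, not the repo's actual argument handling):

```python
# Sketch: default --no_alignments to True for non-LibriSpeech datasets.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("datasets_root")
parser.add_argument("--datasets_name", default="LibriSpeech")
parser.add_argument("--no_alignments", action="store_true")
args = parser.parse_args()

if args.datasets_name != "LibriSpeech" and not args.no_alignments:
    print("Non-LibriSpeech dataset detected: enabling --no_alignments automatically.")
    args.no_alignments = True
```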
@blue-fish Okay, so I got it to train, and I can also train my own dataset for the synthesizer. Really thankful for the help. Here's a result from 200 steps of training if you're interested:
https://raw.githubusercontent.com/Ori-Pixel/files/master/crooked_creek_dw.flac
https://raw.githubusercontent.com/Ori-Pixel/files/master/biggest_oversight.flac
@Ori-Pixel Nice! It's remarkable how much that voice comes through after 200 steps of finetuning. In my own experiments going up to 400 steps yields a noticeable improvement in the voice quality. More than 400 doesn't seem to help, though it doesn't hurt either.
Edit: You trained on CPU right? How long did it take?
@blue-fish I did train on CPU (autocorrect!!). (I always have issues with GPU setup. Luckily I'm building a new PC when the 30xx cards drop with the new Zen 2 AMD CPUs.) After trying to train from 200 to 400 steps, it seems to take ~25 s per step after the first 20 steps, so around 2 hours for 200 steps on an i5 4690K.
The next steps for me would be encoder/vocoder training, but I don't want to invest the compute power since I'm working on another NLP problem for my actual research (sentiment analysis). I'll let it run overnight again and this time see how far it gets :)
edit: as @blue-fish said, it seems training it to 400 steps made a large difference. Here's an example of the same voice as above, but with 400 steps of training the p261 set on my own collected voice samples:
original voice: https://raw.githubusercontent.com/Ori-Pixel/files/master/Vo_dark_willow_sylph_attack_14.mp3 200 steps: https://raw.githubusercontent.com/Ori-Pixel/files/master/biggest_oversight.flac 400 steps: https://raw.githubusercontent.com/Ori-Pixel/files/master/dark%20willow%20400.flac
@blue-fish I did exactly what you said, after over 10000 steps with the synthesizer, I try to open the toolbox. I type the text to convert into the box, and I get some unrelated text in an almost incomprehensible ramble.
@adfost Which set of instructions are you following? LibriSpeech (https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issue-663639627) or VCTKp240 (https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-663308789)?
Most likely, when you run synthesizer_train.py it cannot find the pretrained model, so it starts training a new synthesizer model from scratch. Please make sure you copied the entire contents of synthesizer/saved_models/logs-pretrained to another "logs-XXXX" folder in the same location, and specify the name (XXXX) to synthesizer_train.py as the first argument.
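The copy step itself is trivial, for example (paths are the defaults; "myvoice" stands in for whatever name you pass to synthesizer_train.py):

```python
# Sketch: duplicate the pretrained synthesizer into a new run folder for finetuning.
from pathlib import Path
import shutil

src = Path("synthesizer/saved_models/logs-pretrained")
dst = Path("synthesizer/saved_models/logs-myvoice")  # then: python synthesizer_train.py myvoice ...
shutil.copytree(src, dst)  # raises an error if dst already exists
```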
I have a set of 5 seconds long flac files from a single speaker. Is it possible to train it without any transcript?
And also, I noticed that in this procedure the training of the encoder, audio, and embedding is completely skipped (python encoder_preprocess.py ...).
I'm trying to train on a voice (not from LibriSpeech)
Thank you
I guess transcripts are needed; I wrote a script that does it automatically. I set it so the transcript is in the same format as 211-122425.trans.txt.
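For reference, that LibriSpeech-style transcript file is just one "<utterance_id> <transcript>" pair per line, so the generator can be as simple as this sketch (the IDs and text below are placeholders):

```python
# Sketch: write a LibriSpeech-style .trans.txt file from id/text pairs.
transcripts = {
    "211-122425-0000": "EXAMPLE TRANSCRIPT FOR THE FIRST UTTERANCE",
    "211-122425-0001": "EXAMPLE TRANSCRIPT FOR THE SECOND UTTERANCE",
}

with open("211-122425.trans.txt", "w", encoding="utf-8") as f:
    for utt_id, text in transcripts.items():
        f.write(f"{utt_id} {text}\n")
```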
I am getting this error while preprocessing the audio for my own dataset (not from LibriSpeech, it has no alignment)
Traceback (most recent call last):
File "synthesizer_preprocess_audio.py", line 59, in
@DereWah As you reasoned, the transcripts are necessary to train the synthesizer (which you can think of as a black box that converts text to mels). Make your folder in the same structure as https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-666099538 and it should work. In the future, please also include the python command that was run to facilitate troubleshooting. Good luck!
@DereWah check this comment
I think the --no_alignments option will fix this error, since I had that as well in the comment above. Also, add your console/terminal command.
This is the command I am using: python synthesizer_preprocess_audio.py synthesizer/saved_models/logs-singlespeaker/datasets_root -n 1 --no_alignment
Using no_alignment isn't changing anything. Also I am trying to run this on my GPU, I have cudart64_100.dll and I have installed requirements_gpu.txt
There must be a problem with generating the metadata, I think it will be fixed by making the folder in the same structure as in comment https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-666099538
And also, how should the utterance.txt file be formatted? Thank you for the help and the patience
@DereWah mine only have the line of text of the spoken line:
utterance-000.txt contains Mireska is here.
utterance-001.txt contains Are you ready to have some fun, ya?
etc.
It worked, I managed to generate those files into datasets_root\SV2TTS\synthesizer. Thank you :)
I am getting this error:
Traceback (most recent call last):
File "synthesizer_train.py", line 55, in
For some reason there is a \\ before the file name; the slashes are doubled.
while running python synthesizer_train.py first_run dataset_root/SV2TTS/synthesizer --checkpoint_interval 100
Fixed, by using and editing the command in the readme ("Now the pretrained synthesizer model can be trained on the reduced dataset. No changes to hparams are needed."):
python synthesizer_train.py first_run synthesizer/saved_models/logs-first_run/datasets_root/SV2TTS/synthesizer --summary_interval 125 --checkpoint_interval 100
It is now saying No model to load at synthesizer/saved_models/logs-first_run\taco_pretrained
Generated 0 test batches of size 36 in 0.000 sec Generated 64 train batches of size 36 in 137.217 sec etc.
I guess it's training.
I noticed though that my GPU (1060 Ti) is at 0%, while the CPU (8th gen i7) is doing the work. Based on those specs, which part should I run the training on (to optimize time/power consumption)?
@DereWah
> It is now saying No model to load at synthesizer/saved_models/logs-first_run\taco_pretrained
Make sure your logs-first_run folder is a copy of logs-pretrained, or you won't get a good output. That folder contains the checkpoints it loads and then finetunes on top of.
Edit: you also don't have TensorFlow GPU set up/running if your CPU is the one processing. You should be getting a DLL warning of some sort indicating that you aren't using the GPU. GPU is faster, but it isn't unreasonable to train to 400 steps with the CPU.
> I noticed though that my GPU (1060 Ti) is at 0%, while the CPU (8th gen i7) is doing the work.
The tensorflow 1.15 binaries provided by pip are only compatible with CUDA 10.0. GPU support for the synthesizer (the only part that relies on tensorflow) requires you get a proper nvidia driver version and cuda libraries. You will lose much more time setting that up, than you stand to gain in speedup for those 400 steps. (If it is 10 sec/step with the i7 and 1 sec/step with GPU, the speedup will only save an hour over 400 steps.)
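A quick back-of-the-envelope check of that estimate, using the assumed per-step timings:

```python
# Sketch: time saved by a GPU over a short finetuning run (assumed timings).
cpu_sec_per_step, gpu_sec_per_step, steps = 10, 1, 400
saved_hours = steps * (cpu_sec_per_step - gpu_sec_per_step) / 3600
print(f"GPU would save about {saved_hours:.1f} h over {steps} steps")  # ~1.0 h
```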
I think I have already set up CUDA 10.0, and when I run it, TensorFlow prints a success message about opening cudart64_100.dll. I have also followed this tutorial: https://poorlydocumented.com/2019/11/installing-corentinjs-real-time-voice-cloning-project-on-windows-10-from-scratch/ and I have cudnn64_7.dll. I just don't know how to activate the GPU. Also, from training with only 13 and a half minutes, did you get mediocre or good results?
@DereWah The model, per the --checkpoint_interval 100 in the CLI instruction above, only saves checkpoints every 100 steps. So cancelling training after a fixed amount of time, if it hasn't reached a checkpoint and saved, won't change anything. You just need to check back with the terminal every once in a while and see if it has saved. The recommended amount from blue-fish is at least 200 steps (2 checkpoint saves).
@blue-fish Thank you for the help, I got it to work. However, the model I trained seems to work much better with longer input than shorter phrases. I think the problem is the lack of shorter examples in the training set I used. Any suggestions for better training sets?
@Ori-Pixel Thanks for helping others get this to work.
@adfost In general the toolbox performs best for inputs of 10-20 words. What problems are you noticing? They might be inherited from the synthesizer architecture and pretrained models.
@blue-fish If I say something like "Thank you for using our product", it says "thank you" followed by a very long pause, and then the rest of the sentence.
I meant only 13 and a half minutes of data, not of overall training time.
@adfost The gaps in spectrograms are a known issue (#53). I would also like to fix this but it's not easy: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/364#issuecomment-653996443 . To work around the issue you can use the "enhance vocoder output" option to trim the silences if you have webrtcvad installed.
@DereWah I've finetuned a voice using as little as 5 minutes of data. I would consider this to be the bare minimum, and 10-20 minutes to be generally adequate. Of course it also depends on how varied the dataset is, and how closely your target speaker matches one in the training set. The purpose of this issue is to discuss best practices for finetuning, please share what works well for you.
> @adfost The gaps in spectrograms are a known issue (#53). I would also like to fix this but it's not easy: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/364#issuecomment-653996443 . To work around the issue you can use the "enhance vocoder output" option to trim the silences if you have webrtcvad installed.
>
> @DereWah I've finetuned a voice using as little as 5 minutes of data. I would consider this to be the bare minimum, and 10-20 minutes to be generally adequate. Of course it also depends on how varied the dataset is, and how closely your target speaker matches one in the training set. The purpose of this issue is to discuss best practices for finetuning, please share what works well for you.
Thank you
Also, do you know how to activate the GPU instead of the CPU? I have the correct CUDA versions and files. Thank you.
@blue-fish Where is the enhance vocoder output option?
@DereWah I have been unsuccessful in my own attempts to get Tensorflow GPU support, so I can't help you there. GPU support for pytorch is much easier. We'll have it for the synthesizer following #447 (if it ever gets done).
@adfost The "enhance vocoder output" feature is enabled when webrtcvad
is installed. For demo_cli.py, it is always active. If using demo_toolbox.py, click the checkbox on the right side of the toolbox UI: https://github.com/CorentinJ/Real-Time-Voice-Cloning/pull/432#issuecomment-660940390
> @DereWah I have been unsuccessful in my own attempts to get Tensorflow GPU support, so I can't help you there. GPU support for pytorch is much easier. We'll have it for the synthesizer following #447 (if it ever gets done).
Understood, I'll just go with the CPU. Also, while talking with a friend about the project, we found out we got totally different outputs while trying it. He, using only the CLI on Colab, was getting .wav files without much noise or distortion.
Meanwhile, when I tried using the toolbox on my PC, I was getting distorted and "doubled" results. Both of us were using the same voices for embedding (not from LibriSpeech).
Maybe the CLI gives better results? Or maybe I'm having issues with the PC, I don't know.
(My files remained like that even while using the replay function.)
You probably don't have webrtcvad on the PC unless you went out of your way to install it. We took it out of requirements.txt because it was causing grief (#375). But when it is installed, the toolbox will clean the audio with it before making a speaker embed.
For the quality of the saved wavs using the toolbox vs the CLI: they both use soundfile.write() with resampling enabled. There could be some platform differences in libsndfile.
If you have some examples of the "distorted and doubled" wavs please open a new issue so we can investigate.
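An easy way to confirm whether a machine has it (a small sketch):

```python
# Sketch: check whether webrtcvad is importable ("enhance vocoder output" needs it).
try:
    import webrtcvad  # noqa: F401
    print("webrtcvad found: audio cleaning / silence trimming is available.")
except ImportError:
    print("webrtcvad not installed (pip install webrtcvad).")
```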
Summary
A relatively easy way to improve the quality of the toolbox output is through fine-tuning of the multispeaker pretrained models on a dataset of a single target speaker. Although it is no longer voice cloning, it is a shortcut for obtaining a single-speaker TTS model with less training data needed relative to training from scratch. This idea is not original, but a sample single-speaker model is presented along with a process and data for replicating the model.
Improvement in quality is obtained by taking the pretrained synthesizer model and training a few thousand steps on a single-speaker dataset. This amount of training can be done in less than a day on a CPU, and even faster with a GPU.
Procedure
Pretrained models and all files and commands needed to replicate this training can be found here: https://www.dropbox.com/s/bf4ti3i1iczolq5/logs-singlespeaker.zip?dl=0
Results
Download audio samples: samples.zip
These are generated with demo_toolbox.py and demonstrate the effect of synthesizer fine-tuning. "Pretrained" uses the original models, and "singlespeaker" uses the fine-tuned synthesizer model with the original vocoder model. I found the #432 changes helpful for benchmarking: all samples are generated with seed=1 and no trimming of silences. The single-speaker model is noticeably better, with fewer long gaps and artifacts for short utterances. However, gaps still occur sometimes: one example is "this is a big red apple." Output is also somewhat better with a fine-tuned vocoder model, though no samples with the new vocoder are shared at this time.
Discussion
This work helps to demonstrate the following points:
The major obstacle preventing single-speaker fine-tuning is the lack of a suitable tool for creating a custom dataset. The existing preprocessing scripts are suited to batch processing of organized, labeled datasets. The existing scripts are not helpful unless the target speaker is already part of a supported dataset. The preprocessing does not need to be fully automated because a small dataset on the order of 100 utterances is sufficient for fine-tuning. I am going to write a tool that will allow users to manually select or record files to add to a custom dataset, and facilitate transcription (maybe using DeepSpeech). This tool will be hosted in a separate repository.
Acknowledgements