CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Single speaker fine-tuning process and results #437

Closed (ghost closed this issue 3 years ago)

ghost commented 4 years ago

Summary

A relatively easy way to improve the quality of the toolbox output is through fine-tuning of the multispeaker pretrained models on a dataset of a single target speaker. Although it is no longer voice cloning, it is a shortcut for obtaining a single-speaker TTS model with less training data needed relative to training from scratch. This idea is not original, but a sample single-speaker model is presented along with a process and data for replicating the model.

Improvement in quality is obtained by taking the pretrained synthesizer model and training a few thousand steps on a single-speaker dataset. This amount of training can be done in less than a day on a CPU, and even faster with a GPU.

Procedure

Pretrained models and all files and commands needed to replicate this training can be found here: https://www.dropbox.com/s/bf4ti3i1iczolq5/logs-singlespeaker.zip?dl=0

  1. First, create a dataset of a single speaker from LibriSpeech. All embeddings are updated to reference the same file, as shown in the sketch after this list. (I'm not sure if this helps or not, but the idea is to get it to converge faster.)
    • It doesn't have to be LibriSpeech. This demonstrates the concept with minimal changes to existing files.
    • Total of 13.28 minutes (train-clean-100/211/122425/*)
  2. Next, continue training of the pretrained synthesizer model using the restricted dataset. Running overnight on a CPU, loss decreased from 0.70 to 0.50 over 2,600 steps. I plan to go further in subsequent tests.
  3. Generate new training data for the vocoder using the updated synthesizer model.
  4. Continue training of the pretrained vocoder. I only added 1,000 steps for now because I was eager to see if it worked, but the difference is noticeable even with a little fine-tuning.
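
For reference, here is a minimal sketch of the embedding hardcoding in step 1, assuming the SV2TTS synthesizer layout with one .npy embedding per utterance (the paths and the reference filename below are examples only, adjust them to your dataset):

    # Sketch: overwrite every utterance embedding with one reference embedding.
    from pathlib import Path
    import shutil

    embeds_dir = Path("datasets_root/SV2TTS/synthesizer/embeds")   # example path
    reference = embeds_dir / "embed-211-122425-0000.npy"           # example reference utterance

    for embed_file in embeds_dir.glob("*.npy"):
        if embed_file != reference:
            shutil.copyfile(reference, embed_file)                 # hardcode to the reference embedding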

Results

Download audio samples: samples.zip

These are generated with demo_toolbox.py and demonstrate the effect of synthesizer fine-tuning. "Pretrained" uses the original models, and "singlespeaker" uses the fine-tuned synthesizer model with the original vocoder model. I found the #432 changes helpful for benchmarking: all samples are generated with seed=1 and no silence trimming. The single-speaker model is noticeably better, with fewer long gaps and artifacts for short utterances. However, gaps still occur sometimes: one example is "this is a big red apple." Output is also somewhat better with a fine-tuned vocoder model, though no samples with the new vocoder are shared at this time.

Discussion

This work helps to demonstrate the following points:

  1. Deficiencies with the synthesizer and its pretrained model can be compensated for to some extent by fine-tuning to a single speaker. This is much easier than implementing a new synthesizer and requires far less training.
  2. A small dataset of 0.2 hours is sufficient for fine-tuning the synthesizer.
  3. Better single-speaker performance can be obtained with just a few thousand steps of additional synthesizer training.

The major obstacle preventing single-speaker fine-tuning is the lack of a suitable tool for creating a custom dataset. The existing preprocessing scripts are suited to batch processing of organized, labeled datasets and are not helpful unless the target speaker is already part of a supported dataset. The preprocessing does not need to be fully automated, because a small dataset on the order of 100 utterances is sufficient for fine-tuning. I am going to write a tool that will allow users to manually select or record files to add to a custom dataset, and facilitate transcription (maybe using DeepSpeech). This tool will be hosted in a separate repository.

Acknowledgements

ghost commented 4 years ago

Pretrained synthesizer + 200 steps of training on VCTK p240 samples (0.34 hours of speech). Still using the original vocoder model. This is just a few minutes of CPU time for fine-tuning. It is remarkable that the synthesizer is already imparting the accent to the result. This is good news for anyone who is fine-tuning for an accent: it should not take too long, even for multispeaker models.

I did notice a lot more gaps and sound artifacts than usual with the finetuned model (this result is cherry-picked). Is it because I did not hardcode all the samples to a single utterance embedding?

samples_vctkp240_200steps.zip

ghost commented 4 years ago

Single-speaker finetuning using VCTK dataset: samples_vctkp240.zip

Here are some samples from the latest experiment. VCTK p240 is used to add 4.4k steps to the synthesizer, and 1.0k to the vocoder. Synthesized audios have filename speaker_utterance_SYN_VOC.wav and use all combinations of pretrained ("pre") and finetuned ("fin") models for the synthesizer and vocoder, respectively.

Synthesized utterances using speaker p240's hardcoded embedding (derived from p240_001_mic1.flac) show the success in finetuning to match the voice, including the accent. Samples made from speaker p260's embedding demonstrate how much quality is lost when finetuning a single-speaker model.

In these samples, the synthesizer has far more impact on quality, though this result could be due to insufficient finetuning of the vocoder. Though the finetuned vocoder has only a slight advantage over the original for p240, it severely degrades voice cloning quality for p260.

Also compare to the samples for p240 and p260 in the Google SV2TTS paper: https://google.github.io/tacotron/publications/speaker_adaptation/

Replicating this experiment

Here is a preprocessed p240 dataset if you would like to repeat this experiment. The embeds for utterances 002-380 are overwritten with the one for 001, as the hardcoding makes for a more consistent result. Use the audio file p240_001.flac to generate embeddings for inference. The audios are not included to keep the file size down, so if you care to do vocoder training you will need to get and preprocess VCTK.

Directions:

  1. Copy the folder synthesizer/saved_models/logs-pretrained to logs-vctkp240 in the same location. This will make a copy of your pretrained model to be finetuned.
  2. Unzip the dataset files to dataset_p240 in your Real-Time-Voice-Cloning folder (or somewhere else if you desire)
  3. Train the model: python synthesizer_train.py vctkp240 dataset_p240/SV2TTS/synthesizer --checkpoint_interval 100
  4. Let it run for 200 to 400 iterations, then stop the program.
    • This should complete in a reasonable amount of time even on CPU.
    • You can safely stop and resume training at any time, though you will lose any progress since the last checkpoint
  5. Test the finetuned model in the toolbox using dataset_p240/p240_001.flac to generate the embedding
mbdash commented 4 years ago

Wow that is amazing... I only asked your opinion and you actually did it!

The difference is incredible.

Now I just need to dumb down all you wrote to be able to reproduce it.

Also try your_input_text.replace('hi', 'eye'); it is a little cheat that I find gives better results currently, at least in the multi-speaker model.

ghost commented 4 years ago

Now I just need to dumb down all you wrote to be able to reproduce it.

@mbdash In the first post I included a dropbox link that has fairly detailed instructions for the single-speaker LibriSpeech example. You can try that and ask if you have any trouble reproducing the results. If you want VCTKp240 I can make a zip file for you tomorrow.

This was much easier and faster than expected. I am sharing the results to generate interest, so we can collaborate on how much training is needed, best values of hparams, etc.

mbdash commented 4 years ago

Thank you, I will look at it tomorrow morning. I am only staying up for a few more minutes; I am a bit too tired to think straight right now.

Tonight I am trying to keep it simple and see if I can jam a regular "hand modeled" 3D head mesh into VOCA (Voice Operated Character Animation, another GitHub project).

Update: nope it exploded.

ghost commented 4 years ago

Some general observations to share:

  1. Finetuning improves both quality and similarity with the target voice, and transfers accent.
  2. Decent single-speaker models require as little as 5 min of audio and 400 steps of synthesizer training.
  3. Finetuning the vocoder is not as impactful as finetuning the synthesizer. In fact, given the quality limitations of the underlying models (see #411), I would not bother with additional vocoder training.

Also I did another experiment and trained the synthesizer for about 5,000 additional steps on the entire VCTK dataset (trying to help out on #388). The accent still does not transfer for zero-shot cloning. I suspect the synthesizer needs to be trained from scratch if that is the goal.

P.S. @mbdash I updated the VCTKp240 post with a single-speaker dataset if you would like to try that out. https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-663308789

ghost commented 4 years ago

Also I did another experiment and trained the synthesizer for about 5,000 additional steps on the entire VCTK dataset (trying to help out on #388). The accent still does not transfer for zero-shot cloning. I suspect the synthesizer needs to be trained from scratch if that is the goal.

Changing my mind on training from scratch: I think we just need to add an extra input parameter to the synthesizer which indicates the accent, or more accurately, the dataset that it is trained on. A simple implementation might be a single bit representing LibriSpeech or VCTK. Next, finetune the existing models on VCTK with the added parameter. Then, for inference, specify the dataset that you want the result to sound like. I'm at a loss how to implement this with the current set of models, but I think this repo will have clues: https://github.com/Tomiinek/Multilingual_Text_to_Speech

I'm all done with accent experiments for now but I hope this is helpful to anyone who wants to continue this work.

Adam-Mortimer commented 4 years ago

"I am going to write a tool that will allow users to manually select or record files to add to a custom dataset, and facilitate transcription (maybe using DeepSpeech). This tool will be hosted in a separate repository."

Thank you for all your hard work on this repo - even as an almost complete newcomer to deep learning, I've been able to decipher some things, but I'm still stymied by the inability to create custom datasets from scratch. Are you still working on this "custom dataset" tool that you mention here?

Ori-Pixel commented 4 years ago

@blue-fish any reason why I'm getting the following error: "synthesizer_train.py: error: the following arguments are required: synthesizer_root"? I'm trying to run:

synthesizer_train.py H:\ttss\Real-Time-Voice-Cloning-master\dataset_p240\SV2TTS\synthesizer --checkpoint_interval 100

The second argument is the folder that contains embeds, mels, and train.txt.

Nevermind, I fixed it while writing this. The argument isn't --synthesizer_root like all of the other arguments, but just synthesizer_root. Also, the above testing instructions are thus wrong (or at least not working for me). The command should be:

python synthesizer_train.py synthesizer_root dataset_p240/SV2TTS/synthesizer --checkpoint_interval 100

(it at least bumped me to a dll error - still working through that one)

ghost commented 4 years ago

I'm still stymied by the inability to create custom datasets from scratch. Are you still working on this "custom dataset" tool that you mention here?

Hi @Adam-Mortimer. The custom dataset tool is still planned, but currently on hold as I've just started working on #447 (switching out the synthesizer for fatchord's tacotron). #447 will be bigger than all of my existing pull requests combined if it ever gets finished. In other words, it's going to take quite some time.

I started writing the custom dataset tool for a voice cloning experiment. I didn't get very far with the tool before I added LibriTTS support in #441 which made it much easier to create a dataset by putting your data in this kind of directory structure:

datasets_root
    * LibriTTS
        * train-clean-100
            * speaker-001
                * book-001
                    * utterance-001.wav
                    * utterance-001.txt
                    * utterance-002.wav
                    * utterance-002.txt
                    * utterance-003.wav
                    * utterance-003.txt

Where each utterance-###.wav is a short utterance (2-10 sec) and the utterance-###.txt contains the corresponding transcript. Then you can process this dataset using:

python synthesizer_preprocess_audio.py datasets_root --datasets_name LibriTTS --subfolders train-clean-100 --no_alignments

When this completes, your dataset is in the SV2TTS format and subsequent preprocessing commands (synthesizer_preprocess_embeds.py, vocoder_preprocess.py) will work as described on the training wiki page.
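
For anyone assembling such a dataset by hand, here is a minimal sketch (not an official script) that copies a flat folder of wav/txt pairs into the directory structure above; the source and target paths are assumptions, adjust them to your setup:

    # Sketch: lay out wav/transcript pairs in the LibriTTS-style structure.
    from pathlib import Path
    import shutil

    source = Path("my_recordings")  # flat folder containing foo.wav next to foo.txt
    target = Path("datasets_root/LibriTTS/train-clean-100/speaker-001/book-001")
    target.mkdir(parents=True, exist_ok=True)

    for i, wav in enumerate(sorted(source.glob("*.wav"))):
        txt = wav.with_suffix(".txt")  # transcript with the same basename
        shutil.copyfile(wav, target / f"utterance-{i:03d}.wav")
        shutil.copyfile(txt, target / f"utterance-{i:03d}.txt")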

I would still like to write the custom dataset tool but I think #447 is a more pressing matter since the toolbox is incompatible with Python 3.8 due to our reliance on Tensorflow 1.x.

ghost commented 4 years ago

@Ori-Pixel There was a problem with my command and I fixed it. If you are following everything to the letter it should be:

python synthesizer_train.py vctkp240 dataset_p240/SV2TTS/synthesizer --checkpoint_interval 100

Where the first arg vctkp240 describes the path to the model you are training (in this case, it tells python to look for the model in synthesizer/saved_models/logs-vctkp240), and the second arg is the path to the location containing train.txt, and the mels and embeds folders. Please share your results and feel free to ask for help if you get stuck.

Ori-Pixel commented 4 years ago

@blue-fish Thanks. Yeah, I can see that it's saving to a new directory; I'll run it again with the correct params and post results.

Also, thanks for the preprocessing tips you gave to @Adam-Mortimer. I was not looking forward to custom labeling, but it doesn't seem that bad if I only have ~200 lines/~34 minutes. I'm trying to make a fake (semi-Gaelic) accent video game character say some lines, so I'll probably scrape the audio files from the wiki site, slap them into a folder structure like the one above with a simple script, and then run this single-speaker fine-tuning again. And for the accent, I think I can just find a semi-close one in the VCTK dataset (although a 10 GB download will take me a few days, sadly).

ghost commented 4 years ago

@Ori-Pixel If you have a GPU you can quickly run a few experiments to see how far you can trim the dataset before the audio quality breaks down. Simply delete lines from train.txt and they won't be used.
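
As an example, a minimal sketch of trimming train.txt this way (the path and the number of utterances to keep are placeholders):

    # Sketch: keep only the first n_keep utterances of an SV2TTS train.txt.
    from pathlib import Path
    import shutil

    train_txt = Path("datasets_root/SV2TTS/synthesizer/train.txt")
    n_keep = 80

    shutil.copyfile(train_txt, train_txt.parent / "train_full.txt")  # keep a backup
    lines = train_txt.read_text(encoding="utf-8").splitlines(keepends=True)
    train_txt.write_text("".join(lines[:n_keep]), encoding="utf-8")  # dropped lines are ignored by training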

One of my experiments involved re-recording some of the VCTK p240 utterances with a different voice. 5 minutes of mediocre data (80 utterances) still resulted in a half-decent model. If the labeling is extremely tedious you can try training a model on part of it while continuing to label.

I have preprocessed VCTK, if you can make your decision based on a single recording, request up to 3 speakers and I'll put them on dropbox for you. https://www.dropbox.com/s/6ve00tjjaab4aqj/VCTK_samples.zip?dl=0

ghost commented 4 years ago

Oh, and just to be clear, you cannot train the voice and accent independently at this time. The accent is associated with the voice via the speaker embedding. After #447, we will work on #230 to add the Mozilla TTS implementation of GSTs. That should allow us to generalize accents to new voices.

Ori-Pixel commented 4 years ago

@blue-fish

I have preprocessed VCTK, if you can make your decision based on a single recording, request up to 3 speakers and I'll put them on dropbox for you. https://www.dropbox.com/s/6ve00tjjaab4aqj/VCTK_samples.zip?dl=0

Is there a list of their speakers somewhere? I was only able to find the 10 GB file, with not even a magnet link or anything denoting samples or file structure. I mean, realistically, anything Irish, Scottish, or Gaelic would work. I may also look into downloading it directly to Drive (if possible) and even possibly training there (if possible -- as far as I'm aware you can mount the drive and run bash).

Oh, and just to be clear, you cannot train the voice and accent independently at this time. The accent is associated with the voice via the speaker embedding. After #447, we will work on #230 to add the Mozilla TTS implementation of GSTs. That should allow us to generalize accents to new voices.

Yeah, I just meant using a VCTK-pretrained model that wasn't horribly inconsistent with my single speaker's accent and then fine-tuning with my custom-labeled lines on top.

I also have a couple of idle GPUs in my machine, but I always run into venv issues with GPU training, so I'll just use Colab if I really need a GPU. Too bad downloading from a link to

ghost commented 4 years ago

Is there a list of their speakers somewhere?

The zip file I uploaded includes speaker-log.txt (which is also included in the full VCTK dataset), listing speaker metadata such as:

ID    AGE  GENDER  ACCENTS  REGION            COMMENTS
p225  23   F       English  Southern England
p226  22   M       English  Surrey
p227  38   M       English  Cumbria
p228  22   F       English  Southern England

Ori-Pixel commented 4 years ago

Ah I see. I'll give it a look tomorrow along with the results and let you know then, thanks again for being so active!

Ori-Pixel commented 4 years ago

@blue-fish p261 is relatively close. If I could get that slice, that would be very helpful (my internet at my current house is sadly 1 MB/s).

I trained as per the instructions above. Sadly, I didn't get to see the console output as my power went out after about an hour or so, but I did get this in the training logs, so I think this is as far as it trained.

[2020-07-30 01:36:31.676] Step 278202 [28.894 sec/step, loss=0.64379, avg_loss=0.64339]

Also, just to make sure I did the test, this is the cmd I used:

python demo_toolbox.py -d H:\ttss\Real-Time-Voice-Cloning-master\dataset_p240

Where random seed = 1, enhanced vocoder output is checked, embedding was from p240_1.flac.

resulting audio: https://raw.githubusercontent.com/Ori-Pixel/files/master/welcome_to_toolbox_fine_tuned.flac

ghost commented 4 years ago

@Ori-Pixel

Here is the dataset in the same format as p240 (embeds overwritten with the one corresponding to p261_001.flac): https://www.dropbox.com/s/o6fz2r6w56djwkf/dataset_p261.zip?dl=0

resulting audio: https://raw.githubusercontent.com/Ori-Pixel/files/master/welcome_to_toolbox_fine_tuned.flac

Your results sound American to me. Check that you are using the new synthesizer model, then try this text: Take a look at these pages for crooked creek drive. And compare to my results for 200 steps: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-663050677

Ori-Pixel commented 4 years ago

Check that you are using the new synthesizer model

Ah, I didn't have that drop down selected. My results are then this, with the same settings:

https://raw.githubusercontent.com/Ori-Pixel/files/master/take%20a%20look%20at%20these%20pages%20for%20crooked%20creek%20drive%20fine%20tuned.flac

I'm also following your comment above and trying to train on my own dataset, but at first I got a "datasets_root folder doesn't exist" error, so I made the folder and added my files. But when I go to preprocess, I get:

Arguments:
datasets_root:   datasets_root
out_dir:         datasets_root\SV2TTS\synthesizer
n_processes:     None
skip_existing:   False
hparams:
no_alignments:   False
datasets_name:   LibriTTS
subfolders:      train-clean-100
Using data from:
datasets_root\LibriTTS\train-clean-100
LibriTTS:   0%|                                                                            | 0/1 [00:00<?, ?speakers/s]2

gpu warnings here

LibriTTS: 100%|████████████████████████████████████████████████████████████████████| 1/1 [00:11<00:00, 11.20s/speakers]
The dataset consists of 0 utterances, 0 mel frames, 0 audio timesteps (0.00 hours).
Traceback (most recent call last):
  File "synthesizer_preprocess_audio.py", line 59, in <module>
    preprocess_dataset(**vars(args))
  File "H:\ttss\Real-Time-Voice-Cloning-master\synthesizer\preprocess.py", line 49, in preprocess_dataset
    print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence


Utterances have just the text that was spoken in them, so utterance-000.txt contains Let's have some fun, shall we...

edit: I assume I will need to go through the training docs and start by training the encoder?

ghost commented 4 years ago

@Ori-Pixel You also need to add the --no_alignments option to use a non-LibriSpeech dataset that doesn't have an alignments file. I've also fixed the command in the instructions above. Sorry for leaving that out earlier.

python synthesizer_preprocess_audio.py datasets_root --datasets_name LibriTTS --subfolders train-clean-100 --no_alignments

Edit: If preprocessing completes without finding a wav file, we should remind the user to pass the --no_alignments flag. Or possibly default it to True if the datasets_name is not LibriSpeech.

Ori-Pixel commented 4 years ago

@blue-fish Okay, so I got it to train, and I can also train my own dataset for the synthesizer. Really thankful for the help. Here's a result from 200 steps of training if you're interested:

https://raw.githubusercontent.com/Ori-Pixel/files/master/crooked_creek_dw.flac

https://raw.githubusercontent.com/Ori-Pixel/files/master/biggest_oversight.flac

ghost commented 4 years ago

@Ori-Pixel Nice! It's remarkable how much that voice comes through after 200 steps of finetuning. In my own experiments going up to 400 steps yields a noticeable improvement in the voice quality. More than 400 doesn't seem to help, though it doesn't hurt either.

Edit: You trained on CPU right? How long did it take?

Ori-Pixel commented 4 years ago

@blue-fish I did train on CPU (autocorrect!!) (I always have issues with GPU setup. Luckily I'm building a new PC when the 30xx cards drop with the new Zen 2 AMD CPUs). After trying to train from 200 to 400 steps, it would seem that it takes ~25 s per step after the first 20 steps, so around 2 hours for 200 steps on an i5 4690K.

The next steps for me would be encoder/vocoder training, but I don't want to invest the compute power since I'm working on another NLP problem for my actual research (sentiment analysis). I'll let it run overnight again and this time see how far it gets :)

edit: as @blue-fish said, it seems training it to 400 steps made a large difference. Here's an example of the same voice as above, but with 400 steps of training the p261 set on my own collected voice samples:

original voice: https://raw.githubusercontent.com/Ori-Pixel/files/master/Vo_dark_willow_sylph_attack_14.mp3
200 steps: https://raw.githubusercontent.com/Ori-Pixel/files/master/biggest_oversight.flac
400 steps: https://raw.githubusercontent.com/Ori-Pixel/files/master/dark%20willow%20400.flac

adfost commented 4 years ago

@blue-fish I did exactly what you said; after over 10,000 steps with the synthesizer, I try to open the toolbox. I type the text to convert into the box, and I get some unrelated output in an almost incomprehensible ramble.

ghost commented 4 years ago

@adfost Which set of instructions are you following? LibriSpeech (https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issue-663639627) or VCTKp240 (https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-663308789)?

Most likely, when you run synthesizer_train.py it cannot find the pretrained model so it starts training a new synthesizer model from scratch. Please make sure you copied the entire contents of synthesizer/saved_models/logs-pretrained to another "logs-XXXX" folder in the same location, and specify the name (XXXX) to synthesizer_train.py as the first argument.
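
As a sketch, the copy step looks like this, assuming the default folder layout (replace XXXX with the name you then pass to synthesizer_train.py):

    # Sketch: duplicate the pretrained synthesizer checkpoints into a new run folder.
    import shutil

    shutil.copytree("synthesizer/saved_models/logs-pretrained",
                    "synthesizer/saved_models/logs-XXXX")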

DereWah commented 4 years ago

I have a set of 5-second-long flac files from a single speaker. Is it possible to train on it without any transcripts?

Also, I noticed that in this procedure the encoder training and the audio/embedding preprocessing (python encoder_preprocess.py) are completely skipped, without leaving any data to restrict synthesizer training. [I'm new to deep learning and this is what I have "deciphered" from this guide. Maybe I've gotten something wrong, idk.]

I'm trying to train on a voice (not from LibriSpeech)

Thank you

I guess transcripts are needed. I wrote a script that does it automatically; I set it so the transcript is in the same format as 211-122425.trans.txt.
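
For illustration only (this is not DereWah's actual script), a minimal sketch of building a LibriSpeech-style trans.txt from per-utterance .txt transcripts, assuming one transcript file per audio file in a chapter folder:

    # Sketch: collect per-utterance transcripts into a single XX-YYYYY.trans.txt.
    from pathlib import Path

    chapter_dir = Path("LibriSpeech/train-clean-100/211/122425")  # example chapter folder
    entries = []
    for txt in sorted(chapter_dir.glob("*.txt")):
        if txt.name.endswith(".trans.txt"):
            continue  # skip an existing transcript index
        text = txt.read_text(encoding="utf-8").strip().upper()  # LibriSpeech transcripts are uppercase
        entries.append(f"{txt.stem} {text}")

    (chapter_dir / "211-122425.trans.txt").write_text("\n".join(entries) + "\n", encoding="utf-8")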

DereWah commented 4 years ago

I am getting this error while preprocessing the audio for my own dataset (not from LibriSpeech, it has no alignments):

Traceback (most recent call last):
  File "synthesizer_preprocess_audio.py", line 59, in <module>
    preprocess_dataset(**vars(args))
  File "D:\Python\Real-Time-Voice-Cloning\synthesizer\preprocess.py", line 49, in preprocess_dataset
    print("Max input length (text chars): %d" % max(len(m[5]) for m in metadata))
ValueError: max() arg is an empty sequence

ghost commented 4 years ago

@DereWah As you reasoned, the transcripts are necessary to train the synthesizer (which you can think of as a black box that converts text to mels). Make your folder in the same structure as https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-666099538 and it should work. In the future, please also include the python command that was run to facilitate troubleshooting. Good luck!

Ori-Pixel commented 4 years ago

@DereWah check this comment

I think the --no_alignments option will fix this error, since I had it as well in the comment above. Also add your console/terminal command.

DereWah commented 4 years ago

This is the command I am using:

python synthesizer_preprocess_audio.py synthesizer/saved_models/logs-singlespeaker/datasets_root -n 1 --no_alignment

Using no_alignment isn't changing anything. Also, I am trying to run this on my GPU; I have cudart64_100.dll and I have installed requirements_gpu.txt.

There must be a problem with generating the metadata; I think it will be fixed by making the folder structure the same as in https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issuecomment-666099538

DereWah commented 4 years ago

Also, how should the utterance .txt files be formatted? Thank you for the help and the patience.

Ori-Pixel commented 4 years ago

@DereWah Mine only have the text of the spoken line: utterance-000.txt contains Mireska is here. and utterance-001.txt contains Are you ready to have some fun, ya?, etc.

DereWah commented 4 years ago

It worked, I managed to generate those files into datasets_root\SV2TTS\synthesizer. Thank you :)

DereWah commented 4 years ago

I am getting this error:

Traceback (most recent call last):
  File "synthesizer_train.py", line 55, in <module>
    tacotron_train(args, log_dir, hparams)
  File "D:\Python\Real-Time-Voice-Cloning\synthesizer\train.py", line 392, in tacotron_train
    return train(log_dir, args, hparams)
  File "D:\Python\Real-Time-Voice-Cloning\synthesizer\train.py", line 144, in train
    feeder = Feeder(coord, metadata_fpath, hparams)
  File "D:\Python\Real-Time-Voice-Cloning\synthesizer\feeder.py", line 28, in __init__
    with open(metadata_filename, encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'dataset_root/SV2TTS/synthesizer\\train.txt'

For some reason there are two backslashes (\\) before the file name.

This happens while running: python synthesizer_train.py first_run dataset_root/SV2TTS/synthesizer --checkpoint_interval 100

DereWah commented 4 years ago

Fixed by using and editing the command in the readme ("Now the pretrained synthesizer model can be trained on the reduced dataset. No changes to hparams are needed."):

python synthesizer_train.py first_run synthesizer/saved_models/logs-first_run/datasets_root/SV2TTS/synthesizer --summary_interval 125 --checkpoint_interval 100

It is now saying No model to load at synthesizer/saved_models/logs-first_run\taco_pretrained

Generated 0 test batches of size 36 in 0.000 sec
Generated 64 train batches of size 36 in 137.217 sec
etc.

I guess it's training.

I noticed though that my GPU (1060 Ti) is at 0% while the CPU (8th-gen i7) is doing the work. Based on those specs, which one should I run the training on (to optimize time versus power consumption)?

Ori-Pixel commented 4 years ago

@DereWah

It is now saying No model to load at synthesizer/saved_models/logs-first_run\taco_pretrained

Make sure your logs-first_run folder is a copy of logs-pretrained, or you won't get a good output. That folder holds the checkpoints it loads and then finetunes on top of.

edit: You also don't have TensorFlow GPU set up/running if your CPU is the one doing the processing. You should be getting a DLL warning of some sort indicating that you aren't using the GPU. GPU is faster, but it isn't unreasonable to train to 400 steps on CPU.

ghost commented 4 years ago

I noticed tho that my GPU (1060Ti ) is at 0%, while the CPU (i7+ 8th gen) is doing the work.

The TensorFlow 1.15 binaries provided by pip are only compatible with CUDA 10.0. GPU support for the synthesizer (the only part that relies on TensorFlow) requires a matching NVIDIA driver version and CUDA libraries. You will lose much more time setting that up than you stand to gain in speedup for those 400 steps. (If it is 10 sec/step with the i7 and 1 sec/step with the GPU, the speedup would only save an hour over 400 steps.)

DereWah commented 4 years ago

I think I have already set up CUDA 10.0, and when I run it, TensorFlow prints a success message about opening cudart64_100.dll. I have also followed this tutorial: https://poorlydocumented.com/2019/11/installing-corentinjs-real-time-voice-cloning-project-on-windows-10-from-scratch/ and I have cudnn64_7.dll. I just don't know how to activate the GPU. Also, from training with only 13 and a half minutes, did you get mediocre or good results?

Ori-Pixel commented 4 years ago

@DereWah With --checkpoint_interval 100 in the CLI instruction above, the model only saves a checkpoint every 100 steps. So if you cancel training after a fixed amount of time and it hasn't reached a checkpoint and saved, that progress is lost. You just need to check back with the terminal every once in a while and see whether it has saved. The recommended amount from blue-fish is at least 200 steps (2 checkpoint saves).

adfost commented 4 years ago

@blue-fish Thank you for the help, I got it to work. However, the model I trained seems to work much better with longer inputs than with shorter phrases. I think the problem is the lack of shorter examples in the training set I used. Any suggestions for better training sets?

ghost commented 4 years ago

@Ori-Pixel Thanks for helping others get this to work.

@adfost In general the toolbox performs best for inputs of 10-20 words. What problems are you noticing? They might be inherited from the synthesizer architecture and pretrained models.

adfost commented 4 years ago

@blue-fish If I say something like "Thank you for using our product", it says "thank you" followed by a very long pause, and then the rest of the sentence.

DereWah commented 4 years ago

I meant only 13 and a half minutes of dataset audio, not of overall training time.

ghost commented 4 years ago

@adfost The gaps in spectrograms are a known issue (#53). I would also like to fix this but it's not easy: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/364#issuecomment-653996443 . To work around the issue, you can use the "enhance vocoder output" option to trim the silences if you have webrtcvad installed.

@DereWah I've finetuned a voice using as little as 5 minutes of data. I would consider this to be the bare minimum, and 10-20 minutes to be generally adequate. Of course it also depends on how varied the dataset is, and how closely your target speaker matches one in the training set. The purpose of this issue is to discuss best practices for finetuning, please share what works well for you.

DereWah commented 4 years ago

@DereWah I've finetuned a voice using as little as 5 minutes of data. I would consider this to be the bare minimum, and 10-20 minutes to be generally adequate. Of course it also depends on how varied the dataset is, and how closely your target speaker matches one in the training set. The purpose of this issue is to discuss best practices for finetuning, please share what works well for you.

Thank you

Also, do you know how to activate the GPU instead of the CPU? I have the correct CUDA versions and files. Thank you.

adfost commented 4 years ago

@blue-fish Where is the enhance vocoder output option?

ghost commented 4 years ago

@DereWah I have been unsuccessful in my own attempts to get Tensorflow GPU support, so I can't help you there. GPU support for pytorch is much easier. We'll have it for the synthesizer following #447 (if it ever gets done).

@adfost The "enhance vocoder output" feature is enabled when webrtcvad is installed. For demo_cli.py, it is always active. If using demo_toolbox.py, click the checkbox on the right side of the toolbox UI: https://github.com/CorentinJ/Real-Time-Voice-Cloning/pull/432#issuecomment-660940390

DereWah commented 4 years ago

@DereWah I have been unsuccessful in my own attempts to get Tensorflow GPU support, so I can't help you there. GPU support for pytorch is much easier. We'll have it for the synthesizer following #447 (if it ever gets done).

Understood. I'll just go with the CPU. Also, while talking with a friend about the project, we found out we got totally different outputs while trying it. He was using only the CLI on a Colab and getting .wav files without much noise or distortion.

Instead, I was using the toolbox on my PC and getting distorted and "doubled" results. Both of us were using the same voices for the embedding (not from LibriSpeech).

Maybe the CLI gives better results? Or maybe I'm having issues with the PC, idk.

(My files remained like that even while using the replay function.)

ghost commented 4 years ago

You probably don't have webrtcvad on the PC unless you went out of your way to install it. We took it out of requirements.txt because it was causing grief (#375). But when it is installed, it cleans the audio before the speaker embed is made.

As for the quality of the saved wavs, the toolbox and the CLI both use soundfile.write() with resampling enabled. There could be some platform differences in libsndfile.

If you have some examples of the "distorted and doubled" wavs please open a new issue so we can investigate.