CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Report on Single Voice Training Results #832

Closed: Tomcattwo closed this issue 3 years ago

Tomcattwo commented 3 years ago

Hello @blue-fish and all, I am running the demo_toolbox on Win10, under Anaconda3 (run as administrator), in the VoiceClone env, using an NVIDIA GeForce RTX 2070 Super on an EVGA 08G-P4-3172-KR card (8 GB GDDR6), with Python 3.7 and the Win10/CUDA 11.1 build of PyTorch, and with all other requirements met. The toolbox GUI (demo_toolbox.py) works fine on this setup.

My project is to use the toolbox to clone 15 voices from a computer simulation, one voice at a time, so that I can add new voice material (.wav files) in those voices back into the sim, using the Single Voice method described in Issue #437: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/437#issue-663639627

Method: For proof of principle, I built a custom single voice dataset for one of those voices, using:

  1. the instructions in Issue #437,
  2. the README.TXT file from the zip file provided by @blue-fish in #437, and
  3. the direction from #437 for developing the folder structure in LibriTTS format.

This dataset (which I called V12F, a female voice) consists of 329 utterances (1.wav through 329.wav and 1.txt through 329.txt, about 45 minutes of speech in 3-12 second clips), arranged using the LibriTTS method and schema in the folder ...dataset_root\LibriTTS\train-clean-100\speaker_012\book-001\
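For reference, the resulting layout looks like this (only the first and last utterance/transcript pairs are shown):

```
dataset_root\
  LibriTTS\
    train-clean-100\
      speaker_012\
        book-001\
          1.wav
          1.txt
          ...
          329.wav
          329.txt
```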

Preprocessing Audio and Embeds: I was able to successfully preprocess the single-voice data using synthesizer_preprocess_audio.py and synthesizer_preprocess_embeds.py, which produced the expected audio, mel, and embed data plus train.txt, written to "...datasets_root\SV2TTS\synthesizer" in the folders "audio", "embeds", and "mels", as it should.
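For anyone repeating this, the two preprocessing calls look roughly like the following (positional arguments only; depending on your repo version you may need extra flags for the LibriTTS-style layout, so check each script's --help):

```
python synthesizer_preprocess_audio.py datasets_root
python synthesizer_preprocess_embeds.py datasets_root/SV2TTS/synthesizer
```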

First Synthesizer Training Attempt (V12F): I then conducted synthesizer training, after correcting several issues (see below), for 20,000 steps from scratch (about 12 hours). Based on the information saved during training, I could tell that the predicted output wavs and mels were a good representation of the speaker. Loss started at ~4.9 and was at 0.125 when the 20,000 steps completed. I then ran the model in the toolbox; the result was garbled output.

Issues encountered and Solutions:

  1. The README instructed the use of --summary_interval 125 --checkpoint_interval 100 as arguments for synthesizer_train.py, but the script would not accept them. Aside from the required run_id and syn_id (and the optional model_id and force_restart), the only arguments accepted were --save_every XX (or -s SAVE_EVERY), --backup_every XXX (or -b BACKUP_EVERY), and --hparams. So I used --save_every 100 as my argument (the full command is sketched after this list).

  2. Win10 "pickle" issue: synthesizer_train.py failed with "AttributeError: Can't pickle local object 'train..'", which I corrected as outlined in issue #669, using the code implemented in blue-fish@89a9964. This fix worked perfectly for me.
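For reference, the from-scratch training run was invoked along these lines (the run_id here is illustrative):

```
python synthesizer_train.py V12F datasets_root/SV2TTS/synthesizer --save_every 100
```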

Second Synthesizer Training on V12F: I conducted synthesizer training again, this time on top of "pretrained.pt" (already trained on LibriSpeech to 295K steps), for 2000 steps. Starting from the LibriSpeech pretrained.pt, loss began at 0.4974 (295K steps) and rapidly went below 0.3200 (295,500 steps). I trained 2000 single-voice steps (a bit over an hour), for a total of 297,000 steps, at ~0.45 steps/sec with batch size 12, r=2. Final loss was 0.2776. The synthesizer's output mels, attention plots, and .wavs looked and sounded good, based on the samples saved during training.

Toolbox Testing on V12F: The single-voice fine-tuned synthesizer worked very well in the toolbox after loading ~20 random utterances from the V12F dataset for embeddings, using "Enhance vocoder" and the Griffin-Lim vocoder output. I was quite pleased with the results. Audio samples are available here:

http://danforthhouse.com/files/V12F_Voice Samples.zip

Synthesizer Training on V13M: I then trained on a second single-voice dataset (V13M), a male voice. For V13M, I put the original LibriSpeech 295K-step pretrained synthesizer file "pretrained.pt" into the folder ...\synthesizer\saved_models\V13M_LS_pretrained, renamed it to "V13M_LS_pretrained.pt", and referenced it in the command this way:

```
python synthesizer_train.py V13M_LS_pretrained datasets_root/SV2TTS/synthesizer --save_every 100
```

and trained it for ~2000 steps on top of the pretrained.pt LibriSpeech 295K synthesizer. V13M training ran a bit faster (avg 0.55 steps/sec, ~1.85 sec per step, about 1944 steps per hour), likely because fewer samples were used than for V12F. Of the 329 total samples, only 325 were used for V13M; because he (V13M) talks faster than she (V12F) does, it looks like 4 were dropped for being too short in duration for the hparams to pass them. V12F used 328 of 329 (I think one utterance was too long for the params). Both V12F and V13M use the same, or nearly the same, text phrases.

Loss on V13M started at 0.4909 and rapidly fell to 0.3542 by the end of the 10th epoch. At 1000 steps, loss = 0.3075; at 2000 steps, loss = 0.2940. The run took just over an hour on the GPU with batch size 12, r=2.

Toolbox Testing on V13M: V13M did not vocode as well as V12F. His voice sounded more like he had a bad sore throat, and his slight southern (US) accent did not come through clearly. The voice was recognizable, but not nearly as good in quality as V12F. Voice samples are available here:

http://danforthhouse.com/files/V13M_Voice Samples.zip

Vocoder training: I also attempted vocoder preprocessing with the single-voice fine-tuned synthesizer and ran into several issues, which I will report in a new issue; I could not get vocoder_preprocess.py to work properly. Overall, after learning how to properly do single-voice training, I was pleased with the output. It can be a lot better, but it should be fine for the purposes of my project. Any recommendations on improving the voice quality (especially of V13M) would be appreciated. Regards, Tomcattwo

ghost commented 3 years ago

The README instructed use of --summary_interval 125 --checkpoint_interval 100 as arguments for synthesizer_train.py. pytorch would not accept --summary_interval 125 --checkpoint_interval 100 arguments.

As you found, the instructions in #437 are for the old TensorFlow synthesizer that was in use at the time. The new code does not take an explicit command-line argument to evaluate every X steps; instead, that is set with hparams.tts_eval_interval, which can be overridden at the command line.

The equivalent new command line arguments are:

```
--hparams "tts_eval_interval=125" --save_every 100
```
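In a full invocation, that would look something like this (the run_id and dataset path are placeholders):

```
python synthesizer_train.py my_run datasets_root/SV2TTS/synthesizer --hparams "tts_eval_interval=125" --save_every 100
```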

Any recommendations on improving the voice quality (especially of V13M) would be appreciated.

Some things to try: more training data, improving the quality of the training data used for finetuning (remove low-quality and overly complex utterances from the training set), or starting with a better baseline model.

Tomcattwo commented 3 years ago

Thanks @blue-fish. Re: the suggestions:

  1. More training data: I used every utterance of that voice that the sim has (~1600 .wav files). Most are very short, and I had to combine several to get training utterances between 2 and 11 seconds (I generally joined 5 short .wavs into a single .wav; a sketch of that joining step is below). Not much more I can do there. In the toolbox, I can try loading a lot more of the voice (maybe 100 utterances) for embeddings before generating output.
  2. Improve the quality of the training data for finetuning (remove low-quality and overly complex utterances from the training set): I can see what can be done there. In the sim, this is an air traffic controller speaking over a radio, so the quality is not great to begin with: 16-bit mono PCM .wav files at 8 kHz (telephony) quality. Perhaps I can clean them up using GoldWave or Audacity.
  3. Start with a better baseline model: Which baseline pretrained model would you recommend over LibriSpeech 295k? Perhaps the LibriTTS taco_pretrained from #437? Any others available?
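For anyone doing the same kind of clip joining, it can be scripted with nothing but the standard library, along these lines (the paths are placeholders, and it assumes all clips share the same sample rate, channel count, and sample width, which is true for the sim's 8 kHz 16-bit mono files):

```python
import wave

def concat_wavs(in_paths, out_path):
    """Join several short PCM .wav clips into one training utterance."""
    frames = []
    params = None
    for path in in_paths:
        with wave.open(path, "rb") as clip:
            if params is None:
                params = clip.getparams()  # sample rate, width, channels, etc.
            frames.append(clip.readframes(clip.getnframes()))
    with wave.open(out_path, "wb") as out:
        out.setparams(params)  # the frame count is patched when the file is closed
        for chunk in frames:
            out.writeframes(chunk)

# e.g. five short sim clips -> one 3-12 second training utterance
concat_wavs(["raw/clip_001.wav", "raw/clip_002.wav", "raw/clip_003.wav",
             "raw/clip_004.wav", "raw/clip_005.wav"],
            "book-001/1.wav")
```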

I know you mentioned in #437 that you did not get much additional improvement from pretraining the vocoder. I am going to give that a try to see if I can get any better results; I haven't much to lose but some time.
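The sequence I plan to run is roughly the following (the run name is illustrative, and the exact flags differ between repo versions, so I will check each script's --help first):

```
# generate ground-truth-aligned (GTA) mels, pointing the script at the fine-tuned synthesizer checkpoint
python vocoder_preprocess.py datasets_root

# fine-tune the repo's pretrained vocoder on those GTA mels
python vocoder_train.py V13M_vocoder datasets_root
```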

In the toolbox, I am using "Enhance vocoder" and Griffin-Lim, which sounds much better than the pretrained vocoder. Any suggestions for an alternate vocoder? Thanks, TC2

Tomcattwo commented 3 years ago

I ran single-voice synthesizer training on top of the taco_pretrained synthesizer from @blue-fish's zip file in #437 for 4000 steps on my V13M data. When I used it in the toolbox, I got nothing but gibberish. I verified that it did NOT build from scratch: the training added ~4050 steps to the 20,200 already in the taco_pretrained LibriTTS synthesizer from the zip file, for a total of 24,250 steps.

I also finally got vocoder preprocessing and vocoder training done on the pretrained.pt vocoder from the original repo files (issue #833). I then used this vocoder in the toolbox along with my V13M synthesizer (fine-tuned on top of the original LibriSpeech synthesizer from the repo). After loading ~20 random V13M .wavs for embeddings, I tried some phrases and was very pleasantly surprised with the result. V13M lost the "sore throat" he had with Griffin-Lim, and the voice sounded good enough for use in my project. The output sounded better WITHOUT "Enhance vocoder output" than with it on.

The single-voice trained vocoder, used together with the single-voice trained synthesizer, produced a pretty decent output compared to the ground-truth wavs, and it was much better than using the single-voice fine-tuned synthesizer alone with Griffin-Lim or with the original pretrained vocoder. That said, I still have to do quite a bit of "phoneme manipulation" to get pronunciations closer to real speech, but this will work for my project. Thanks @blue-fish for your help on this work. This issue is ready to close. Regards, Tomcattwo