CorentinJ / Real-Time-Voice-Cloning

Clone a voice in 5 seconds to generate arbitrary speech in real-time

Training Tutorial #819

Closed pilnyjakub closed 3 years ago

pilnyjakub commented 3 years ago

Preprocess Data

I've written a script to preprocess Mozilla Common Voice.

Also, when running synthesizer_preprocess_audio.py, include the --no_alignments argument.

Of course you can use your own dataset, but keep to the expected data structure (based on LibriSpeech):

<datasets_root>
    LibriSpeech
        train-clean-100
            <speaker>
                <book_id>
                    <utterance>
                    <utterance>
            <speaker>
                <book_id>
                    <utterance>
                <book_id>
                    <utterance>

I recommend converting the audio files to .flac so they work with the preprocessing scripts without further changes.
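For illustration only, here is a minimal sketch of this kind of conversion (not the script mentioned above). It assumes Common Voice 7.0's validated.tsv columns (client_id, path, sentence), clips stored under clips/, and ffmpeg on PATH; it also assumes the --no_alignments path of synthesizer/preprocess.py picks up a .txt transcript sitting next to each audio file:

"""Sketch: reshape Mozilla Common Voice into a LibriSpeech-like layout.
Assumptions: Common Voice 7.0 column names (client_id, path, sentence),
clips stored under <cv_root>/clips, and ffmpeg available on PATH."""
import csv
import subprocess
from collections import defaultdict
from pathlib import Path

CV_ROOT = Path(r"D:\datasets\cv-corpus-7.0-2021-07-21\it")   # hypothetical path
OUT_ROOT = CV_ROOT / "LibriSpeech" / "train-clean-100"       # target layout from above


def main():
    # Group utterances by speaker (client_id)
    speakers = defaultdict(list)
    with open(CV_ROOT / "validated.tsv", encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            speakers[row["client_id"]].append((row["path"], row["sentence"]))

    for spk_idx, clips in enumerate(speakers.values()):
        # One dummy "book" per speaker is enough to match the expected folder depth
        book_dir = OUT_ROOT / f"speaker_{spk_idx:05d}" / "0"
        book_dir.mkdir(parents=True, exist_ok=True)
        for utt_idx, (clip_name, sentence) in enumerate(clips):
            src = CV_ROOT / "clips" / clip_name
            dst = book_dir / f"speaker_{spk_idx:05d}-0-{utt_idx:04d}.flac"
            # mp3 -> 16 kHz mono flac (requires ffmpeg on PATH)
            subprocess.run(["ffmpeg", "-y", "-loglevel", "error", "-i", str(src),
                            "-ar", "16000", "-ac", "1", str(dst)], check=True)
            # Transcript next to the audio, where --no_alignments preprocessing looks for it
            dst.with_suffix(".txt").write_text(sentence, encoding="utf-8")


if __name__ == "__main__":
    main()

The 16 kHz mono conversion matches the sample_rate in synthesizer/hparams.py.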

Editing Files

Encoder

When using another audio format, change .flac in encoder/preprocess.py to your format.

Synthesizer

When using another audio format, add its extension to the extensions list in synthesizer/preprocess.py (formats recognized by default: .wav, .flac, .mp3).
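As an illustration (the exact surrounding code may differ between versions of the repo), the list would end up looking something like this after adding .ogg support:

# synthesizer/preprocess.py (--no_alignments branch), illustrative only
extensions = ["*.wav", "*.flac", "*.mp3", "*.ogg"]   # "*.ogg" added for an OGG dataset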

If you've trained a custom encoder model, run synthesizer_preprocess_embeds.py -e encoder/saved_models/<run_name>.pt

Wherever you're opening files, use the same encoding as your dataset, e.g. encoding="utf-8" (synthesizer/preprocess.py, synthesizer/train.py).
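For example (the file and variable names here are only illustrative, not the repo's exact ones):

from pathlib import Path

metadata_fpath = Path("train.txt")  # hypothetical metadata file from your dataset
# Match the encoding of your dataset: "utf-8" here, "latin-1" for some
# Western European corpora (see the Italian example further down).
with open(metadata_fpath, "r", encoding="utf-8") as metadata_file:
    metadata = [line.strip().split("|") for line in metadata_file]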

In synthesizer/hparams.py, set the text cleaner for your language, e.g.:

tts_cleaner_names = ["basic_cleaners"],

In synthesizer/symbols.py, add the characters used by your language.
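For Italian, for instance, that could look roughly like this (a sketch; the actual file contents may differ between versions):

# synthesizer/symbols.py (sketch, not the exact upstream file)
_pad = "_"
_eos = "~"
# Base Latin characters plus punctuation, extended with the accented vowels
# used in Italian so they are no longer treated as unknown symbols:
_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!'(),-.:;? àèéìòù"

# All symbols the synthesizer is allowed to emit
symbols = [_pad, _eos] + list(_characters)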

Vocoder

If you've trained a custom synthesizer model, run vocoder_preprocess.py --model_dir synthesizer/saved_models/<run_name>/

ghost commented 3 years ago

Instead of updating the preprocessing scripts to support other datasets, we should write programs to reformat the datasets so they resemble LibriSpeech in folder structure.

ireneb612 commented 3 years ago

I agree with blue-fish! I wanted to use the M-AILABS Speech Dataset in Italian. Is there an implementation for it?

pilnyjakub commented 3 years ago

@blue-fish So you want the data structure to be LibriSpeech/train-clean-100/speakers... and then contain data from other datasets? Otherwise you'd have to edit those files (technically only encoder/config.py and synthesizer/symbols.py).

ireneb612 commented 3 years ago

@pilnyjakub I got this error: Error opening 'D:\data\cloning\dataset\cv-corpus-7.0-2021-07-21\it\audio\train1\000507c663409bb1796c90a583f52209f5e08870240424f8eb0ac861e194a97c07bfe6ece88a7129d4c51bc8a8895d61516a2dad5a89bdc45c8a216c0207bb49\0\common_voice_it_20041619.mp3': File contains data in an unknown format.

Do I have to convert everything to .flac? It seems quite expensive.

pilnyjakub commented 3 years ago

@ireneb612 This is probably due to an older version of librosa; try upgrading it, or try one of these: https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/395, https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/214, https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/198

ireneb612 commented 3 years ago

@pilnyjakub I transformed all the files to .flac format. I also got some problems with this error:

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6: character maps to <undefined>

This is because the encoding for the Italian dataset is "latin-1". After changing that, synthesizer_preprocess_audio.py seems to work!

ghost commented 3 years ago

@blue-fish So you want the data structure to be LibriSpeech/train-clean-100/speakers... and then contain data from other datasets?

Yes, that's what I have in mind.

pilnyjakub commented 3 years ago

Ok, will update.

ireneb612 commented 3 years ago

When running vocoder_preprocess.py I'm running into this error:

Arguments:
    datasets_root:   D:\data\cloning\dataset\cv-corpus-7.0-2021-07-21\it\
    model_dir:       synthesizer\saved_models\italian_1\
    hparams:
    no_trim:         False
    cpu:             False

{'allow_clipping_in_normalization': True, 'clip_mels_length': True, 'fmax': 7600, 'fmin': 55, 'griffin_lim_iters': 60, 'hop_size': 200, 'max_abs_value': 4.0, 'max_mel_frames': 900, 'min_level_db': -100, 'n_fft': 800, 'num_mels': 80, 'power': 1.5, 'preemphasis': 0.97, 'preemphasize': True, 'ref_level_db': 20, 'rescale': True, 'rescaling_max': 0.9, 'sample_rate': 16000, 'signal_normalization': True, 'silence_min_duration_split': 0.4, 'speaker_embedding_size': 256, 'symmetric_mels': True, 'synthesis_batch_size': 16, 'trim_silence': True, 'tts_cleaner_names': ['transliteration_cleaners'], 'tts_clip_grad_norm': 1.0, 'tts_decoder_dims': 128, 'tts_dropout': 0.5, 'tts_embed_dims': 512, 'tts_encoder_K': 5, 'tts_encoder_dims': 256, 'tts_eval_interval': 500, 'tts_eval_num_samples': 1, 'tts_lstm_dims': 1024, 'tts_num_highways': 4, 'tts_postnet_K': 5, 'tts_postnet_dims': 512, 'tts_schedule': [(2, 0.001, 20000, 12), (2, 0.0005, 40000, 12), (2, 0.0002, 80000, 12), (2, 0.0001, 160000, 12), (2, 3e-05, 320000, 12), (2, 1e-05, 640000, 12)], 'tts_stop_threshold': -3.4, 'use_lws': False, 'utterance_min_duration': 1.6, 'win_size': 800}
Synthesizer using device: cpu
Trainable Parameters: 30.870M

Loading weights at synthesizer\saved_models\italian_1\italian_1.pt
Tacotron weights loaded from step 25000
Using inputs from:
    D:\data\cloning\dataset\cv-corpus-7.0-2021-07-21\it\SV2TTS\synthesizer\train.txt
    D:\data\cloning\dataset\cv-corpus-7.0-2021-07-21\it\SV2TTS\synthesizer\mels
    D:\data\cloning\dataset\cv-corpus-7.0-2021-07-21\it\SV2TTS\synthesizer\embeds
Found 161835 samples
Traceback (most recent call last):
  File "D:\PycharmProjects\Realtime\vocoder_preprocess.py", line 59, in <module>
    run_synthesis(args.in_dir, args.out_dir, args.model_dir, modified_hp)
  File "D:\PycharmProjects\Realtime\synthesizer\synthesize.py", line 75, in run_synthesis
    for i, (texts, mels, embeds, idx) in tqdm(enumerate(data_loader), total=len(data_loader)):
  File "D:\PycharmProjects\Realtime\venv\lib\site-packages\torch\utils\data\dataloader.py", line 359, in __iter__
    return self._get_iterator()
  File "D:\PycharmProjects\Realtime\venv\lib\site-packages\torch\utils\data\dataloader.py", line 305, in _get_iterator
    return _MultiProcessingDataLoaderIter(self)
  File "D:\PycharmProjects\Realtime\venv\lib\site-packages\torch\utils\data\dataloader.py", line 918, in __init__
    w.start()
  File "C:\Users\UserPC.LAPTOP-F0JUUKDE\AppData\Local\Programs\Python\Python39\lib\multiprocessing\process.py", line 121, in start
    self._popen = self._Popen(self)
  File "C:\Users\UserPC.LAPTOP-F0JUUKDE\AppData\Local\Programs\Python\Python39\lib\multiprocessing\context.py", line 224, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "C:\Users\UserPC.LAPTOP-F0JUUKDE\AppData\Local\Programs\Python\Python39\lib\multiprocessing\context.py", line 327, in _Popen
    return Popen(process_obj)
  File "C:\Users\UserPC.LAPTOP-F0JUUKDE\AppData\Local\Programs\Python\Python39\lib\multiprocessing\popen_spawn_win32.py", line 93, in __init__
    reduction.dump(process_obj, to_child)
  File "C:\Users\UserPC.LAPTOP-F0JUUKDE\AppData\Local\Programs\Python\Python39\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'run_synthesis.<locals>.<lambda>'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Users\UserPC.LAPTOP-F0JUUKDE\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "C:\Users\UserPC.LAPTOP-F0JUUKDE\AppData\Local\Programs\Python\Python39\lib\multiprocessing\spawn.py", line 126, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

I have tried to find a way to avoid using the lambda function, but I could not find one. Please help me deal with this issue.
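The crash happens because Windows starts DataLoader workers via spawn, which has to pickle the collate_fn, and a lambda created inside run_synthesis cannot be pickled. One common workaround is to pass a functools.partial over a module-level collate function instead of a lambda, or simply to set num_workers=0. Below is a minimal, self-contained sketch of the partial approach; DummyMelDataset and the collate signature are made up for illustration and are not the repo's actual code:

from functools import partial

from torch.utils.data import DataLoader, Dataset


class DummyMelDataset(Dataset):
    """Stand-in for the synthesizer dataset; yields (text, mel, embed, index) tuples."""
    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return f"text {idx}", [0.0] * 10, [0.0] * 256, idx


def collate_synthesizer(batch, r, hparams):
    """Module-level stand-in for the repo's collate function; unlike a lambda
    closing over r and hparams, a partial of this function is picklable."""
    return batch


if __name__ == "__main__":
    hparams = None  # placeholder for the real hparams object
    data_loader = DataLoader(
        DummyMelDataset(),
        batch_size=2,
        num_workers=2,  # worker processes force pickling of collate_fn on Windows
        collate_fn=partial(collate_synthesizer, r=2, hparams=hparams),
    )
    for batch in data_loader:
        print(batch)

Setting num_workers=0 in the same DataLoader call is the simpler fix, at the cost of slower data loading.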

ghost commented 3 years ago

@ireneb612 Please open a new issue to ask for help.

ghost commented 3 years ago

I'm going to close this issue as we will no longer provide support for any datasets not already incorporated into the repo. The end user is expected to reformat any dataset to resemble LibriSpeech/LibriTTS in structure.

@pilnyjakub If you would like to share dataset conversion scripts, please open a new issue.