Closed — pilnyjakub closed this issue 3 years ago
Instead of updating the preprocessing scripts to support other datasets, we should write programs to reformat the datasets so they resemble LibriSpeech in folder structure.
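As an illustration of that folder structure, the target path for each utterance can be derived programmatically. This is a minimal sketch; the helper name and the fixed `train-clean-100` subset are assumptions, not part of the repo:

```python
from pathlib import Path

def librispeech_path(root, speaker_id, chapter_id, utterance_index, ext=".flac"):
    """Build a LibriSpeech-style destination path:
    <root>/train-clean-100/<speaker>/<chapter>/<speaker>-<chapter>-<utt>.flac
    """
    name = f"{speaker_id}-{chapter_id}-{utterance_index:04d}{ext}"
    return Path(root) / "train-clean-100" / str(speaker_id) / str(chapter_id) / name
```

A conversion script would walk the source dataset, map each clip through a function like this, and copy or symlink the audio into place.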
I agree with blue-fish! I wanted to use the M-AILABS Speech Dataset in Italian. Is there an implementation for it?
@blue-fish So you want the data structure to be LibriSpeech/train-clean-100/speakers...
and then contain data from other datasets? Otherwise you'd have to edit those files (technically only `encoder/config.py` and `synthesizer/symbols.py`).
@pilnyjakub I got this error: Error opening 'D:\data\cloning\dataset\cv-corpus-7.0-2021-07-21\it\audio\train1\000507c663409bb1796c90a583f52209f5e08870240424f8eb0ac861e194a97c07bfe6ece88a7129d4c51bc8a8895d61516a2dad5a89bdc45c8a216c0207bb49\0\common_voice_it_20041619.mp3': File contains data in an unknown format.
Do I have to convert everything to flac? It seems quite expensive.
@ireneb612 This is probably due to an older version of librosa, try upgrading it or try one of these https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/395, https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/214, https://github.com/CorentinJ/Real-Time-Voice-Cloning/issues/198
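If upgrading librosa doesn't help, converting the clips up front is an option. A minimal sketch that builds the ffmpeg command for one file (this assumes ffmpeg is installed and on PATH; the function name is hypothetical):

```python
from pathlib import Path

def flac_conversion_cmd(mp3_path, sample_rate=16000):
    """Build an ffmpeg command converting one .mp3 to a 16 kHz mono .flac.

    Execute it with subprocess.run(cmd, check=True); ffmpeg must be on PATH.
    """
    src = Path(mp3_path)
    dst = src.with_suffix(".flac")  # same location, .flac extension
    return ["ffmpeg", "-y", "-i", str(src),
            "-ar", str(sample_rate),  # resample to the model's 16 kHz
            "-ac", "1",               # downmix to mono
            str(dst)]
```

Looping this over `Path(root).rglob("*.mp3")` converts the whole corpus; it is slow but only has to run once.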
@pilnyjakub I transformed all the files to flac format. I also ran into a problem with this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6 audio5: character maps to <undefined>
This is because the encoding for the Italian language is "latin-1". After changing that, `synthesizer_preprocess_audio.py` seems to work!
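The encoding fix can be demonstrated in isolation. A self-contained sketch (the sample text is illustrative, not from the dataset):

```python
import tempfile
from pathlib import Path

# Write a transcript containing Italian accented characters in latin-1.
tmp = Path(tempfile.mkdtemp()) / "transcript.txt"
tmp.write_text("perché città più", encoding="latin-1")

# Reading with the wrong codec raises UnicodeDecodeError; passing the
# dataset's actual encoding fixes it.
try:
    tmp.read_text(encoding="utf-8")
except UnicodeDecodeError:
    pass  # expected: latin-1 bytes like 0xE9 are not valid UTF-8
text = tmp.read_text(encoding="latin-1")
```

The same `encoding=` argument goes into every `open()` call in the preprocessing scripts that reads your transcripts.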
@blue-fish So you want the data structure to be
LibriSpeech/train-clean-100/speakers...
and then contain data from other datasets?
Yes, that's what I have in mind.
Ok, will update.
When running vocoder_preprocess.py I'm running into this error:
Arguments: datasets_root: D:\data\cloning\dataset\cv-corpus-7.0-2021-07-21\it\ model_dir: synthesizer\saved_models\italian_1\ hparams: no_trim: False cpu: False
{'allow_clipping_in_normalization': True, 'clip_mels_length': True, 'fmax': 7600, 'fmin': 55, 'griffin_lim_iters': 60, 'hop_size': 200, 'max_abs_value': 4.0, 'max_mel_frames': 900, 'min_level_db': -100, 'n_fft': 800, 'num_mels': 80, 'power': 1.5, 'preemphasis': 0.97, 'preemphasize': True, 'ref_level_db': 20, 'rescale': True, 'rescaling_max': 0.9, 'sample_rate': 16000, 'signal_normalization': True, 'silence_min_duration_split': 0.4, 'speaker_embedding_size': 256, 'symmetric_mels': True, 'synthesis_batch_size': 16, 'trim_silence': True, 'tts_cleaner_names': ['transliteration_cleaners'], 'tts_clip_grad_norm': 1.0, 'tts_decoder_dims': 128, 'tts_dropout': 0.5, 'tts_embed_dims': 512, 'tts_encoder_K': 5, 'tts_encoder_dims': 256, 'tts_eval_interval': 500, 'tts_eval_num_samples': 1, 'tts_lstm_dims': 1024, 'tts_num_highways': 4, 'tts_postnet_K': 5, 'tts_postnet_dims': 512, 'tts_schedule': [(2, 0.001, 20000, 12), (2, 0.0005, 40000, 12), (2, 0.0002, 80000, 12), (2, 0.0001, 160000, 12), (2, 3e-05, 320000, 12), (2, 1e-05, 640000, 12)], 'tts_stop_threshold': -3.4, 'use_lws': False, 'utterance_min_duration': 1.6, 'win_size': 800} Synthesizer using device: cpu Trainable Parameters: 30.870M
Loading weights at synthesizer\saved_models\italian_1\italian_1.pt
Tacotron weights loaded from step 25000
Using inputs from:
D:\data\cloning\dataset\cv-corpus-7.0-2021-07-21\it\SV2TTS\synthesizer\train.txt
D:\data\cloning\dataset\cv-corpus-7.0-2021-07-21\it\SV2TTS\synthesizer\mels
D:\data\cloning\dataset\cv-corpus-7.0-2021-07-21\it\SV2TTS\synthesizer\embeds
Found 161835 samples
Traceback (most recent call last):
File "D:\PycharmProjects\Realtime\vocoder_preprocess.py", line 59, in
I have tried to find a way to avoid using the lambda function, but I could not find one. Please help me deal with this issue.
@ireneb612 Please open a new issue to ask for help.
I'm going to close this issue as we will no longer provide support for any datasets not already incorporated into the repo. The end user is expected to reformat any dataset to resemble LibriSpeech/LibriTTS in structure.
@pilnyjakub If you would like to share dataset conversion scripts, please open a new issue.
Preprocess Data
I've written a script to preprocess Mozilla Common Voice.
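A sketch of what such a conversion could look like — hypothetical, but it assumes the standard Common Voice `validated.tsv` columns (`client_id`, `path`, `sentence`):

```python
import csv
import io

def plan_moves(tsv_text, root="LibriSpeech/train-clean-100"):
    """Map Common Voice validated.tsv rows to a LibriSpeech-like layout.

    Each client_id becomes a "speaker" directory; the chapter level is
    collapsed to a single "0" directory (an assumption, not a CV concept).
    Returns (src_clip, dst_path, sentence) tuples; the caller moves the
    files and writes the transcripts.
    """
    moves = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        speaker = row["client_id"][:12]  # shorten the very long hash
        dst = f"{root}/{speaker}/0/{row['path']}"
        moves.append((row["path"], dst, row["sentence"]))
    return moves
```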
Also, when running `synthesizer_preprocess_audio.py`, include the `--no_alignments` argument. Of course you can use your own dataset, but don't forget the data structure (based on LibriSpeech):
I recommend converting audio files to `.flac` for better program functionality.

Editing Files
Encoder
When using another audio format, change `.flac` in `encoder/preprocess.py` to your format.

Synthesizer
When using another audio format, add the value to `extensions` in `synthesizer/preprocess.py` (formats recognized by default: `.wav`, `.flac`, `.mp3`).
If you've trained a custom encoder model, run `synthesizer_preprocess_embeds.py -e encoder/saved_models/<run_name>.pt`
Wherever you're opening files, use the same encoding as your dataset, e.g. `encoding="utf-8"` (`synthesizer/preprocess.py`, `synthesizer/train.py`).
`synthesizer/hparams.py`
`synthesizer/symbols.py`
Add characters used by your language.

Vocoder
If you've trained a custom synthesizer model, run `vocoder_preprocess.py --model_dir synthesizer/saved_models/<run_name>/`
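Returning to the `synthesizer/symbols.py` note above, the edit for Italian can be sketched like this (the base character string is an assumption for illustration — check it against your checkout rather than copying this verbatim):

```python
# Extend the symbol set with the accented letters Italian needs.
_pad = "_"
_characters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!'(),-.:;? "
_italian = "àèéìòù" + "ÀÈÉÌÒÙ"  # added: Italian accented vowels

symbols = [_pad] + list(_characters + _italian)

# Every character the text cleaners can emit must appear in `symbols`,
# otherwise preprocessing will fail on (or silently drop) unseen characters.
assert "è" in symbols and "ù" in symbols
```

The same reasoning applies to any language: run your cleaners over a sample of transcripts, collect the distinct characters, and make sure each one is in `symbols`.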