Implementation of "Transfer Learning from Monolingual ASR to Transcription-free Cross-lingual Voice Conversion."
We provide our pretrained monolingual uni-directional acoustic model (in the ppg/ directory) and speaker encoder (in the spk_embedder/ directory) for reproducing our multispeaker VC model. They may not produce the best possible results, but they are good enough for reproduction.
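Before training, you may want to confirm that the pretrained checkpoints load on your machine. Below is a minimal sketch, assuming they are ordinary PyTorch checkpoints; the file names are placeholders for whatever is actually shipped in ppg/ and spk_embedder/.

```python
# Minimal load test for the pretrained models; file names are placeholders.
import torch

ppg_ckpt = torch.load("ppg/acoustic_model.pt", map_location="cpu")
spk_ckpt = torch.load("spk_embedder/speaker_encoder.pt", map_location="cpu")

for name, ckpt in [("acoustic model", ppg_ckpt), ("speaker encoder", spk_ckpt)]:
    # Checkpoints are often dicts with a "state_dict" entry; fall back to the raw object.
    state = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt
    print(name, "loaded,", len(state) if hasattr(state, "__len__") else "?", "entries")
```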
All VC data are from the Voice Conversion Challenge 2020, and all generated speech was submitted to the challenge for listening evaluation, covering both intra-lingual and cross-lingual VC tasks.
Audio samples of our best model can be found here. For more details, please refer to our paper.
Clone this repository.
Obtain the data from VCC 2020. The "vcc2020_training" folder should contain 14 speakers, and the "vcc2020_evaluation" folder should contain 4 source speakers.
Prepare training data for the WaveGlow vocoder.
python prepare_h5.py --mode 0 -vcc "path_to_vcc2020_training"
This generates an h5 file that concatenates all the speech of each speaker.
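To sanity-check the result, the sketch below walks the datasets inside the generated file; the file name and internal layout are assumptions, not guarantees about what prepare_h5.py writes.

```python
# Inspect the h5 file produced by prepare_h5.py; file name is an assumption.
import h5py

def show(name, obj):
    # Print every dataset's path, shape, and dtype.
    if isinstance(obj, h5py.Dataset):
        print(name, obj.shape, obj.dtype)

with h5py.File("vcc2020_training.h5", "r") as f:
    f.visititems(show)
```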
Prepare training data for the conversion model.
python prepare_h5.py --mode 1
This converts the speech into input features, d-vectors, and mel-spectrograms.
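For reference, the sketch below shows how a mel-spectrogram of the kind used here might be computed for a single 24 kHz utterance; the STFT and mel parameters are common defaults, not necessarily the exact values used by prepare_h5.py.

```python
# Illustration only: log-mel-spectrogram extraction for one 24 kHz utterance.
import librosa
import numpy as np

wav, sr = librosa.load("example.wav", sr=24000)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
print(log_mel.shape)  # (80, n_frames)
```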
python train.py -c config_24k.json
Training takes a few days, so please be patient.
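If you want to adjust the WaveGlow training settings before launching train.py, they live in config_24k.json. A quick way to see what is available (this only reads the file; the exact key layout is whatever the repo ships):

```python
# Peek at the WaveGlow training configuration before editing it.
import json

with open("config_24k.json") as f:
    config = json.load(f)

# Print the top-level sections of the config.
print(list(config.keys()))
```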
Modify common/hparams_spk.py to set your desired checkpoint directory and hyperparameters. Note that "n_symbols" must be either 72 or 514, depending on which input feature you use; a quick way to verify the value against your prepared features is sketched after this step.
Run the training script.
python train_ppg2mel_spk.py
This training also takes a few days; we stopped between the 30k and 50k checkpoints.
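The correct "n_symbols" value should match the dimensionality of the PPG features produced in the data-preparation step. Below is a hypothetical check; the feature file name and dataset path are placeholders, not the actual layout written by prepare_h5.py.

```python
# Hypothetical sanity check: confirm the PPG feature dimension matches the
# "n_symbols" value (72 or 514) set in common/hparams_spk.py.
import h5py

with h5py.File("features.h5", "r") as f:          # placeholder file name
    ppg = f["SEF1/ppg"][:]                        # placeholder speaker/feature path
print("PPG feature dim:", ppg.shape[-1])          # expect 72 or 514
```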
python convert_speech_vcc.py -vcc "path_to_vcc2020_evaluation" -ch "checkpoint_of_conversion_model" -m "ppg_model_you_used" -wg "waveglow_checkpoint" -o "vcc2020_evaluation/output_directory/"
The converted wav files are written to the output directory and named "target_source_wavname.wav".
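A small sketch for iterating over the results and recovering the target/source speaker IDs from that naming scheme (the directory below is whatever you passed to -o):

```python
# List converted files and parse "target_source_wavname.wav" names.
from pathlib import Path

out_dir = Path("vcc2020_evaluation/output_directory")
for wav in sorted(out_dir.glob("*.wav")):
    target, source, utt = wav.stem.split("_", 2)
    print(f"{wav.name}: target={target}, source={source}, utterance={utt}")
```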