This repo contains the code accompanying the paper "Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis". It is implemented on top of ming024/FastSpeech2 (many thanks!).
2022-06-15 Update: This work has been accepted to Interspeech 2022.
pip3 install -r requirements.txt
Please refer to ming024/FastSpeech2 for more details.
For example, for AISHELL3, first run
python3 prepare_align.py config/AISHELL3/preprocess.yaml
Then download the TextGrid files, or run MFA yourself to align the corpus, and put the TextGrid files under your [PREPROCESSED_DATA_PATH], e.g., preprocessed_data/AISHELL3/TextGrid/.
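As a quick sanity check before preprocessing, the sketch below counts alignment files per speaker. It is a hypothetical helper, not part of this repo, and assumes the TextGrid/<speaker>/<utterance>.TextGrid layout used by ming024/FastSpeech2:

```python
from pathlib import Path

def check_textgrids(root: str) -> dict:
    """Count .TextGrid files per speaker directory under `root`.

    Assumes the layout TextGrid/<speaker>/<utterance>.TextGrid.
    """
    counts = {}
    for speaker_dir in sorted(Path(root).iterdir()):
        if speaker_dir.is_dir():
            counts[speaker_dir.name] = len(list(speaker_dir.glob("*.TextGrid")))
    return counts

if __name__ == "__main__":
    counts = check_textgrids("preprocessed_data/AISHELL3/TextGrid")
    missing = [s for s, n in counts.items() if n == 0]
    print(f"{sum(counts.values())} TextGrids across {len(counts)} speakers; "
          f"{len(missing)} speakers have none")
```

A speaker directory with zero TextGrids usually means alignment failed for that speaker and preprocessing will silently skip those utterances.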
Finally, run the preprocessing script
python3 preprocess.py config/AISHELL3/preprocess.yaml
Train the model
python3 train.py -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml
Note: if the PhnCls loss does not trend downward (or barely changes), try manually trimming the symbol dicts in text/symbols.py so that they contain only the phonemes relevant to your corpus; this makes the phoneme classification task better conditioned and may solve the problem.
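The trimming described above can be sketched as follows. This is a hypothetical helper, not code from the repo, and it assumes the symbol inventory is a flat Python list as in ming024/FastSpeech2's text/symbols.py:

```python
def restrict_symbols(symbols, corpus_phonemes, keep_specials=("_", "~")):
    """Keep only special tokens and phonemes that actually occur in the corpus.

    A smaller symbol inventory leaves the phoneme classifier (PhnCls) with no
    never-observed classes, which can help its loss decrease.
    """
    corpus_phonemes = set(corpus_phonemes)
    return [s for s in symbols if s in keep_specials or s in corpus_phonemes]

# Example: prune a mixed inventory down to the phonemes seen in training data.
full = ["_", "~", "a", "b", "zh", "sil", "AH0"]
seen = {"a", "zh", "sil"}
print(restrict_symbols(full, seen))  # → ['_', '~', 'a', 'zh', 'sil']
```

Note that changing the symbol list changes phoneme-ID assignments, so it must be done before training, not on a checkpoint trained with the full list.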
(Optional) Monitor training with TensorBoard
tensorboard --logdir output/log/AISHELL3
For batch inference, run
python3 synthesize.py --source synbatch_chinese.txt --restore_step 250000 --mode batch -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml
For single-utterance inference, run
# For Mandarin
python3 synthesize.py --text "清华大学人机语音交互实验室,聚焦人工智能场景下的智能语音交互技术研究。" --ref [REF_SPEECH_PATH.wav] --restore_step 250000 --mode single -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml
# For English
python3 synthesize.py --text "Human Computer Speech Interaction Lab at Tsinghua University, targets artificial intelligence technologies for smart voice user interface." --ref [REF_SPEECH_PATH.wav] --restore_step 250000 --mode single -p config/LibriTTS/preprocess.yaml -m config/LibriTTS/model.yaml -t config/LibriTTS/train.yaml
@misc{zhou2022content,
  title={Content-Dependent Fine-Grained Speaker Embedding for Zero-Shot Speaker Adaptation in Text-to-Speech Synthesis},
  author={Zhou, Yixuan and Song, Changhe and Li, Xiang and Zhang, Luwen and Wu, Zhiyong and Bian, Yanyao and Su, Dan and Meng, Helen},
  year={2022},
  eprint={2204.00990},
  archivePrefix={arXiv},
  primaryClass={eess.AS}
}