# PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS
Abstract: Previous pitch-controllable text-to-speech (TTS) models rely on directly modeling fundamental frequency, leading to low variance in synthesized speech. To address this issue, we propose PITS, an end-to-end pitch-controllable TTS model that utilizes variational inference to model pitch. Based on VITS, PITS incorporates the Yingram encoder, the Yingram decoder, and adversarial training of pitch-shifted synthesis to achieve pitch-controllability. Experiments demonstrate that PITS generates high-quality speech that is indistinguishable from ground truth speech and has high pitch-controllability without quality degradation. Code and audio samples will be available at https://github.com/anonymous-pits/pits.
## News
- Training code is uploaded.
- Demo and checkpoint are uploaded at Hugging Face Space 🤗.
- Audio samples are uploaded at github.io.
- For pitch-shifted inference, we now uniformly use the scope-shift notation, s, instead of pitch-shift.
- Voice conversion samples are uploaded.
- Accepted to the ICML 2023 Workshop on Structured Probabilistic Inference & Generative Modeling.
## Docker
Build the Docker image:
```sh
docker build -t=pits .
```
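Below is a minimal sketch of launching the built image for training; the mount paths, container working directory, and GPU options are illustrative assumptions, not taken from the original instructions.

```sh
# Illustrative only: mount the repository and a dataset directory and request all GPUs.
# /workspace and /data are assumed paths, not prescribed by the repository.
docker run -it --rm --gpus all \
  -v "$(pwd)":/workspace \
  -v /path/to/dataset:/data \
  pits bash
```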
## Data
Audio must be in 22050 Hz, 16 bit, .wav format. Some issues report training failures for other sampling rates, and we do not guarantee training for other sampling rates. Place the transcripts in the `text` folder.
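If your corpus does not already match this format, one common way to convert it is with ffmpeg. This is a suggested preprocessing sketch, not part of the repository; the input and output paths are placeholders.

```sh
# Resample every .wav to 22050 Hz, 16-bit PCM (paths are placeholders).
mkdir -p /path/to/processed
for f in /path/to/raw/*.wav; do
  ffmpeg -i "$f" -ar 22050 -acodec pcm_s16le "/path/to/processed/$(basename "$f")"
done
```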
## Training
Train with `train.py`; check `train.py` for the detailed options.
```sh
python train.py -c configs/config_en.yaml -m {MODEL_NAME}
```
Add the `-i` flag if you change the Yingram setup, etc.
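For example, a concrete invocation might look like the following; the model name `pits_en` is an arbitrary placeholder, and `-i` is passed only per the note above, assuming the Yingram setup was changed.

```sh
# Placeholder model name; -i is included here only because the Yingram
# configuration is assumed to have been modified (see the note above).
python train.py -c configs/config_en.yaml -m pits_en -i
```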
## Demo
Demo and checkpoint are uploaded at Hugging Face Space 🤗.
We are currently working on a Dockerfile for running the demo locally. Please wait for it.