This is the official implementation of the CTX-vec2wav vocoder in the AAAI-2024 paper UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding.
See also: The official implementation of CTX-txt2vec, the acoustic model with contextual VQ-diffusion, proposed in the paper.
This repo is tested with Python 3.9 on Linux. You can set up the environment with conda:

```bash
# Install required packages
conda create -n ctxv2w python=3.9  # or any name you like
conda activate ctxv2w
pip install -r requirements.txt

# Then, set PATH and PYTHONPATH
source path.sh  # change the env name if you don't use "ctxv2w"
```
The scripts in `utils/` should be executable; you can run `chmod +x utils/*` to ensure this. The following process also requires the `bash` and `perl` commands in your Linux environment.
For utterances that are already registered in `data/`, inference (VQ index + acoustic prompt) can be done by

```bash
bash run.sh --stage 3 --stop_stage 3
# You can specify the dataset to be constructed by "--eval_set $which_set", e.g. "--eval_set dev_all"
```
You can also create a subset and perform inference on it:

```bash
subset_data_dir.sh data/eval_all 200 data/eval_subset  # randomly select 200 utts from data/eval_all
bash run.sh --stage 3 --stop_stage 3 --eval_set "eval_subset"
```

The program loads the latest checkpoint in the experiment dir, `exp/train_all_ctxv2w.v1/*pkl`.
💡Note: stage 3 in `run.sh` automatically selects a prompt for each utterance at random (see `local/build_prompt_feat.py`).
You can customize this process and perform inference yourself:

1. Prepare `feats.scp`, which specifies each utterance (for inference) with its VQ index sequence in `(L, 2)` shape (see the sketch after this list for one way to write the scp files).
2. Generate the frame counts: `feat-to-len.py scp:/path/to/feats.scp > /path/to/utt2num_frames`.
3. Prepare `prompt.scp`, which specifies each utterance with its acoustic (mel) prompt in `(L', 80)` shape.
4. Run `decode.py` (you might change the sampling rate):

```bash
decode.py \
    --sampling-rate 16000 \
    --feats-scp /path/to/feats.scp \
    --prompt-scp /path/to/prompt.scp \
    --num-frames /path/to/utt2num_frames \
    --config /path/to/config.yaml \
    --vq-codebook /path/to/codebook.npy \
    --checkpoint /path/to/checkpoint \
    --outdir /path/to/output/wav
```
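Below is a minimal sketch of one way to write `feats.scp` and `prompt.scp` in Kaldi ark/scp format with `kaldiio`. The package choice, the dummy array contents, and storing the integer VQ indices as a float matrix are assumptions for illustration, not requirements of this repo:

```python
# Hypothetical sketch: write feats.scp (VQ indices, shape (L, 2)) and
# prompt.scp (mel prompts, shape (L', 80)) as Kaldi ark/scp pairs.
import numpy as np
from kaldiio import WriteHelper

utts = {
    "utt_0001": {
        "vq": np.random.randint(0, 320, size=(250, 2)),          # dummy (L, 2) VQ index sequence
        "prompt": np.random.randn(150, 80).astype(np.float32),   # dummy (L', 80) mel prompt
    },
}

with WriteHelper("ark,scp:feats.ark,feats.scp") as feat_writer, \
     WriteHelper("ark,scp:prompt.ark,prompt.scp") as prompt_writer:
    for utt_id, item in utts.items():
        # Kaldi ark stores float matrices, so the integer indices are cast here (assumption).
        feat_writer(utt_id, item["vq"].astype(np.float32))
        prompt_writer(utt_id, item["prompt"])
```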
First, you need to properly construct the `data` and `feats` directories. Please check out data_prep for details.
💡Note: here we provide the 16kHz version of the model and data. The original paper uses 24kHz data, which was accomplished by keeping the features extracted at 16kHz and increasing the `upsample_scales` in the config yaml.
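As a quick sanity check on that change, the product of `upsample_scales` has to match the number of output waveform samples per feature frame. The 10 ms frame hop assumed below is for illustration only and is not taken from the config:

```python
# Rough check (assumption: 10 ms frame hop for the 16kHz-extracted features).
# The product of upsample_scales must equal the waveform samples generated per frame.
hop_ms = 10
for sr in (16000, 24000):
    print(f"{sr} Hz -> {sr * hop_ms // 1000} samples per frame")
# 16000 Hz -> 160 samples per frame
# 24000 Hz -> 240 samples per frame, hence the larger upsample_scales for 24kHz output.
```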
Then, training on LibriTTS (all training partitions) can be done by

```bash
bash run.sh --stage 2 --stop_stage 2
# You can provide a different config file by --conf $your_config
# Check out run.sh for all the parameters. You can specify every bash variable there as "--key value" in the CLI.
```

This will create `exp/train_all_ctxv2w.v1` for logging. The script automatically handles multi-GPU training if you specify the `$CUDA_VISIBLE_DEVICES` environment variable.
We release two versions of the model parameters (generator) trained on the LibriTTS train-all set, corresponding to two sampling rates of the target waveforms. Note that the acoustic features (fbank+ppe) are all extracted from 16kHz waveforms; the only difference is the upsample rate in the HiFi-GAN generator.

The usage is the same as in the "Inference" section. Feel free to use these checkpoints for vocoding!
CMVN file: in order to perform inference on out-of-set utterances, we provide the `cmvn.ark` file here. You should extract the mel-spectrogram, normalize it with that file (statistics computed on LibriTTS), and then feed the result to the model.
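For example, applying those CMVN statistics could look like the sketch below. It assumes `cmvn.ark` holds global Kaldi CMVN statistics of shape `(2, dim+1)` readable with `kaldiio`, that your mel-spectrogram is an `(L, 80)` numpy array, and that the file names are placeholders:

```python
# Hypothetical sketch: mean/variance-normalize an (L, 80) mel-spectrogram
# with global Kaldi CMVN statistics before feeding it to the vocoder.
import numpy as np
import kaldiio

stats = kaldiio.load_mat("cmvn.ark")       # shape (2, 81) for 80-dim features
count = stats[0, -1]                       # total frame count
mean = stats[0, :-1] / count
var = stats[1, :-1] / count - mean ** 2
std = np.sqrt(np.maximum(var, 1e-20))

mel = np.load("mel.npy")                   # your extracted (L, 80) mel-spectrogram
mel_norm = (mel - mean) / std              # use this as the model input / prompt
```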
During the development, other repositories were referred to for the following components:

- `ctx_vec2wav/models/conformer`
- `utils/`
```bibtex
@article{du2023unicats,
  title={UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding},
  author={Du, Chenpeng and Guo, Yiwei and Shen, Feiyu and Liu, Zhijun and Liang, Zheng and Chen, Xie and Wang, Shuai and Zhang, Hui and Yu, Kai},
  journal={arXiv preprint arXiv:2306.07547},
  year={2023}
}
```