Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and show that the proposed methods are effective in improving naturalness, similarity of speech characteristics in a multi-speaker model, and efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method, which allows a fully end-to-end single-stage approach.
Demo: https://vits-2.github.io/demo/
Paper: https://arxiv.org/abs/2307.16430
Unofficial implementation of VITS2. This is a work in progress. Please refer to TODO for more details.
[Architecture comparison diagrams: Duration Predictor | Normalizing Flows | Text Encoder]
[In progress]
Audio sample after 52,000 steps of training on 1 GPU on the LJ Speech dataset: https://github.com/daniilrobnikov/vits2/assets/91742765/d769c77a-bd92-4732-96e7-ab53bf50d783
Clone the repo
git clone git@github.com:daniilrobnikov/vits2.git
cd vits2
This assumes you have navigated to the vits2 root directory after cloning it.
NOTE: This is tested under Python 3.11 with a conda env. For other Python versions, you might encounter version conflicts.
PyTorch 2.0 is used. Please refer to requirements.txt for the full list of required packages.
# install required packages (for pytorch 2.0)
conda create -n vits2 python=3.11
conda activate vits2
pip install -r requirements.txt
conda env config vars set PYTHONPATH="/path/to/vits2"
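To confirm the environment is ready before moving on, a quick sanity check like the following can help (this snippet is not part of the repo):

```python
# sanity check: verify the Python version, PyTorch version, and GPU visibility
import sys
import torch

print(f"Python : {sys.version.split()[0]}")            # expected: 3.11.x
print(f"PyTorch: {torch.__version__}")                 # expected: 2.x
print(f"CUDA   : {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU    : {torch.cuda.get_device_name(0)}")
```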
There are three options you can choose from: LJ Speech, VCTK, or a custom dataset.
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xvf LJSpeech-1.1.tar.bz2
cd LJSpeech-1.1
# keep the wavs/ directory; it is needed by the preprocessing and training steps below
python preprocess/mel_transform.py --data_dir /path/to/LJSpeech-1.1 -c datasets/ljs_base/config.yaml
Preprocess text. See prepare/filelists.ipynb; a rough sketch of the resulting filelist format is shown after these steps.
Rename or create a link to the dataset folder:
ln -s /path/to/LJSpeech-1.1 DUMMY1
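For orientation, here is a minimal sketch of how such filelists can be built from LJ Speech's metadata.csv. The pipe-separated `audio_path|text` layout, the split size, and the output paths (which mirror the filelist locations in the config example further below) are assumptions based on the original VITS filelists; prepare/filelists.ipynb is the authoritative pipeline and additionally applies the configured text cleaners.

```python
# minimal sketch: build pipe-separated "audio_path|text" filelists from LJ Speech's metadata.csv
# (layout, split size, and output paths are assumptions; prepare/filelists.ipynb also runs the text cleaners)
import csv
import random

rows = []
with open("LJSpeech-1.1/metadata.csv", encoding="utf-8", newline="") as f:
    for parts in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
        file_id, normalized_text = parts[0], parts[-1]
        rows.append(f"DUMMY1/wavs/{file_id}.wav|{normalized_text}")

random.seed(1234)
random.shuffle(rows)
val_size = 500  # illustrative split
with open("datasets/ljs_base/filelists/train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(rows[val_size:]) + "\n")
with open("datasets/ljs_base/filelists/val.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(rows[:val_size]) + "\n")
```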
wget https://datashare.is.ed.ac.uk/bitstream/handle/10283/3443/VCTK-Corpus-0.92.zip
unzip VCTK-Corpus-0.92.zip
(Optional) Downsample the audio files to 22050 Hz. See audio_resample.ipynb; a stand-alone sketch of this step is shown after these steps.
Preprocess mel-spectrograms. See mel_transform.py:
python preprocess/mel_transform.py --data_dir /path/to/VCTK-Corpus-0.92 -c datasets/vctk_base/config.yaml
Preprocess text. See prepare/filelists.ipynb
Rename or create a link to the dataset folder:
ln -s /path/to/VCTK-Corpus-0.92 DUMMY2
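The downsampling itself is handled in audio_resample.ipynb; as a stand-alone illustration, something along these lines with torchaudio would do the same job (the VCTK subfolder name and the .flac extension are assumptions for the 0.92 release layout):

```python
# illustrative resampling of VCTK audio to 22050 Hz with torchaudio
# (folder name and .flac extension assume the 0.92 release layout; audio_resample.ipynb is authoritative)
from pathlib import Path
import torchaudio
import torchaudio.transforms as T

TARGET_SR = 22050
data_dir = Path("/path/to/VCTK-Corpus-0.92/wav48_silence_trimmed")

resamplers = {}  # cache one resampler per source sample rate
for path in sorted(data_dir.rglob("*.flac")):
    wav, sr = torchaudio.load(str(path))
    if sr != TARGET_SR:
        if sr not in resamplers:
            resamplers[sr] = T.Resample(sr, TARGET_SR)
        wav = resamplers[sr](wav)
    torchaudio.save(str(path.with_suffix(".wav")), wav, TARGET_SR)
```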
Copy the ljs_base folder in the datasets directory and rename it to custom_base.
Edit config.yaml:
data:
  training_files: datasets/custom_base/filelists/train.txt
  validation_files: datasets/custom_base/filelists/val.txt
  text_cleaners: # See text/cleaners.py
    - phonemize_text
    - tokenize_text
    - add_bos_eos
  cleaned_text: true # True if you ran step 6.
  language: en-us # language of your dataset. See espeak-ng
  sample_rate: 22050 # sample rate, based on your dataset
  ...
  n_speakers: 0 # 0 for single speaker, > 0 for multi-speaker
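Before training, the edited file can be loaded with plain PyYAML to double-check the values (the repo's own config loader may wrap this differently):

```python
# quick check of the edited config with plain PyYAML
# (the repo's own config loading may differ; this only verifies the file parses and has the expected fields)
import yaml

with open("datasets/custom_base/config.yaml") as f:
    cfg = yaml.safe_load(f)

print(cfg["data"]["sample_rate"])    # should match your dataset, e.g. 22050
print(cfg["data"]["n_speakers"])     # 0 for single speaker, > 0 for multi-speaker
print(cfg["data"]["text_cleaners"])  # ['phonemize_text', 'tokenize_text', 'add_bos_eos']
```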
python preprocess/mel_transform.py --data_dir /path/to/custom_dataset -c datasets/custom_base/config.yaml
NOTE: You may need to install espeak-ng if you want to use the phonemize_text cleaner. Please refer to the espeak-ng documentation.
ln -s /path/to/custom_dataset DUMMY3
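The preprocess/mel_transform.py step above precomputes mel spectrograms for the dataset. For intuition only, here is a rough sketch of that computation with torchaudio; the STFT/mel parameters are common VITS-style defaults and are assumptions, and the script together with the dataset config is authoritative, including the output format.

```python
# rough sketch of mel-spectrogram precomputation with torchaudio
# (n_fft/hop_length/win_length/n_mels are common VITS-style defaults and are assumptions;
#  preprocess/mel_transform.py and the dataset config are authoritative, including the output format)
import torch
import torchaudio
import torchaudio.transforms as T

wav, sr = torchaudio.load("/path/to/custom_dataset/wavs/sample.wav")  # hypothetical file
assert sr == 22050, "resample first if the audio does not match the configured sample_rate"

mel_fn = T.MelSpectrogram(
    sample_rate=sr,
    n_fft=1024,
    win_length=1024,
    hop_length=256,
    n_mels=80,
)
mel = mel_fn(wav)
print(mel.shape)  # (channels, n_mels, frames)

torch.save(mel, "/path/to/custom_dataset/wavs/sample.mel.pt")  # hypothetical output path/extension
```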
# LJ Speech
python train.py -c datasets/ljs_base/config.yaml -m ljs_base
# VCTK
python train_ms.py -c datasets/vctk_base/config.yaml -m vctk_base
# Custom dataset (multi-speaker)
python train_ms.py -c datasets/custom_base/config.yaml -m custom_base
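Training curves and audio samples can be followed with TensorBoard. Assuming event files are written under a per-run directory named by the -m argument (as in the original VITS, e.g. logs/ljs_base), it can also be launched from Python:

```python
# launch TensorBoard from Python to follow training curves and audio samples
# (assumption: event files land under logs/<model name given to -m>, as in the original VITS)
from tensorboard import program

tb = program.TensorBoard()
tb.configure(argv=[None, "--logdir", "logs/ljs_base", "--port", "6006"])
print(f"TensorBoard listening on {tb.launch()}")
input("Press Enter to shut it down...")  # keep the process alive while browsing
```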
See inference.ipynb and inference_batch.ipynb
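For orientation, the single-utterance inference flow below follows the upstream VITS repo (jaywalnut310/vits) that VITS2 builds on. This fork uses YAML configs and different text cleaners, so every module and function name here should be treated as an assumption, with the notebooks as the authoritative reference.

```python
# single-utterance inference flow as in the upstream VITS repo (jaywalnut310/vits);
# this fork adapts it (YAML configs, different text cleaners), so see inference.ipynb for the exact code
import torch
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

def get_text(text, hps):
    seq = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        seq = commons.intersperse(seq, 0)  # insert blank tokens between symbols
    return torch.LongTensor(seq)

hps = utils.get_hparams_from_file("configs/ljs_base.json")  # upstream uses JSON configs
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    **hps.model,
).cuda().eval()
utils.load_checkpoint("/path/to/pretrained_ljs.pth", net_g, None)

x = get_text("VITS2 synthesizes natural speech.", hps).unsqueeze(0).cuda()
x_lengths = torch.LongTensor([x.size(1)]).cuda()
with torch.no_grad():
    audio = net_g.infer(x, x_lengths, noise_scale=0.667, noise_scale_w=0.8, length_scale=1.0)[0][0, 0].cpu().numpy()
```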
[In progress]