NTT123 / light-speed

A modified VITS that uses ground-truth phoneme durations for better robustness
MIT License

Light Speed ⚡

Light Speed ⚡ is an open-source text-to-speech model based on VITS, with some modifications:

Pretrained models and demos

We provide two pretrained models and demos:

FAQ

Q: How do I create training data?
A: See the ./prepare_ljs_tfdata.ipynb notebook for instructions on preparing the training data.

Q: How can I train the model with 1 GPU?
A: Run: python train.py

Q: How can I train the model with 4 GPUs?
A: Run: torchrun --standalone --nnodes=1 --nproc-per-node=4 train.py
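
For context, torchrun starts one process per GPU and tells each process its position through the environment variables RANK, LOCAL_RANK, and WORLD_SIZE. The sketch below shows one way a script can read them and fall back to single-process defaults when launched with plain python. It is illustrative only: read_dist_env and DistEnv are hypothetical names, not part of this repository, and train.py may handle this differently.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int        # global index of this process across all nodes
    local_rank: int  # index of this process on its own node (often the GPU index)
    world_size: int  # total number of processes in the job


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Hypothetical helper: read the variables that torchrun sets.

    When the script is started with plain `python`, these variables are
    absent, so we fall back to a single-process configuration.
    """
    return DistEnv(
        rank=int(env.get("RANK", 0)),
        local_rank=int(env.get("LOCAL_RANK", 0)),
        world_size=int(env.get("WORLD_SIZE", 1)),
    )


if __name__ == "__main__":
    dist = read_dist_env()
    print(f"process {dist.rank} of {dist.world_size}, local index {dist.local_rank}")
```

With the 4-GPU command above, each of the four processes would see WORLD_SIZE=4 and its own RANK and LOCAL_RANK from 0 to 3.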

Q: How can I train a model to predict phoneme durations?
A: See the ./train_duration_model.ipynb notebook.

Q: How can I generate speech with a trained model?
A: See the ./inference.ipynb notebook.

Credits