SayaSS / vits-finetuning

Fine-Tuning your VITS model using a pre-trained model

MIT License

551 stars 86 forks source link

readme

text cleaner from https://github.com/CjangCjengh/vits

original repo: https://github.com/jaywalnut310/vits

Online training and inference

colab

See vits-finetuning

How to use

(Suggestion) Python == 3.7

Only Japanese datasets can be used for fine-tuning in this repo.

Clone this repository

git clone https://github.com/SayaSS/vits-finetuning.git

Install requirements

pip install -r requirements.txt

Download pre-trained model

G_0.pth
D_0.pth
Edit "model_dir"(line 152) in utils.py
Put pre-trained models in the "model_dir"/checkpoints

If you need to customize "n_speakers", please replace the pre-trained model with these two.

Create datasets

Speaker ID should be between 0-803.
About 50 audio-text pairs will suffice and 100-600 epochs could have quite good performance, but more data may be better.
Resample all audio to 22050Hz, 16-bit, mono wav files.
Audio files should be >=1s and <=10s.
```
path/to/XXX.wav|speaker id|transcript
```
Example

dataset/001.wav|10|こんにちは。

For complete examples, please see filelists/miyu_train.txt and filelists/miyu_val.txt.

Preprocess

python preprocess.py --filelists path/to/filelist_train.txt path/to/filelist_val.txt

Edit "training_files" and "validation_files" in configs/config.json

Train

# Mutiple speakers
python train_ms.py -c configs/config.json -m checkpoints