- A minimum VRAM requirement of 6 GB for training
- Support for multiple speakers
- Create unique speakers through speaker mixing
- It can even convert voices with light accompaniment
- You can edit F0 using Excel
https://github.com/PlayVoice/so-vits-svc-5.0/assets/16432329/6a09805e-ab93-47fe-9a14-9cbc1e0e7c3a
Powered by @ShadowVap
| Feature | From | Status | Function |
| --- | --- | --- | --- |
| whisper | OpenAI | ✅ | strong noise immunity |
| bigvgan | NVIDIA | ✅ | alias and snake; the formant is clearer and the sound quality is obviously improved |
| natural speech | Microsoft | ✅ | reduce mispronunciation |
| neural source-filter | Xin Wang | ✅ | solve the problem of audio F0 discontinuity |
| pitch quantization | Xin Wang | ✅ | quantize the F0 for embedding |
| speaker encoder | | ✅ | timbre encoding and clustering |
| GRL for speaker | Ubisoft | ✅ | prevent the encoder from leaking timbre |
| SNAC | Samsung | ✅ | one-shot clone of VITS |
| SCLN | Microsoft | ✅ | improve cloning |
| Diffusion | Huawei | ✅ | improve sound quality |
| PPG perturbation | this project | ✅ | improve noise immunity and de-timbre |
| HuBERT perturbation | this project | ✅ | improve noise immunity and de-timbre |
| VAE perturbation | this project | ✅ | improve sound quality |
| MIX encoder | this project | ✅ | improve conversion stability |
| USP infer | this project | ✅ | improve conversion stability |
| HiFTNet | Columbia University | ✅ | NSF-iSTFTNet for speed-up |
| RoFormer | Zhuiyi Technology | ✅ | rotary positional embeddings |
Due to the use of data perturbation, training takes longer than in comparable projects.

USP: Unvoiced and Silence with Pitch during inference.
Install PyTorch.
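For example, a default pip install (check https://pytorch.org for the exact command for your platform and CUDA version):

pip install torch torchvision torchaudio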
Install project dependencies
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt
Note: whisper is already built in; do not install it again, otherwise it will cause conflicts and errors.
Download the Timbre Encoder: Speaker-Encoder by @mueller91, and put best_model.pth.tar into speaker_pretrain/.
Download the whisper model whisper-large-v2. Make sure to download large-v2.pt and put it into whisper_pretrain/.
Download the hubert_soft model and put hubert-soft-0d54a1f4.pt into hubert_pretrain/.
Download the pitch extractor crepe full and put full.pth into crepe/assets/.
Note: crepe full.pth is 84.9 MB, not 6 KB; if your file is only a few kilobytes, the download did not fetch the real model.
Download the pretrained model sovits5.0.pretrain.pth and put it into vits_pretrain/.
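Before running the first inference, a minimal Python sketch (not part of the project; the paths are the ones listed in the steps above) can confirm that the pretrained assets are in place:

```python
from pathlib import Path

# Pretrained assets listed in the download steps above.
expected = [
    "speaker_pretrain/best_model.pth.tar",
    "whisper_pretrain/large-v2.pt",
    "hubert_pretrain/hubert-soft-0d54a1f4.pt",
    "crepe/assets/full.pth",
    "vits_pretrain/sovits5.0.pretrain.pth",
]

for rel in expected:
    p = Path(rel)
    if not p.is_file():
        print(f"MISSING: {rel}")
    elif p.stat().st_size < 1_000_000:
        # e.g. crepe full.pth should be ~85 MB; a tiny file usually means a bad download
        print(f"SUSPICIOUSLY SMALL: {rel} ({p.stat().st_size} bytes)")
    else:
        print(f"OK: {rel}")
```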
Quick test with the pretrained model:

python svc_inference.py --config configs/base.yaml --model ./vits_pretrain/sovits5.0.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
Necessary pre-processing: put your dataset into the dataset_raw directory following the structure below.
dataset_raw
├───speaker0
│ ├───000001.wav
│ ├───...
│ └───000xxx.wav
└───speaker1
├───000001.wav
├───...
└───000xxx.wav
python svc_preprocessing.py -t 2

-t: number of threads; it should not exceed the CPU core count, and 2 is usually enough.
After preprocessing, you will get output with the following structure.
data_svc/
└── waves-16k
│ └── speaker0
│ │ ├── 000001.wav
│ │ └── 000xxx.wav
│ └── speaker1
│ ├── 000001.wav
│ └── 000xxx.wav
└── waves-32k
│ └── speaker0
│ │ ├── 000001.wav
│ │ └── 000xxx.wav
│ └── speaker1
│ ├── 000001.wav
│ └── 000xxx.wav
└── pitch
│ └── speaker0
│ │ ├── 000001.pit.npy
│ │ └── 000xxx.pit.npy
│ └── speaker1
│ ├── 000001.pit.npy
│ └── 000xxx.pit.npy
└── hubert
│ └── speaker0
│ │ ├── 000001.vec.npy
│ │ └── 000xxx.vec.npy
│ └── speaker1
│ ├── 000001.vec.npy
│ └── 000xxx.vec.npy
└── whisper
│ └── speaker0
│ │ ├── 000001.ppg.npy
│ │ └── 000xxx.ppg.npy
│ └── speaker1
│ ├── 000001.ppg.npy
│ └── 000xxx.ppg.npy
└── speaker
│ └── speaker0
│ │ ├── 000001.spk.npy
│ │ └── 000xxx.spk.npy
│ └── speaker1
│ ├── 000001.spk.npy
│ └── 000xxx.spk.npy
└── singer
│ ├── speaker0.spk.npy
│ └── speaker1.spk.npy
|
└── indexes
├── speaker0
│ ├── some_prefix_hubert.index
│ └── some_prefix_whisper.index
└── speaker1
├── hubert.index
└── whisper.index
Re-sampling

Generate audio with a sampling rate of 16000 Hz in ./data_svc/waves-16k:

python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-16k -s 16000

Generate audio with a sampling rate of 32000 Hz in ./data_svc/waves-32k:

python prepare/preprocess_a.py -w ./dataset_raw -o ./data_svc/waves-32k -s 32000
Extract the pitch (F0) from the 16 kHz audio with CREPE:

python prepare/preprocess_crepe.py -w data_svc/waves-16k/ -p data_svc/pitch

Extract the whisper content encoding (PPG) from the 16 kHz audio:

python prepare/preprocess_ppg.py -w data_svc/waves-16k/ -p data_svc/whisper

Extract the hubert content vectors from the 16 kHz audio:

python prepare/preprocess_hubert.py -w data_svc/waves-16k/ -v data_svc/hubert

Extract the timbre code of each utterance from the 16 kHz audio:

python prepare/preprocess_speaker.py data_svc/waves-16k/ data_svc/speaker

Average the timbre codes per speaker for inference:

python prepare/preprocess_speaker_ave.py data_svc/speaker/ data_svc/singer

Extract the linear spectrogram from the 32 kHz audio:

python prepare/preprocess_spec.py -w data_svc/waves-32k/ -s data_svc/specs
python prepare/preprocess_train.py
python prepare/preprocess_zzz.py
If fine-tuning from the pre-trained model, you need to download sovits5.0.pretrain.pth, put it into vits_pretrain/ under the project root, set
pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"
in configs/base.yaml, and lower the learning rate appropriately, e.g. to 5e-5.

batch_size: for a GPU with 6 GB VRAM, 6 is the recommended value; 8 will work, but the step speed will be much slower.
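A sketch of the relevant entries in configs/base.yaml; only the pretrain line is quoted above, so the nesting and the learning-rate/batch-size key names are assumptions to verify against the actual file:

```yaml
# configs/base.yaml (excerpt) -- assumed layout; check the real file
train:
  learning_rate: 5.0e-5   # lowered for fine-tuning, as suggested above
  batch_size: 6           # recommended for a 6 GB GPU
  pretrain: "./vits_pretrain/sovits5.0.pretrain.pth"
```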
Start training:

python svc_trainer.py -c configs/base.yaml -n sovits5.0

Resume training from a checkpoint:

python svc_trainer.py -c configs/base.yaml -n sovits5.0 -p chkpt/sovits5.0/sovits5.0_***.pt

View the training log:

tensorboard --logdir logs/
Export the inference model (text encoder, flow network, decoder network):
python svc_export.py --config configs/base.yaml --checkpoint_path chkpt/sovits5.0/***.pt
Inference

If there is no need to adjust f0, just run the following command:

python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0

If f0 will be adjusted manually, follow these steps:

1. Use whisper to extract the content encoding, generating test.ppg.npy:

python whisper/inference.py -w test.wav -p test.ppg.npy

2. Use hubert to extract the content vector, generating test.vec.npy:

python hubert/inference.py -w test.wav -v test.vec.npy

3. Extract the F0 parameters to CSV text format (test.csv), then open the CSV file and manually fix any wrong F0 values (for example in Excel):

python pitch/inference.py -w test.wav -p test.csv

4. Run the final inference:

python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --ppg test.ppg.npy --vec test.vec.npy --pit test.csv --shift 0
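As an alternative to editing the CSV by hand, here is a minimal Python sketch for shifting the voiced F0 values; the exact column layout of test.csv is an assumption, so inspect the file and adjust the column index if it differs:

```python
import csv

SEMITONES = 2  # example: raise every voiced frame by two semitones

with open("test.csv", newline="") as f:
    rows = [row for row in csv.reader(f) if row]

# Assumption: the F0 value is the last column of each row, with 0 (or an
# empty cell) marking unvoiced/silent frames; non-numeric rows are skipped.
for row in rows:
    try:
        f0 = float(row[-1] or 0.0)
    except ValueError:
        continue  # header or non-numeric row
    if f0 > 0:
        row[-1] = f"{f0 * 2 ** (SEMITONES / 12):.3f}"

with open("test.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```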
Notes

- When --ppg is specified, repeated inference on the same audio avoids re-extracting the whisper content encoding; if it is not specified, it is extracted automatically.
- When --vec is specified, repeated inference on the same audio avoids re-extracting the hubert content vector; if it is not specified, it is extracted automatically.
- When --pit is specified, the manually tuned F0 parameters are loaded; if it is not specified, they are extracted automatically.
- The output is written to the current directory as svc_out.wav.
Arguments reference

| args | --config | --model | --spk | --wave | --ppg | --vec | --pit | --shift |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| name | config path | model path | speaker | wave input | wave ppg | wave hubert | wave pitch | pitch shift |
Post-processing with VAD:
python svc_inference_post.py --ref test.wav --svc svc_out.wav --out svc_out_post.wav
To increase the stability of the generated timbre, you can use the method described in the Retrieval-based-Voice-Conversion repository. This method consists of 2 steps:
Step 1: train the retrieval index on hubert and whisper features. Run training with default settings:
python svc_train_retrieval.py
If the number of vectors is more than 200_000, they will be compressed to 10_000 using the MiniBatchKMeans algorithm. You can change these settings with the command-line options:
usage: create faiss indexes for feature retrieval [-h] [--debug] [--prefix PREFIX] [--speakers SPEAKERS [SPEAKERS ...]] [--compress-features-after COMPRESS_FEATURES_AFTER]
[--n-clusters N_CLUSTERS] [--n-parallel N_PARALLEL]
options:
-h, --help show this help message and exit
--debug
--prefix PREFIX add prefix to index filename
--speakers SPEAKERS [SPEAKERS ...]
speaker names to create an index. By default all speakers are from data_svc
--compress-features-after COMPRESS_FEATURES_AFTER
If the number of features is greater than the value compress feature vectors using MiniBatchKMeans.
--n-clusters N_CLUSTERS
Number of centroids to which features will be compressed
--n-parallel N_PARALLEL
Number of parallel jobs for MiniBatchKMeans. Default is cpus-1
Compressing the training vectors can speed up index inference, but it reduces retrieval quality. Use vector compression only if you really have a lot of vectors.
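For example (the values are illustrative; the options are those listed in the help above):

python svc_train_retrieval.py --prefix some_prefix_ --speakers speaker0 --compress-features-after 200000 --n-clusters 10000 --n-parallel 4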
The resulting indexes will be stored in the "indexes" folder as:
data_svc
...
└── indexes
├── speaker0
│ ├── some_prefix_hubert.index
│ └── some_prefix_whisper.index
└── speaker1
├── hubert.index
└── whisper.index
Step 2: at the inference stage, the n closest features are added, in a certain proportion, to the features fed to the VITS model. Enable feature retrieval with the following settings:
python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0 \
--enable-retrieval \
--retrieval-ratio 0.5 \
--n-retrieval-vectors 3
For a better retrieval effect, you can try different values of --retrieval-ratio and --n-retrieval-vectors.

If you have multiple sets of indexes, you can select a specific set with --retrieval-index-prefix.

You can explicitly specify the paths to the hubert and whisper indexes with --hubert-index-path and --whisper-index-path.
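For example (the prefix is illustrative and must match the indexes you actually built):

python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0 --enable-retrieval --retrieval-ratio 0.5 --n-retrieval-vectors 3 --retrieval-index-prefix some_prefix_

or with explicit index paths:

python svc_inference.py --config configs/base.yaml --model sovits5.0.pth --spk ./data_svc/singer/your_singer.spk.npy --wave test.wav --shift 0 --enable-retrieval --hubert-index-path data_svc/indexes/speaker0/some_prefix_hubert.index --whisper-index-path data_svc/indexes/speaker0/some_prefix_whisper.index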
Named by pure coincidence: average -> ave -> eva; Eve (EVA) represents conception and reproduction.
python svc_eva.py
eva_conf = {
    # singer feature file : mixing weight
    './configs/singers/singer0022.npy': 0,
    './configs/singers/singer0030.npy': 0,
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}
The generated singer file will be eva.spk.npy.
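For illustration only: the mixing presumably amounts to a weighted average of the speaker feature vectors. The actual implementation lives in svc_eva.py and may differ.

```python
import numpy as np

# Hypothetical sketch of speaker mixing as a weighted average of
# speaker feature vectors; not the project's own code.
eva_conf = {
    './configs/singers/singer0047.npy': 0.5,
    './configs/singers/singer0051.npy': 0.5,
}

mix = sum(weight * np.load(path) for path, weight in eva_conf.items())
np.save('eva.spk.npy', mix)
```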
https://github.com/facebookresearch/speech-resynthesis paper
https://github.com/jaywalnut310/vits paper
https://github.com/openai/whisper/ paper
https://github.com/NVIDIA/BigVGAN paper
https://github.com/mindslab-ai/univnet paper
https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
https://github.com/huawei-noah/Speech-Backbones/tree/main/Grad-TTS
https://github.com/brentspell/hifi-gan-bwe
https://github.com/mozilla/TTS
https://github.com/bshall/soft-vc
https://github.com/maxrmorrison/torchcrepe
https://github.com/MoonInTheRiver/DiffSinger
https://github.com/OlaWod/FreeVC paper
https://github.com/yl4579/HiFTNet paper
Autoregressive neural f0 model for statistical parametric speech synthesis
Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers
AdaSpeech: Adaptive Text to Speech for Custom Voice
AdaVITS: Tiny VITS for Low Computing Resource Speaker Adaptation
Cross-Speaker Prosody Transfer on Any Text for Expressive Speech Synthesis
Multilingual Speech Synthesis and Cross-Language Voice Cloning: GRL
RoFormer: Enhanced Transformer with rotary position embedding
https://github.com/auspicious3000/contentvec/blob/main/contentvec/data/audio/audio_utils_1.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/praat.py
https://github.com/revsic/torch-nansy/blob/main/utils/augment/peq.py
https://github.com/biggytruck/SpeechSplit2/blob/main/utils.py
https://github.com/OlaWod/FreeVC/blob/main/preprocess_sr.py
https://github.com/Francis-Komizu/Sovits
2022.04.12 https://mp.weixin.qq.com/s/autNBYCsG4_SvWt2-Ll_zA
2022.04.22 https://github.com/PlayVoice/VI-SVS
2022.07.26 https://mp.weixin.qq.com/s/qC4TJy-4EVdbpvK2cQb1TA
2022.09.08 https://github.com/PlayVoice/VI-SVC