effusiveperiscope / so-vits-svc

SoftVC VITS Singing Voice Conversion

Notice

Inference GUI 2 - Installation

On Windows, try the script under releases. Otherwise, run pip install -r requirements.txt in a Python 3.8/3.9 (conda) environment. Additional features may be available depending on other installed dependencies.
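For example, a from-scratch setup might look like the following (a minimal sketch assuming conda is installed; the environment name sovits is arbitrary):

conda create -n sovits python=3.9
conda activate sovits
pip install -r requirements.txt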

Basic Usage

Models should be placed in separate folders within a folder called models, in the same directory as inference_gui2.py by default. Specifically, the file structure should be:

so-vits-svc-eff\
    models\
        TwilightSparkle\
            G_*****.pth
            D_*****.pth
            kmeans_*****.pt {may or may not be present for some models}
            config.json

If the proper libraries are installed, the GUI can be run simply by running inference_gui2.py. If everything goes well you should see something like this (some features may not be available depending on what extra libraries you have installed):

All basic workflow occurs under the leftmost UI panel.

  1. Select a speaker based on the listed names under Speaker:.
  2. Drag and drop reference audio files to be converted onto Files to Convert. Alternatively, click on Files to Convert or Recent Directories to open a file dialog.
  3. Set the desired transpose under Transpose (for male-to-female (m2f) vocal conversion this is usually 12, i.e. an octave; leave it at 0 if the reference audio is female).
  4. Click Convert. The resulting file should appear under results.

The right UI panel allows for recording audio directly into the GUI for quick fixes and tests. Simply select the proper audio device and click Record to begin recording. Recordings will automatically be saved to a recordings folder. The resulting recording can be transferred to the so-vits-svc panel by pressing Push last output to so-vits-svc.

Common issues

Other options

Cool features

Running with TalkNet

For TalkNet support, pip install requests and also install this ControllableTalkNet fork. Instead of running talknet_offline.py, run alt_server.py (if you use a batch script or conda environment to run TalkNet, use it to run alt_server.py as well). This starts a server that can interface with Inference GUI 2. The TalkNet server must be started before Inference GUI 2.
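A typical startup sequence might look like this (a sketch; which environment needs requests is an assumption here, and exact paths depend on your installation):

# in the so-vits-svc environment (requests is used to talk to the TalkNet server)
pip install requests
# in the TalkNet fork's environment, start the server first:
python alt_server.py
# then, in a separate terminal, start the GUI:
python inference_gui2.py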

Next, starting Inference GUI 2 should show a UI like this:

The rightmost panel shows controls for TalkNet which are similar to those used in the web interface. Some items special to this interface:

Model Overview

A singing voice conversion (SVC) model. The SoftVC encoder extracts speech features from the source audio, which are fed into VITS along with the F0 (pitch) in place of the original text input, achieving a voice conversion effect while preserving the melody and lyrics. Additionally, the vocoder is changed to NSF HiFiGAN to fix the issue of unwanted staccato.

Notice

4.0 Features

Demo: Hugging Face Spaces

Required downloads

wget -P logs/44k/ https://huggingface.co/therealvul/so-vits-svc-4.0-init/resolve/main/G_0.pth
wget -P logs/44k/ https://huggingface.co/therealvul/so-vits-svc-4.0-init/resolve/main/D_0.pth

Colab notebook scripts

Colab training notebook (EN)

Colab inference notebook (EN)

Note that the following notebooks are not maintained by me.

Colab training notebook (CN)

Dataset preparation

All that is required is that the data be placed under the dataset_raw folder in the structure shown below.

dataset_raw
├───speaker0
│   ├───xxx1-xxx1.wav
│   ├───...
│   └───Lxx-0xx8.wav
└───speaker1
    ├───xx2-0xxx2.wav
    ├───...
    └───xxx7-xxx007.wav

Data pre-processing

  1. Resample to 44100 Hz
    python resample.py
  2. Automatically split the dataset into training, validation, and test sets, and automatically generate configuration files
    python preprocess_flist_config.py
  3. Generate hubert and F0 features
    python preprocess_hubert_f0.py

    After running the steps above, the dataset folder will contain all the pre-processed data, and you can delete the dataset_raw folder.
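Taken together, a full preprocessing pass over dataset_raw is just the three commands in sequence:

python resample.py
python preprocess_flist_config.py
python preprocess_hubert_f0.py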

Training

python train.py -c configs/config.json -m 44k

Note: Old models are automatically cleared during training, and only the latest 5 checkpoints are kept. If you want to guard against overfitting, manually back up the model checkpoints you want to keep, or set keep_ckpts to 0 in the configuration file so checkpoints are never cleared.
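For example, in configs/config.json (a sketch assuming keep_ckpts sits in the train section, as in upstream so-vits-svc):

"train": {
  "keep_ckpts": 0
}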

To train a cluster model, train a so-vits-svc 4.0 model first (as above), then execute python cluster/train_cluster.py.

Inference

For instructions on using the GUI, see the eff branch. Otherwise, use inference_main.py; command-line support has been added for inference.

# Example
python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "君の知らない物語-src.wav" -t 0 -s "nen"

Required fields

Optional fields

Automatic f0 prediction

The 4.0 model training process trains an f0 predictor. For voice (speech) conversion you can enable automatic pitch prediction. Do not enable this when converting singing voices unless you want the result to be out of tune.
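On the command line, this is exposed as a flag on inference_main.py (a sketch assuming the -a/--auto_predict_f0 option from upstream so-vits-svc; speech-src.wav is a placeholder input file; check python inference_main.py --help to confirm):

python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "speech-src.wav" -t 0 -s "nen" -a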

Cluster timbre leakage

Clustering is used to make the model output closer to the target timbre, at the cost of articulation/intelligibility. The model can linearly blend between the non-clustering scheme (more intelligible, ratio 0) and the clustering scheme (more speaker-like, ratio 1).
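On the command line this corresponds to the cluster options of inference_main.py (a sketch assuming the -cm/--cluster_model_path and -cr/--cluster_infer_ratio options from upstream so-vits-svc; the kmeans filename is a placeholder, and 0.5 is an arbitrary halfway blend):

python inference_main.py -m "logs/44k/G_30400.pth" -c "configs/config.json" -n "speech-src.wav" -t 0 -s "nen" -cm "logs/44k/kmeans_10000.pt" -cr 0.5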

Onnx export

Use onnx_export.py
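A minimal invocation might look like this (a sketch; in upstream so-vits-svc the checkpoint and config paths are typically edited inside onnx_export.py itself rather than passed on the command line):

python onnx_export.py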