
soft-vc-acoustic-models

Provides scripts for conveniently training your own acoustic model for Soft-VC, along with a list of experimental pretrained acoustic models.

To build a voice-changer application with Soft-VC, all you need to retrain is your own acoustic model.
This repo provides integration scripts that make it easy to train your own acoustic model :). You can also try downloading some of the pretrained voice banks (though nearly all of them sound pretty rough!!).
The base code is trimmed, modified, and merged from the four official soft-vc repositories.

Voice Banks

| Timbre | Vbank | Description | Corpus | Duration | Best checkpoint | Listening quality |
|--------|-------|-------------|--------|----------|-----------------|-------------------|
| LJSpeech | ljspeech | English, adult female | LJSpeech public dataset | 24h | 32k steps | usable |
| DataBaker | databaker | Mandarin, adult female | DataBaker public dataset | 10h | 25k steps | usable |
| 阿消 | 500 | Japanese, young female | in-game voice (Arknights) | | 500 steps | WIP, hold on |
| 卡达 | click | Japanese, young female | in-game voice (Arknights) | | 3250 steps | WIP, hold on |
| 红云 | vermeil | Japanese, young female | in-game voice (Arknights) | | 5000 steps | WIP, hold on |
| | aak | Japanese, young male | in-game voice (Arknights) | | 2250 steps | WIP, hold on |
| 水月 | mizuki | Japanese, young male | in-game voice (Arknights) | | 4750 steps | WIP, hold on |
| 罗小黑 | luoxiaohei | Japanese, young male | in-game voice (Arknights) | | 5000 steps | WIP, hold on |
| | sou | Japanese, young male | singing extracted from the voice provider of 空詩音レミ | 0.243h | 11k steps | cracking, partly monotone |
| 空詩音レミ | lemi | Japanese, young male | (DeepVocal) synthesized-singing export | 0.351h | 34k steps | cracks on high notes |
| 鏡音レン | len | Japanese, young male | (Vocaloid) synthesized-singing export | 0.575h | 36k steps | cracks on high notes |
| はなinit | hana | Japanese, androgynous youth | (UTAU) synthesized-singing export + voicebank recordings | 1.672h | 37k steps | barely listenable, almost monotone |
| 旭音エマ | ema | Japanese, androgynous youth | (UTAU) synthesized-singing export + voicebank recordings | 0.433h | 2k steps | severe cracking |
| 狽音ウルシ | urushi | Japanese, young male | (UTAU) voicebank recordings | 0.190h | 36k steps | completely monotone |
| 兰斯 | lansi | Mandarin, young male | (UTAU) voicebank recordings (+ data augmentation) | 5.417h | 21k steps | barely listenable, almost monotone |
| 钢琴 | piano | piano & strings | piano pieces and a few string concertos | 0.800h | 32k steps | |

⚠️ The voices of natural persons are protected by local laws and shall be used ONLY for necessary purposes such as personal study, artistic appreciation, classroom teaching, or scientific research.

Pretrained model checkpoints can be found here: https://pan.quark.cn/s/f9ac2b933d7e.
Each vbank is trained for the same 40k steps, but only the best checkpoint is published.

ℹ️ Note: not all vbanks sound pleasing, due to very limited (and sometimes pitch-invariant) training data; please check the audio samples in index.html for a comprehensive impression.
For a discussion of how much data is needed to train a satisfactory voice bank, refer to this repo: soft-vc-acoustic-model-ablation-study. For a discussion of how to build a timbre model from pitch-invariant data only, refer to this repo: hubert-pitdyn.

Quick Start

⚪ Use pretrained voice banks

Command-line API

Download the pretrained checkpoint file and put it at log\<vbank>\model-best.pt, where <vbank> is the name of the voice bank

python infer.py <vbank> <input>            => <input> can be either a file or a folder
python infer.py ljspeech test\000001.wav   => gen\000001_ljspeech.wav
python infer.py hana test                  => gen\*_hana.wav

Converted outputs are by default generated under the gen folder, with filenames carrying the trailing suffix _<vbank>

Programmatic API

# imports
import torch
import torchaudio
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present
from scipy.io import wavfile

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load model
hubert   = torch.hub.load("bshall/hubert:main",          "hubert_soft").to(device)
acoustic = torch.hub.load("bshall/acoustic-model:main",  "hubert_soft").to(device)
hifigan  = torch.hub.load("bshall/hifigan:main", "hifigan_hubert_soft").to(device)

# load checkpoint
vbank = 'ljspeech'                 # or 'hana', etc..
ckpt_fp = f'log/{vbank}/model-best.pt'
ckpt = torch.load(ckpt_fp, map_location=device)
consume_prefix_in_state_dict_if_present(ckpt["acoustic-model"], "module.")
acoustic.load_state_dict(ckpt["acoustic-model"])

# load wavfile
wav_fp = r'test\000001.wav'      # or whatever you want to convert from
source, sr = torchaudio.load(wav_fp)
source = torchaudio.functional.resample(source, sr, 16000)
source = source.unsqueeze(0).to(device)

# do soft-vc transform
with torch.inference_mode():
  units = hubert.units(source)
  mel = acoustic.generate(units).transpose(1, 2)
  target = hifigan(mel)

# save wavfile
y_hat = target.squeeze().cpu().numpy()
wavfile.write('converted.wav', 16000, y_hat)

See demo.ipynb and infer.py for more details.
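
The snippet above converts a single file. For folder-level conversion like the command-line API, a thin wrapper suffices. A minimal sketch, reusing the hubert/acoustic/hifigan models and device loaded above; the convert_folder helper and its naming logic are assumptions modeled on the CLI examples, not the actual infer.py internals:

# Hypothetical helper: batch-convert a folder, mimicking `python infer.py <vbank> <input>`
from pathlib import Path

def convert_folder(in_dp: str, vbank: str, out_dp: str = 'gen'):
  Path(out_dp).mkdir(exist_ok=True)
  for wav_fp in sorted(Path(in_dp).glob('*.wav')):
    # load & resample to the 16kHz expected by HuBERT
    source, sr = torchaudio.load(str(wav_fp))
    source = torchaudio.functional.resample(source, sr, 16000)
    source = source.unsqueeze(0).to(device)
    # wav -> soft units -> mel -> wav, same pipeline as the single-file snippet
    with torch.inference_mode():
      units = hubert.units(source)
      mel = acoustic.generate(units).transpose(1, 2)
      target = hifigan(mel)
    # follow the `<name>_<vbank>.wav` output convention described above
    wavfile.write(str(Path(out_dp) / f'{wav_fp.stem}_{vbank}.wav'), 16000, target.squeeze().cpu().numpy())

convert_folder('test', vbank)   # cf. `python infer.py ljspeech test`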

⚪ Train your own voice bank

ℹ️ Note that each acoustic model is typically treated as a single timbre, so training on a multi-speaker dataset will likely yield a confused, blended timbre. Hence I will probably try to develop a globally-conditioned multi-timbre acoustic model in the near future :)

  1. prepare a folder containing *.wav files (*.mp3 is currently not supported), i.e. the <wavpath>
  2. (optional) create a config file <config>.json under the configs folder (refer to configs\default.json, which is used by default; see the config sketch after this list)
  3. install dependencies: pip install -r requirements.txt
  4. use the two-stage scripts for the preprocessing and training routine:
    • preprocess with mk_preprocess.cmd <vbank> <wavpath>
    • e.g. mk_preprocess.cmd ljspeech C:\LJSpeech-1.1\wavs
    • train with mk_train.cmd <vbank> [config] [resume], and wait for 2000 years :laughing:
    • e.g. mk_train.cmd ljspeech default
    • or, if you want to proceed step by step, refer to the recipes in the Makefile:
    • make dirs VBANK=<vbank> WAVPATH=<wavpath> creates the necessary folder hierarchy and soft-links
    • make units VBANK=<vbank> encodes waveforms to HuBERT hidden units
    • make mels VBANK=<vbank> transforms waveforms to log-mel spectrograms
    • make train VBANK=<vbank> CONFIG=[config] RESUME=[resume] trains the acoustic model on the paired (unit, mel) data
    • make train_resume VBANK=<vbank> CONFIG=[config] resumes training from the saved model-best.pt
    • NOTE: preprocessed features are generated under data\<vbank>\*, while model checkpoints are saved under log\<vbank>
  5. you can launch the TensorBoard summary with make stats VBANK=<vbank>
  6. once training is finished, run python infer.py <vbank> <input> (e.g. python infer.py ljspeech test) to generate converted wavfiles for the folder <input>
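
For step 2 above, the simplest route is to copy configs\default.json and adjust it. A minimal sketch; no key names are assumed here (the tunable fields are whatever default.json actually defines), and myvoice is a hypothetical <config> name:

# Sketch: derive a custom <config>.json from configs/default.json
import json

with open('configs/default.json', 'r', encoding='utf-8') as fh:
  cfg = json.load(fh)

print(list(cfg.keys()))   # inspect which hyperparameters are tunable

with open('configs/myvoice.json', 'w', encoding='utf-8') as fh:
  json.dump(cfg, fh, indent=2, ensure_ascii=False)

Edit the copied file as needed, then pass its basename as the [config] argument, e.g. mk_train.cmd <vbank> myvoice.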

If you have neither make.exe nor cmd.exe, you can use the Python scripts directly:

# Ensure you have created the directory hierarchy first:
# mkdir data/<vbank> data/<vbank>/units data/<vbank>/mels log
# mklink /J data/<vbank>/wavs path/to/vbank/wavpath

python preprocess.py <vbank> <--encode|--melspec>
python train.py <vbank> --config CONFIG [--resume RESUME]
python infer.py <vbank> <input> [--log_path LOG_PATH]

Project Layout

.
├── thesis/                   // original papers, for reference
├── acoustic/                 // acoustic model code
├── configs/                  // hyperparameter configs for training
│   ├── default.json
│   ├── <config>.json
│   └── ...
├── data/                     // training data files
│   ├── <vbank>/
│   │   ├── wavs/             // soft-link (junction) to <wavpath>, created by mklink
│   │   ├── units/            // HuBERT features produced by preprocess
│   │   └── mels/             // mel-spectrogram features produced by preprocess
│   └── ...
├── log/                      // model checkpoints + log stats
│   ├── <vbank>/
│   │   ├── logs/             // logs (`*.log`) + TensorBoard (`events.out.tfevents.*`)
│   │   ├── model-best.pt     // best checkpoint
│   │   └── model-<steps>.pt  // intermediate checkpoints
│   └── ...
├── preprocess.py             // data preprocessing code
├── train.py                  // training code
├── infer.py                  // synthesis code (command-line API)
├── demo.ipynb                // programmatic API example
├── ...
├── mk_preprocess.cmd         // preprocessing script for custom voice banks (preprocess only)
├── mk_train.cmd              // training script for custom voice banks (train only)
├── Makefile                  // stepwise task recipes for custom voice banks
├── ...
├── test/                     // demo source dataset
├── gen/                      // demo outputs (demo sources converted with the demo voice banks)
├── index.html                // demo listing page
├── mk_index.py               // demo page generator (produces index.html)
└── mk_infer_test.cmd         // demo output generation script (produces gen/)

ℹ️ These scripts and tools mainly target the Windows platform; if you work on Linux or macOS, you may need to adapt them on your own :(
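
For Linux/macOS, the Windows-only directory setup (mkdir plus mklink /J) maps directly onto standard calls. A cross-platform sketch, assuming only the hierarchy shown in the script comments above (the make_dirs helper and example paths are hypothetical):

# Sketch: replicate `mkdir ...` and `mklink /J data/<vbank>/wavs <wavpath>` portably
import os
from pathlib import Path

def make_dirs(vbank: str, wavpath: str):
  for dp in [f'data/{vbank}/units', f'data/{vbank}/mels', 'log']:
    Path(dp).mkdir(parents=True, exist_ok=True)
  link = Path(f'data/{vbank}/wavs')
  if not link.exists():
    # POSIX counterpart of the Windows junction: a directory symlink
    os.symlink(Path(wavpath).resolve(), link, target_is_directory=True)

make_dirs('ljspeech', '/path/to/LJSpeech-1.1/wavs')   # example path is hypothetical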

References

Great thanks to the founding authors of Soft-VC! :lollipop:

@inproceedings{
  soft-vc-2022,
  author={van Niekerk, Benjamin and Carbonneau, Marc-André and Zaïdi, Julian and Baas, Matthew and Seuté, Hugo and Kamper, Herman},
  booktitle={ICASSP}, 
  title={A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion}, 
  year={2022}
}

by Armit 2022/09/12