Provides scripts for conveniently training your own acoustic model for Soft-VC, along with a list of experimentally pretrained acoustic models.
To build a voice-changer application with Soft-VC, you only need to retrain your own acoustic model.
This repo provides glue scripts for conveniently training your own acoustic model :) You can also try downloading some pretrained voice banks (although almost all of them sound pretty bad!!).
The base code is trimmed, modified and integrated from the four official Soft-VC repositories.
| Timbre | vbank | Description | Corpus | Duration | Best checkpoint | Perceived quality |
|---|---|---|---|---|---|---|
| LJSpeech | ljspeech | English adult female | LJSpeech public dataset | 24h | 32k steps | usable |
| DataBaker | databaker | Mandarin adult female | DataBaker public dataset | 10h | 25k steps | usable |
| 阿消 | 500 | Japanese young female | in-game voice (Arknights) | | 500 steps | still in the making, be patient |
| 卡达 | click | Japanese young female | in-game voice (Arknights) | | 3250 steps | still in the making, be patient |
| 红云 | vermeil | Japanese young female | in-game voice (Arknights) | | 5000 steps | still in the making, be patient |
| 阿 | aak | Japanese young male | in-game voice (Arknights) | | 2250 steps | still in the making, be patient |
| 水月 | mizuki | Japanese young male | in-game voice (Arknights) | | 4750 steps | still in the making, be patient |
| 罗小黑 | luoxiaohei | Japanese young male | in-game voice (Arknights) | | 5000 steps | still in the making, be patient |
| 爽 | sou | Japanese young male | singing voice extraction (voice provider of 空詩音レミ) | 0.243h | 11k steps | cracks, partly flat pitch |
| 空詩音レミ | lemi | Japanese young male (DeepVocal) | exported singing synthesis | 0.351h | 34k steps | cracks on high notes |
| 鏡音レン | len | Japanese young male (Vocaloid) | exported singing synthesis | 0.575h | 36k steps | cracks on high notes |
| はなinit | hana | Japanese androgynous youth (UTAU) | exported singing synthesis + voicebank recordings | 1.672h | 37k steps | barely listenable, almost flat pitch |
| 旭音エマ | ema | Japanese androgynous youth (UTAU) | exported singing synthesis + voicebank recordings | 0.433h | 2k steps | severe cracking |
| 狽音ウルシ | urushi | Japanese young male (UTAU) | voicebank recordings | 0.190h | 36k steps | completely flat pitch |
| 兰斯 | lansi | Mandarin young male (UTAU) | voicebank recordings (+ data augmentation) | 5.417h | 21k steps | barely listenable, almost flat pitch |
| 钢琴 | piano | piano & strings | piano pieces and a few string concertos | 0.800h | 32k steps | weird |
⚠️ The voice of a natural person is protected by local laws and shall be used ONLY for necessary purposes such as personal study, artistic appreciation, classroom teaching, or scientific research.
Pretrained model checkpoints can be found here: https://pan.quark.cn/s/f9ac2b933d7e.
Each vbank is trained for an equal budget of 40k steps, but only the best checkpoint is published.
ℹ️ Note: not all vbanks sound pleasing, due to very limited, or even pitch-invariant, training data; please check the audio samples in index.html for a comprehensive impression.
For a discussion of how much data is needed to train a satisfactory voice bank, refer to this repo: soft-vc-acoustic-model-ablation-study.
For a discussion of how to build a timbre model from pitch-invariant data only, refer to this repo: hubert-pitdyn.
⚪ Use pretrained voice banks
Download the pretrained checkpoint file and put it at the path log\<vbank>\model-best.pt, where <vbank> is the name of the voice bank.
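For example, on Windows, placing a downloaded ljspeech checkpoint (assuming the file sits in the current directory):

```
mkdir log\ljspeech
move model-best.pt log\ljspeech\model-best.pt
```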
```
python infer.py <vbank> <input>            # <input> can be a file or a folder
python infer.py ljspeech test\000001.wav   # => gen\000001_ljspeech.wav
python infer.py hana test                  # => gen\*_hana.wav
```
Converted outputs are generated under the gen folder by default, with file names carrying the trailing suffix _<vbank>.
```python
# imports
import torch
import torchaudio
from torch.nn.modules.utils import consume_prefix_in_state_dict_if_present
from scipy.io import wavfile

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# load models
hubert = torch.hub.load("bshall/hubert:main", "hubert_soft").to(device)
acoustic = torch.hub.load("bshall/acoustic-model:main", "hubert_soft").to(device)
hifigan = torch.hub.load("bshall/hifigan:main", "hifigan_hubert_soft").to(device)

# load checkpoint
vbank = 'ljspeech'   # or 'hana', etc.
ckpt_fp = f'log/{vbank}/model-best.pt'
ckpt = torch.load(ckpt_fp, map_location=device)
consume_prefix_in_state_dict_if_present(ckpt["acoustic-model"], "module.")
acoustic.load_state_dict(ckpt["acoustic-model"])

# load wavfile
wav_fp = r'test\000001.wav'  # or whatever you want to convert from
source, sr = torchaudio.load(wav_fp)
source = torchaudio.functional.resample(source, sr, 16000)
source = source.unsqueeze(0).to(device)

# do soft-vc transform
with torch.inference_mode():
    units = hubert.units(source)                    # wave -> soft hidden units
    mel = acoustic.generate(units).transpose(1, 2)  # units -> log-mel spectrogram
    target = hifigan(mel)                           # mel -> wave

# save wavfile
y_hat = target.squeeze().cpu().numpy()
wavfile.write('converted.wav', 16000, y_hat)
```
See demo.ipynb and infer.py for more details.
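For folder inputs, the conversion loop can be sketched as follows; this is a simplified illustration rather than the actual infer.py code, and it assumes the models and variables from the snippet above are already loaded, mirroring the gen\<stem>_<vbank>.wav naming convention:

```python
from pathlib import Path

in_dp, out_dp = Path('test'), Path('gen')
out_dp.mkdir(exist_ok=True)
for wav_fp in in_dp.glob('*.wav'):
    # load & resample to the 16kHz expected by HuBERT
    source, sr = torchaudio.load(str(wav_fp))
    source = torchaudio.functional.resample(source, sr, 16000)
    source = source.unsqueeze(0).to(device)
    # wave -> units -> mel -> wave
    with torch.inference_mode():
        units = hubert.units(source)
        mel = acoustic.generate(units).transpose(1, 2)
        target = hifigan(mel)
    # write gen/<stem>_<vbank>.wav
    y_hat = target.squeeze().cpu().numpy()
    wavfile.write(str(out_dp / f'{wav_fp.stem}_{vbank}.wav'), 16000, y_hat)
```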
⚪ Train your own voice bank
ℹ️ Note that each acoustic model is typically treated as a single timbre, so training on a multi-speaker dataset will most likely yield a confused, averaged timbre. Hence I will probably try to develop a globally-conditioned multi-timbre acoustic model in the near future :)
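For illustration only, here is one possible shape of such a global condition (all names below are hypothetical, not part of this repo): a learned speaker embedding is broadcast over time and folded back into the unit features before the existing unit-to-mel model.

```python
import torch
import torch.nn as nn

class MultiTimbreAcoustic(nn.Module):
    """Hypothetical sketch: wrap a single-timbre acoustic model with a global speaker condition."""
    def __init__(self, acoustic: nn.Module, n_speakers: int, d_unit: int = 256, d_embed: int = 64):
        super().__init__()
        self.acoustic = acoustic                            # any unit-to-mel acoustic model
        self.spk_embed = nn.Embedding(n_speakers, d_embed)  # one embedding per timbre
        self.proj = nn.Linear(d_unit + d_embed, d_unit)     # fold the condition back into unit space

    def forward(self, units: torch.Tensor, spk_id: torch.Tensor) -> torch.Tensor:
        # units: (B, T, d_unit); spk_id: (B,)
        emb = self.spk_embed(spk_id)                          # (B, d_embed)
        emb = emb.unsqueeze(1).expand(-1, units.size(1), -1)  # broadcast over time steps
        cond = self.proj(torch.cat([units, emb], dim=-1))     # (B, T, d_unit)
        return self.acoustic(cond)
```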
1. Put your training wav files under some folder <wavpath>.
2. (optional) Create a <config>.json under the configs folder (refer to configs\default.json, which is used by default).
3. Run pip install -r requirements.txt.
4. Run make_preprocess.cmd <vbank> <wavpath>, e.g. make_preprocess.cmd ljspeech C:\LJSpeech-1.1\wavs.
5. Run make_train.cmd <vbank> [config] [resume], e.g. make_train.cmd ljspeech default, and wait for 2000 years :laughing:
The Makefile splits the pipeline into individual targets (see the example run after this list):

- make dirs VBANK=<vbank> WAVPATH=<wavpath> creates the necessary folder hierarchy and soft-links
- make units VBANK=<vbank> encodes waveforms into HuBERT hidden units
- make mels VBANK=<vbank> transforms waveforms into log-mel spectrograms
- make train VBANK=<vbank> CONFIG=[config] RESUME=[resume] trains the acoustic model on the paired data (unit, mel)
- make train_resume VBANK=<vbank> CONFIG=[config] resumes training from the saved model-best.pt
- make stats VBANK=<vbank>

Preprocessed data is kept under data\<vbank>\*, while model checkpoints are saved in log\<vbank>.
Run python infer.py <vbank> <input> (e.g. python infer.py ljspeech test) to generate converted wavfiles for the folder <input>.
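Taken together, a typical from-scratch run might look like this (the vbank name myvoice and the wav path are placeholders):

```
make dirs VBANK=myvoice WAVPATH=C:\mydata\wavs
make units VBANK=myvoice
make mels VBANK=myvoice
make train VBANK=myvoice CONFIG=default
```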
If you have neither make.exe nor cmd.exe, you can use the Python scripts directly:

```
# Ensure you have created the directory hierarchy first:
#   mkdir data/<vbank> data/<vbank>/units data/<vbank>/mels log
#   mklink /J data/<vbank>/wavs path/to/vbank/wavpath
python preprocess.py vbank <--encode|melspec>
python train.py vbank --config CONFIG [--resume RESUME]
python infer.py vbank input [--log_path LOG_PATH]
```
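For example, a full manual run for a hypothetical vbank named myvoice (assuming --encode and --melspec are the two preprocessing flags, as the usage string above suggests; check preprocess.py --help for the exact interface):

```
python preprocess.py myvoice --encode
python preprocess.py myvoice --melspec
python train.py myvoice --config default
python infer.py myvoice test
```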
```
.
├── thesis/               // original papers for reference
├── acoustic/             // acoustic model code
├── configs/              // hyperparameter configs for training
│   ├── default.json
│   ├── <config>.json
│   └── ...
├── data/                 // training data
│   ├── <vbank>/
│   │   ├── wavs/         // soft-link pointing to <wavpath> (created by mklink)
│   │   ├── units/        // HuBERT features produced by preprocess
│   │   └── mels/         // mel-spectrogram features produced by preprocess
│   └── ...
├── log/                  // model checkpoints + log statistics
│   ├── <vbank>/
│   │   ├── logs/         // logs (`*.log`) + TFBoard (`events.out.tfevents.*`)
│   │   ├── model-best.pt     // best checkpoint
│   │   ├── model-<steps>.pt  // intermediate checkpoints
│   └── ...
├── preprocess.py         // data preprocessing code
├── train.py              // training code
├── infer.py              // synthesis code (command-line API)
├── demo.ipynb            // programmatic API example
├── ...
├── mk_train.cmd          // training script for custom voice banks (training only, step 4)
├── mk_preprocess.cmd     // preprocessing script for custom voice banks (preprocessing only, steps 1~3)
├── Makefile              // task script for custom voice banks (step by step)
├── ...
├── test/                 // demo source dataset
├── gen/                  // demo generated dataset (conversion results of the demo sources on the demo vbanks)
├── index.html            // demo list page
├── mk_index.py           // demo page generation script (produces index.html)
└── mk_infer_test.cmd     // demo dataset generation script (produces gen/)
```
ℹ️ These scripts and tools mainly target the Windows platform; if you work on Linux or Mac, you may need to adapt them yourself :(
Great thanks to the original authors of Soft-VC! :lollipop:
```bibtex
@inproceedings{soft-vc-2022,
  author    = {van Niekerk, Benjamin and Carbonneau, Marc-André and Zaïdi, Julian and Baas, Matthew and Seuté, Hugo and Kamper, Herman},
  booktitle = {ICASSP},
  title     = {A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion},
  year      = {2022}
}
```
by Armit 2022/09/12