axinc-ai / ailia-models

The collection of pre-trained, state-of-the-art AI models for ailia SDK

ADD nnsvs #1193

Open kyakuno opened 1 year ago

kyakuno commented 1 year ago

https://github.com/nnsvs/nnsvs (MIT license)

kyakuno commented 1 year ago

nnmnkwii is used to handle the HTS phoneme labels. https://r9y9.github.io/nnmnkwii/latest/references/generated/nnmnkwii.io.hts.load.html
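For reference, each line of an HTS full-context label file has the form `start end context`, with times in units of 100 ns. The sketch below reads such lines in plain Python. It is not nnmnkwii's implementation: `parse_hts_line` and `current_phoneme` are hypothetical helpers, and the context strings are shortened, made-up examples.

```python
import re
from typing import NamedTuple


class HTSLabel(NamedTuple):
    start_sec: float
    end_sec: float
    context: str


def parse_hts_line(line: str) -> HTSLabel:
    """Parse one 'start end context' line; HTS times are in 100 ns units."""
    start, end, context = line.strip().split(maxsplit=2)
    return HTSLabel(int(start) / 1e7, int(end) / 1e7, context)


def current_phoneme(context: str) -> str:
    """Return the phoneme between the first '-' and the following '+'."""
    match = re.search(r"-(.+?)\+", context)
    if match is None:
        raise ValueError(f"no current phoneme in {context!r}")
    return match.group(1)


# Shortened, made-up contexts for illustration only (hypothetical data).
lines = [
    "0 5000000 xx^xx-sil+k=o/A:xx",
    "5000000 6500000 xx^sil-k+o=N/A:xx",
]
labels = [parse_hts_line(line) for line in lines]
phonemes = [current_phoneme(label.context) for label in labels]
```

In a real setup, `nnmnkwii.io.hts.load` from the linked page would take the place of this hand-rolled parser.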

kyakuno commented 1 year ago

Inference code: https://github.com/nnsvs/nnsvs/blob/master/notebooks/Demos.ipynb

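import pysinsy
import nnsvs.util
from IPython.display import Audio
from nnmnkwii.io import hts
from nnsvs.pretrained import create_svs_engine

engine = create_svs_engine("r9y9/yoko_latest")

# Build HTS full-context labels from an example score, then synthesize.
contexts = pysinsy.extract_fullcontext(nnsvs.util.example_xml_file("song070_f00001_063"))
labels = hts.HTSLabelFile.create_from_contexts(contexts)
wav, sr = engine.svs(labels)

Audio(wav, rate=sr)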

kyakuno commented 1 year ago

The project targets singing voice synthesis, so f0 is also passed in as an argument. https://r9y9.github.io/blog/2020/05/10/nnsvs/

kyakuno commented 1 year ago

The score is given as MIDI. https://r9y9.github.io/projects/nnsvs/
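If pitch is expressed as MIDI note numbers, a note-level f0 can be derived with the standard equal-temperament formula. The sketch below assumes A4 = 440 Hz at note number 69. `midi_to_hz` and `midi_to_lf0` are hypothetical helper names, not nnsvs functions.

```python
import math


def midi_to_hz(note: int, a4_hz: float = 440.0) -> float:
    """Convert a MIDI note number to frequency in Hz (12-tone equal temperament)."""
    return a4_hz * 2.0 ** ((note - 69) / 12.0)


def midi_to_lf0(note: int) -> float:
    """Natural-log f0, a common representation for acoustic models."""
    return math.log(midi_to_hz(note))


# C4, E4, G4, A4
notes = [60, 64, 67, 69]
f0 = [midi_to_hz(n) for n in notes]
lf0 = [midi_to_lf0(n) for n in notes]
```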

kyakuno commented 1 year ago

Internally, svs is composed of multiple models, so they need to be split up and each converted to ONNX. https://github.com/nnsvs/nnsvs/blob/master/nnsvs/svs.py

kyakuno commented 1 year ago

Model list

timelag_model.pth : predicts time-lag from HTS phoneme labels
duration_model.pth : predicts duration from HTS phoneme labels
acoustic_model.pth : predicts features from HTS phoneme labels
lf0_model.pth : optional component of acoustic_model
vocoder_model.pth : generates the audio waveform from features; world or usfgan
postfilter_model.pth : shapes the audio waveform output by the vocoder

kyakuno commented 1 year ago

The synthesis logic is in gen.py.

predict_acoustic obtains linguistic_features, runs acoustic_model.inference, and then applies denormalization. linguistic_features is imported from nnmnkwii.frontend.merlin. https://r9y9.github.io/nnmnkwii/latest/references/generated/nnmnkwii.frontend.merlin.linguistic_features.html https://github.com/r9y9/nnmnkwii/blob/master/nnmnkwii/frontend/merlin.py
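The denormalization step can be pictured as inverting a per-dimension standardization. This is a minimal sketch under the assumption that the features were standardized with a mean and scale per dimension; the scaler actually used in gen.py may differ, and `denormalize` is a hypothetical helper.

```python
from typing import List, Sequence


def denormalize(
    frames: Sequence[Sequence[float]],
    mean: Sequence[float],
    scale: Sequence[float],
) -> List[List[float]]:
    """Invert per-dimension standardization: x = x_norm * scale + mean."""
    return [
        [x * s + m for x, m, s in zip(frame, mean, scale)]
        for frame in frames
    ]


# Hypothetical statistics and model outputs: 2 frames, 2 feature dimensions.
mean = [5.0, 0.0]
scale = [2.0, 0.5]
normalized = [[0.0, 1.0], [-1.0, 2.0]]
restored = denormalize(normalized, mean, scale)
```

For an ONNX port, this step would stay outside the exported graph as plain post-processing on the model output.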

In predict_waveform, when vocoder_type is world, pyworld is used for synthesis. When vocoder_type is pwg or usfgan, inference is run in torch on f0_inp and aux_feats.
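The branching on vocoder_type can be sketched as a small dispatch table. Both synthesis functions here are placeholders that return dummy samples, and the names `world_synthesize`, `neural_synthesize`, and `predict_waveform_sketch` are hypothetical, not the gen.py API.

```python
from typing import Callable, Dict, List

Features = List[float]
Wave = List[float]


def world_synthesize(features: Features) -> Wave:
    # Placeholder for the pyworld (signal processing) path; no neural model here.
    return [0.0 for _ in features]


def neural_synthesize(features: Features) -> Wave:
    # Placeholder for the torch vocoder path, a candidate for ONNX export.
    return [1.0 for _ in features]


VOCODERS: Dict[str, Callable[[Features], Wave]] = {
    "world": world_synthesize,
    "pwg": neural_synthesize,
    "usfgan": neural_synthesize,
}


def predict_waveform_sketch(features: Features, vocoder_type: str) -> Wave:
    """Select the synthesis path from vocoder_type."""
    try:
        synthesize = VOCODERS[vocoder_type]
    except KeyError:
        raise ValueError(f"unknown vocoder_type: {vocoder_type!r}") from None
    return synthesize(features)


wave_world = predict_waveform_sketch([0.1, 0.2], "world")
wave_neural = predict_waveform_sketch([0.1, 0.2], "usfgan")
```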

kyakuno commented 1 year ago

extract_fullcontext uses pysinsy, which is a Python binding for sinsy. sinsy takes hiragana as input and outputs phonemes. https://github.com/r9y9/pysinsy https://github.com/r9y9/sinsy

kyakuno commented 1 year ago

text -> sinsy [native] -> hts engine [native, openjtalk] -> labels
labels -> timelag_model, duration_model [torch] -> timelag, duration
labels, timelag, duration, f0 -> linguistic_features [merlin] -> lfeatures -> acoustic_model [torch] -> features
features -> vocoder [torch] -> postprocess [torch] -> wave
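The data flow above can be sketched as a chain of callables, which is roughly the shape an ONNX-based port would take with one session per model. Every stage below is a toy stub with simplified types, the f0 input is left out for brevity, and `SVSPipeline` is a hypothetical name rather than an nnsvs class.

```python
from dataclasses import dataclass
from typing import Callable, List

Labels = List[str]
Frames = List[List[float]]
Wave = List[float]


@dataclass
class SVSPipeline:
    timelag_model: Callable[[Labels], List[int]]
    duration_model: Callable[[Labels], List[int]]
    linguistic_features: Callable[[Labels, List[int], List[int]], Frames]
    acoustic_model: Callable[[Frames], Frames]
    vocoder: Callable[[Frames], Wave]
    postprocess: Callable[[Wave], Wave]

    def run(self, labels: Labels) -> Wave:
        # labels -> timelag, duration
        timelag = self.timelag_model(labels)
        duration = self.duration_model(labels)
        # labels, timelag, duration -> lfeatures -> features
        lfeatures = self.linguistic_features(labels, timelag, duration)
        features = self.acoustic_model(lfeatures)
        # features -> vocoder -> postprocess -> wave
        return self.postprocess(self.vocoder(features))


# Toy stubs: every phoneme lasts 2 frames, and each frame yields 1 sample.
pipeline = SVSPipeline(
    timelag_model=lambda labels: [0] * len(labels),
    duration_model=lambda labels: [2] * len(labels),
    linguistic_features=lambda labels, timelag, duration: [
        [float(i)] for i, d in enumerate(duration) for _ in range(d)
    ],
    acoustic_model=lambda lfeatures: [[x[0] * 0.5] for x in lfeatures],
    vocoder=lambda features: [frame[0] for frame in features],
    postprocess=lambda wave: [max(-1.0, min(1.0, s)) for s in wave],
)
wave = pipeline.run(["sil", "k", "o"])
```

Keeping each stage behind a plain callable lets the torch models be swapped for exported ONNX sessions one at a time while the native steps (sinsy, merlin features, pyworld) stay unchanged.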