Developing a smartphone version of VOICEVOX

Hiroshiba commented 2 years ago

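I want to build a smartphone version of VOICEVOX.

Purpose

It looks like a way to advance both VOICEVOX's value (growing the number of users) and its mission (making speech synthesis characters widespread).

Background

I think most people who make videos are high school and university students, because making videos takes free time. Today's high school and university students do basically everything on their smartphones, and video production is no exception (hard for me to imagine, but...). There are few speech synthesis apps that run on smartphones, and free ones in particular should be quite rare. That is the gap we are going after. This area should be especially hard for companies to enter, because no matter how hard you try it will not make money. The intent of this project is to step into this almost unexplored territory.

Goal

For now, I'd consider it a success if we can demo an app that does TTS. How to move toward a release is something I expect to think about later.

Approach

I'm assuming OSS-based development, because I'd like to borrow the help of many people. I think starting with iOS only is fine, because the main users of Japanese TTS are users in Japan and because the devices have fairly strong compute resources. For the UI framework I'm considering React Native, because VOICEVOX is written in JS and because I want to expand to multiple platforms.

Challenges

The biggest challenge is how to run inference for the speech synthesis machine learning models. There seems to be a way to convert them to CoreML, so I'm looking into that for now. From a quick look, it also seems possible to build onnxruntime for smartphones, but I can hardly find any precedent, and I have a feeling the road ahead will be rough.

The second challenge is that openjtalk is required. I think it would be most efficient to start on this once the C++ TTS library from this project is ready.

The third challenge is the UI. I'll do my best on the design. I also think it may be enough if just accent adjustment works for now.

Other

I'm thinking of starting on this myself as soon as I have time, but I have many other tasks and haven't been able to get to it. If you're interested, please leave a comment!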

HyodaKazuaki commented 2 years ago

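The biggest challenge is how to run inference for the speech synthesis machine learning models. There seems to be a way to convert them to CoreML, so I'm looking into that for now. From a quick look, it also seems possible to build onnxruntime for smartphones, but I can hardly find any precedent, and I have a feeling the road ahead will be rough.

On this point, I found documentation for building ONNX Runtime for iOS, so I'm sharing it here. It also describes the build options for using CoreML. https://onnxruntime.ai/docs/build/ios.html

The operations supported by CoreML are listed in the following document. https://onnxruntime.ai/docs/execution-providers/CoreML-ExecutionProvider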

Hiroshiba commented 2 years ago

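Thank you! I had been looking at the same page, but I couldn't find any blog posts reporting an actual build, so I suspect getting onnxruntime to build may be a thorny path.

I hadn't looked at the operator list yet. It's hard to tell at a glance whether all of VOICEVOX's inference can be expressed with it. Some operators may be missing.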

Hiroshiba commented 2 years ago

But if we develop this as OSS, onnxruntime, which can load a model from a binary file in local storage, seems more sensible to me than CoreML, which probably has no mechanism for sharing encrypted model files (?). I'd like to go with onnxruntime!!

HyodaKazuaki commented 2 years ago

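I hadn't looked at the operator list yet. It's hard to tell at a glance whether all of VOICEVOX's inference can be expressed with it. Some operators may be missing.

If the operations are the same as in the ONNX files published in VOICEVOX/voicevox_core, we should be able to check support from those. I'll take a look.

which probably has no mechanism for sharing encrypted model files (?)

On this point, it looks like CoreML can provide models in encrypted form. https://developer.apple.com/documentation/coreml/encrypting_a_model_in_your_app https://qiita.com/kazuhiro4949/items/becb1850172d2e96281f

Also, CoreML-format models can be bundled with the app, so the model could be distributed together with the app release. https://developer.apple.com/documentation/coreml/integrating_a_core_ml_model_into_your_app?changes=latest_minor

I don't think there is likely to be a performance difference between CoreML and ONNX. So if the development policy prioritizes keeping differences from other platforms as small as possible, it would be better to be able to use the ONNX models as they are.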

HyodaKazuaki commented 2 years ago

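I checked the operations in the three ONNX models yukarin_s.onnx, yukarin_sa.onnx, and decode.onnx to see whether the CoreML execution provider can be used. The table below lists the operations used by the three models and whether each is supported.

Quite a lot of operations are unsupported, so using the CoreML execution provider in ONNX Runtime looks difficult.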

Operator	Supported?
Add	Yes
Cast	Yes
Concat	Yes
ConcatFromSequence	No
ConstantOfShape	No
Conv	Yes
ConvTranspose	No
Cos	No
Div	No
Equal	No
Expand	No
Gather	No
GRU	No
LeakyRelu	No
Loop	No
MatMul	Yes
Mul	No
Pow	No
Range	No
ReduceMean	No
Relu	Yes
Reshape	Yes
ScatterND	No
Shape	No
Sigmoid	Yes
Sin	No
Slice	No
Softmax	No
SplitToSequence	No
Sqrt	No
Sub	No
Tanh	Yes
Transpose	Yes
Unsqueeze	No
Where	No
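
The check behind this table can be reproduced mechanically. The following is a minimal sketch, not the script used to produce the table: `SUPPORTED_BY_COREML_EP` is copied from the "Yes" rows above, and `ops_in_model` is a hypothetical helper that assumes the `onnx` Python package is installed. It only looks at top-level nodes, so operators nested inside `Loop` bodies would need a recursive walk.

```python
from typing import Iterable, List, Set

# Operators marked "Yes" in the table above (CoreML execution provider).
SUPPORTED_BY_COREML_EP = {
    "Add", "Cast", "Concat", "Conv", "MatMul",
    "Relu", "Reshape", "Sigmoid", "Tanh", "Transpose",
}


def ops_in_model(path: str) -> Set[str]:
    """Return the set of top-level operator types used by an ONNX file.

    Assumes the `onnx` package is available; it is imported lazily so the
    rest of this sketch runs without it.
    """
    import onnx

    model = onnx.load(path)
    return {node.op_type for node in model.graph.node}


def unsupported_ops(used: Iterable[str], supported: Set[str]) -> List[str]:
    """Return the operators in `used` that are missing from `supported`, sorted."""
    return sorted(set(used) - supported)


if __name__ == "__main__":
    # A few operators taken from the table, standing in for a real model.
    used = ["Add", "Conv", "Gather", "GRU", "Loop", "Tanh"]
    print(unsupported_ops(used, SUPPORTED_BY_COREML_EP))
```

With a real file, `unsupported_ops(ops_in_model("decode.onnx"), SUPPORTED_BY_COREML_EP)` would give the "No" rows for that model, assuming the supported set is kept in sync with the ONNX Runtime documentation.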

For reference, here are some notes on converting to CoreML. A tool called Core ML Tools provides conversion from ONNX to CoreML, but ONNX conversion is apparently going to be removed in the next version. Direct conversion from PyTorch is provided. https://developer.apple.com/jp/documentation/coreml/converting_trained_models_to_core_ml/ https://coremltools.readme.io/docs/onnx-conversion https://coremltools.readme.io/docs/pytorch-conversion

Hiroshiba commented 2 years ago

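If the operations are the same as in the ONNX files published in VOICEVOX/voicevox_core

They haven't changed, at least for now!

Thank you for the support table!!!! It's very helpful!! And there are more unsupported operations than I expected... (I wondered where cos and sin were even used, and it's the positional encoding...) I also think using CoreML through onnxruntime would be (quite) difficult.

Thanks for the CoreML information too. So there is a way to load from a local file! In that case, I get the impression we could develop this as OSS with CoreML just fine. Well, I guess we'll go with CoreML... maybe...

There are a few other options as well, such as using onnxruntime on the CPU without CoreML, or going through a WebView and using the WebGL version of onnxruntime. The WebView route looks painful in its own way, so I have mixed feelings about it, but on iPhones, which are said to have good performance, CPU inference might be surprisingly fast. I'd like to quickly test whether CPU inference is practical. Is there a way to do that? 👀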

HyodaKazuaki commented 2 years ago

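on iPhones, which are said to have good performance, CPU inference might be surprisingly fast. I'd like to quickly test whether CPU inference is practical. Is there a way to do that? 👀

CPU inference became much faster with the move to ONNX, so it may run comfortably enough on the CPU even on iPhones and iPads. (That said, some currently supported iPhones and iPads are older models, so it probably won't be comfortable on all of them.) CocoaPods (a library manager for iOS and similar platforms) currently has onnxruntime (onnxruntime-mobile-c). With that, we should be able to check whether the ONNX models run and how fast they are.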

Hiroshiba commented 2 years ago

Oh, I see!! So it might be fairly easy to check!!

Hiroshiba commented 2 years ago

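To find out how much speed wasm can deliver, I wrote some code that runs inference on the onnx models with onnxruntime-web. https://github.com/Hiroshiba/vv_check_web/tree/6809d140e526eeaa109d64d3483329f63ee71a51

When I opened it in a browser on a PC and ran inference on the CPU, it took about 10 seconds to generate about 5 seconds of audio. Native generation finishes in under 1 second even on the CPU, so this is roughly 10 times slower. That's not really usable.

onnxruntime-web also has a WebGL mode, but it couldn't run inference because something in the model is unsupported. The error was TypeError: int64 is not supported.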
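
To see where the `int64` error comes from, one option is to list the tensors in the exported graph whose element type is int64. The sketch below is an assumption about how such a check could look, not something taken from the repository: `int64_tensor_names` is a hypothetical helper, the names in the demo graph are made up, and the code relies on ONNX encoding INT64 as element type 7 in `TensorProto`. It accepts any object shaped like `onnx.ModelProto.graph`, so it could be given `onnx.load(path).graph` when the `onnx` package is installed. Intermediate tensors produced by `Cast` nodes are not covered.

```python
from types import SimpleNamespace
from typing import List

# ONNX TensorProto.DataType value for INT64.
ONNX_INT64 = 7


def int64_tensor_names(graph) -> List[str]:
    """List graph inputs and initializers whose element type is int64.

    `graph` is expected to look like `onnx.ModelProto.graph`:
    `graph.input[i].type.tensor_type.elem_type` and
    `graph.initializer[i].data_type` hold the element types.
    """
    names = []
    for value in graph.input:
        if value.type.tensor_type.elem_type == ONNX_INT64:
            names.append(value.name)
    for tensor in graph.initializer:
        if tensor.data_type == ONNX_INT64:
            names.append(tensor.name)
    return names


def fake_input(name: str, elem_type: int):
    """Build a stand-in for an ONNX ValueInfoProto (illustration only)."""
    tensor_type = SimpleNamespace(elem_type=elem_type)
    return SimpleNamespace(name=name, type=SimpleNamespace(tensor_type=tensor_type))


if __name__ == "__main__":
    # Made-up graph: one float input (type 1) and one int64 input (type 7).
    graph = SimpleNamespace(
        input=[fake_input("f0", 1), fake_input("speaker_id", ONNX_INT64)],
        initializer=[SimpleNamespace(name="const_shape", data_type=ONNX_INT64)],
    )
    print(int64_tensor_names(graph))
```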

I'd like to find out how much faster it gets with WebGL. The code for creating the onnx models is here.

Hiroshiba commented 2 years ago

I tested with onnxruntime-web threading enabled. (thx @yamachu !!!) https://github.com/Hiroshiba/vv_check_web/tree/9adb272b576e3c125432459ee32fe6119658ac0f The time dropped substantially, but on a Core i7-11700 it still took 3.4 seconds to generate 5 seconds of audio, which still feels a bit slow.
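
One way to compare these measurements is the real-time factor (RTF): synthesis time divided by the duration of the generated audio, where values below 1 mean faster than real time. The snippet below is only arithmetic on the numbers reported in this thread. The native figure uses 1.0 s as an upper bound, because only "under 1 second" was reported, and the labels are my own shorthand.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# (label, synthesis time in seconds) for roughly 5 seconds of audio, as reported above.
MEASUREMENTS = [
    ("onnxruntime-web wasm, threading not enabled", 10.0),
    ("onnxruntime-web wasm, threading enabled", 3.4),
    ("native CPU (upper bound)", 1.0),
]

if __name__ == "__main__":
    for label, seconds in MEASUREMENTS:
        print(f"{label}: RTF = {real_time_factor(seconds, 5.0):.2f}")
```

By this measure the threaded wasm run is already faster than real time, but it is still several times slower than the native CPU run.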

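I've also started investigating the WebGL route. It's become clear that some of the processing inside the pytorch model needs to change. If you're interested, let's look into it together...!!!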

https://github.com/Hiroshiba/vv_core_inference/issues/4

Patchethium commented 2 years ago

Besides CoreML, I suggest considering NCNN or tract for mobile deployment; both run as native code. Even with WebGL, wasm can still be pretty slow.

Hiroshiba commented 2 years ago

NCNN, good one! Due to encryption, I would like to load the model from memory (not from a file), but I couldn't find in the documentation if it is possible. ;->

Patchethium commented 2 years ago

Check this tutorial, ncnn supports stripping readable information.

Hiroshiba commented 2 years ago

Great!!! I will try to convert it to NCNN model.

Patchethium commented 2 years ago

Great, BTW if you're converting from pytorch, it's recommended to give ncnn's pnnx tool a try. It can directly convert the pytorch module to ncnn without generating redundant OPs like in ONNX.

Hiroshiba commented 2 years ago

I tried to convert using ncnn from onnx, but there seemed to be a lot of errors! ;->

```bash Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! Gather not supported yet! Shape not supported yet! Gather not supported yet! # axis=0 Shape not supported yet! Gather not supported yet! # axis=0 Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Equal not supported yet! Where not supported yet! Expand not supported yet! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! ConstantOfShape not supported yet! # value 4 Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! ConstantOfShape not supported yet! # value 4 Shape not supported yet! Gather not supported yet! # axis=0 Cast not supported yet! # to=1 Range not supported yet! Gather not supported yet! # axis=2 Shape not supported yet! Expand not supported yet! Shape not supported yet! Gather not supported yet! # axis=0 Cast not supported yet! # to=7 Range not supported yet! Shape not supported yet! Gather not supported yet! # axis=0 Cast not supported yet! # to=7 Range not supported yet! Shape not supported yet! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Equal not supported yet! Where not supported yet! Expand not supported yet! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Equal not supported yet! Where not supported yet! Expand not supported yet! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Equal not supported yet! Where not supported yet! Expand not supported yet! Shape not supported yet! Unknown data type 0 ScatterND not supported yet! Gather not supported yet! # axis=2 Shape not supported yet! Expand not supported yet! Shape not supported yet! Gather not supported yet! # axis=0 Cast not supported yet! 
# to=7 Range not supported yet! Shape not supported yet! Gather not supported yet! # axis=0 Cast not supported yet! # to=7 Range not supported yet! Shape not supported yet! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Equal not supported yet! Where not supported yet! Expand not supported yet! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Equal not supported yet! Where not supported yet! Expand not supported yet! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Equal not supported yet! Where not supported yet! Expand not supported yet! Shape not supported yet! Unknown data type 0 ScatterND not supported yet! Gather not supported yet! # axis=2 Shape not supported yet! Expand not supported yet! Shape not supported yet! Gather not supported yet! # axis=0 Cast not supported yet! # to=7 Range not supported yet! Shape not supported yet! Gather not supported yet! # axis=0 Cast not supported yet! # to=7 Range not supported yet! Shape not supported yet! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Equal not supported yet! Where not supported yet! Expand not supported yet! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Equal not supported yet! Where not supported yet! Expand not supported yet! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Equal not supported yet! Where not supported yet! Expand not supported yet! Shape not supported yet! Unknown data type 0 ScatterND not supported yet! Gather not supported yet! # axis=2 Shape not supported yet! Expand not supported yet! Shape not supported yet! Gather not supported yet! # axis=0 Cast not supported yet! # to=7 Range not supported yet! Shape not supported yet! Gather not supported yet! # axis=0 Cast not supported yet! # to=7 Range not supported yet! Shape not supported yet! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Equal not supported yet! 
Where not supported yet! Expand not supported yet! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Equal not supported yet! Where not supported yet! Expand not supported yet! Shape not supported yet! ConstantOfShape not supported yet! # value 4 Equal not supported yet! Where not supported yet! Expand not supported yet! Shape not supported yet! Unknown data type 0 ScatterND not supported yet! Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! Unknown data type 0 Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! Unknown data type 0 Unsupported slice step ! Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! Shape not supported yet! Gather not supported yet! # axis=0 Cast not supported yet! # to=7 Cast not supported yet! # to=7 Shape not supported yet! Gather not supported yet! # axis=0 Shape not supported yet! Gather not supported yet! # axis=0 Cast not supported yet! # to=7 Cast not supported yet! # to=7 Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! Unknown data type 0 Unknown data type 0 Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! Unknown data type 0 Unsupported unsqueeze axes ! Unknown data type 0 Unsupported unsqueeze axes ! Unknown data type 0 Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! Unknown data type 0 Shape not supported yet! Gather not supported yet! # axis=0 Shape not supported yet! Gather not supported yet! # axis=0 Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! ConstantOfShape not supported yet! # value 4 Shape not supported yet! Gather not supported yet! # axis=0 Shape not supported yet! Gather not supported yet! # axis=0 Shape not supported yet! Gather not supported yet! 
# axis=0 Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! Unknown data type 0 Shape not supported yet! Unknown data type 0 Shape not supported yet! Unsupported squeeze axes ! Cast not supported yet! # to=7 Cast not supported yet! # to=7 Unsupported unsqueeze axes ! Unknown data type 0 Shape not supported yet! Gather not supported yet! # axis=0 Equal not supported yet! Cast not supported yet! # to=9 Where not supported yet! Cast not supported yet! # to=9 Where not supported yet! Unsupported unsqueeze axes ! Unknown data type 0 Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! Unknown data type 0 Unsupported unsqueeze axes ! Unknown data type 0 Unsupported unsqueeze axes ! Unknown data type 0 Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! Unknown data type 0 Shape not supported yet! Gather not supported yet! # axis=0 Shape not supported yet! Gather not supported yet! # axis=0 Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! ConstantOfShape not supported yet! # value 4 Shape not supported yet! Gather not supported yet! # axis=0 Shape not supported yet! Gather not supported yet! # axis=0 Shape not supported yet! Gather not supported yet! # axis=0 Shape not supported yet! Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! Unsupported unsqueeze axes ! Unknown data type 0 Shape not supported yet! Unknown data type 0 Shape not supported yet! Unsupported squeeze axes ! Cast not supported yet! # to=7 Cast not supported yet! # to=7 Unsupported unsqueeze axes ! Unknown data type 0 Shape not supported yet! Gather not supported yet! # axis=0 Equal not supported yet! Cast not supported yet! 
# to=9 Where not supported yet! Cast not supported yet! # to=9 Where not supported yet! Unsupported unsqueeze axes ! Unknown data type 0 Gather not supported yet! # axis=0 Unsupported unsqueeze axes ! Gather not supported yet! # axis=0 Gather not supported yet! # axis=0 ```
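
A log this repetitive is easier to read as a tally. The following is a small sketch for summarizing it, assuming the converter output has been saved as text; `count_unsupported_ops` is a hypothetical helper, and it only counts the `<Op> not supported yet!` messages, not the `Unsupported ... axes` or `Unknown data type` warnings.

```python
import re
from collections import Counter

# Captures the operator name in messages such as "Gather not supported yet!".
UNSUPPORTED_OP = re.compile(r"(\w+) not supported yet!")


def count_unsupported_ops(log_text: str) -> Counter:
    """Count how often each operator is reported as unsupported in the log."""
    return Counter(UNSUPPORTED_OP.findall(log_text))


if __name__ == "__main__":
    # A short excerpt from the log above, used as sample input.
    sample = (
        "Unsupported unsqueeze axes !\n"
        "Gather not supported yet!\n"
        "  # axis=0\n"
        "Shape not supported yet!\n"
        "Gather not supported yet!\n"
        "  # axis=0\n"
    )
    for op, n in count_unsupported_ops(sample).most_common():
        print(f"{op}: {n}")
```

Because the pattern is searched anywhere in the text, it works whether the messages are one per line or run together.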

Hiroshiba commented 2 years ago

Great, BTW if you're converting from pytorch, it's recommended to give ncnn's pnnx tool a try.

I didn't know there was such a thing! It's a bit of effort as it requires torch script, but I'd like to give it a try. (It looks like I could get ncnn params and bin, but it doesn't say if this will work on ncnn...)

I see that pnnx was in a separate repository. I will try to use the exe distributed in the releases here.

Patchethium commented 2 years ago

Check the second line of its README:

Note: The current implementation is in https://github.com/Tencent/ncnn/tree/master/tools/pnnx

Apparently they merged pnnx into ncnn's repo.

Hiroshiba commented 2 years ago

Oh, I know that one! I didn't find the executable binary in ncnn/tools/pnnx, but I did find it in pnnx/pnnx. Thanks!

Hiroshiba commented 2 years ago

I tried pnnx! I found that execution stopped without any useful error messages.

The .pt file can be found here. hiho_decode_script_cpu.pt is the file I am trying to convert.

The shape I'm inputting looks right at [-1,1],[-1,45],[1]i64.... It seems difficult.
https://github.com/Hiroshiba/yukarin_soso_connector/blob/b875c25a1f2e331c3647a26a692316a9e38d634e/yukarin_soso_connector/jit_forwarder/jit_forwarder.py#L255-L259

```bash
$ ./pnnx/pnnx.exe hiho_decode_script_cpu.pt inputshape=[100,1],[100,45],[1]i64 inputshape2=[200,1],[200,45],[1]i64
pnnxparam = hiho_decode_script_cpu.pnnx.param
pnnxbin = hiho_decode_script_cpu.pnnx.bin
pnnxpy = hiho_decode_script_cpu_pnnx.py
ncnnparam = hiho_decode_script_cpu.ncnn.param
ncnnbin = hiho_decode_script_cpu.ncnn.bin
ncnnpy = hiho_decode_script_cpu_ncnn.py
optlevel = 2
device = cpu
inputshape = [100,1]f32,[100,45]f32,[1]i64
inputshape2 = [200,1]f32,[200,45]f32,[1]i64
customop =
moduleop =
############# pass_level0
inline function is_tracing
inline function pad_sequence
inline function pad_sequence
inline function make_pad_mask
inline function make_non_pad_mask
inline module = espnet_pytorch_library.conformer.convolution.ConvolutionModule
inline module = espnet_pytorch_library.conformer.encoder.Encoder
inline module = espnet_pytorch_library.conformer.encoder_layer.EncoderLayer
inline module = espnet_pytorch_library.conformer.swish.Swish
inline module = espnet_pytorch_library.transformer.attention.RelPositionMultiHeadedAttention
inline module = espnet_pytorch_library.transformer.embedding.RelPositionalEncoding
inline module = espnet_pytorch_library.transformer.layer_norm.LayerNorm
inline module = espnet_pytorch_library.transformer.multi_layer_conv.MultiLayeredConv1d
inline module = espnet_pytorch_library.transformer.repeat.MultiSequential
inline module = hifi_gan.models.Generator
inline module = hifi_gan.models.ResBlock1
inline module = yukarin_soso_connector.jit_forwarder.jit_yukarin_sosoa.JitPostnet
inline module = yukarin_soso_connector.jit_forwarder.jit_yukarin_sosoa.JitYukarinSosoa
inline function is_tracing
inline function pad_sequence
inline function pad_sequence
inline function make_pad_mask
inline function make_non_pad_mask
inline module = espnet_pytorch_library.conformer.convolution.ConvolutionModule
inline module = espnet_pytorch_library.conformer.encoder.Encoder
inline module = espnet_pytorch_library.conformer.encoder_layer.EncoderLayer
inline module = espnet_pytorch_library.conformer.swish.Swish
inline module = espnet_pytorch_library.transformer.attention.RelPositionMultiHeadedAttention
inline module = espnet_pytorch_library.transformer.embedding.RelPositionalEncoding
inline module = espnet_pytorch_library.transformer.layer_norm.LayerNorm
inline module = espnet_pytorch_library.transformer.multi_layer_conv.MultiLayeredConv1d
inline module = espnet_pytorch_library.transformer.repeat.MultiSequential
inline module = hifi_gan.models.Generator
inline module = hifi_gan.models.ResBlock1
inline module = yukarin_soso_connector.jit_forwarder.jit_yukarin_sosoa.JitPostnet
inline module = yukarin_soso_connector.jit_forwarder.jit_yukarin_sosoa.JitYukarinSosoa
51 52 length.1 f00.1 phoneme.1 h.1 h0 h2.1 maxlen.1 seq_range.1 105 seq_range_expand.1 seq_length_expand.1 mask.5 111 113 mask.3 120 122 123 124 x.8 131 132 134 135 136 1094 139 1096 140 142 143 1100 145 input.2 147 148 150 151 153 154 161 bias.3 weight.3 x.3 input.8 185 186 input0.29 188 input1.25 190 input2.27 192 193 input.10 bias.5 weight.5 query.2 202 204 205 pos_bias_v.2 pos_bias_u.2 n_batch.2 234 q.2 237 k.2 240 v.2 q0.2 k0.2 value.2 q1.2 n_batch_pos.2 250 p.2 p0.2 254 q_with_bias_u.2 256 q_with_bias_v.2 258 matrix_ac.2 260 x.5 263 266 269 zero_pad.2 x_padded.2 276 279 282 283 286 x_padded0.2 290 291 292 293 295 296 1160 297 299 300 301 matrix_bd.2 303 scores.2 n_batch0.2 308 mask.2 scores0.2 311 input.12 313 x0.2 315 316 input0.10 319 320 input0.12 bias.7 weight.7 x.7 input.14 336 input0.14 338 339 340 input.16 342 343 344 input1.10 bias.9 weight.9 x.9 input.18 359 360 input0.16 362 input1.12 364 input2.8 366 1198 367 input2.10 bias.11 weight.11 input0.18 377 bias.2 weight.2 x.2 input.31 401 402 input0.35 404 input1.31 406 input2.25 408 409 input.6 bias.4 weight.4 query.1 418 420 421 pos_bias_v.1 pos_bias_u.1 n_batch.1 450 q.1 453 k.1 456 v.1 q0.1 k0.1 value.1 q1.1 n_batch_pos.1 466 p.1 p0.1 470 q_with_bias_u.1 472 q_with_bias_v.1 474 matrix_ac.1 476 x.4 479 482 485 zero_pad.1 x_padded.1 492 495 498 499 502 x_padded0.1 506 507 508 509 511 512 1255 513 515 516 517 matrix_bd.1 519 scores.1 n_batch0.1 524 mask.1 scores0.1 527 input.27 529 x0.1 531 532 input0.37 535 536 input0.33 bias.6 weight.6 x.6 input.4 553 input0.25 555 556 557 input.29 559 560 561 input1.29 bias.8 weight.8 x.1 input.33 576 577 input0.27 579 input1.27 581 input2.31 583 1293 584 input2.29 bias.10 weight.10 input0.31 bias.1 weight.1 599 h3.1 602 output1.1 606 input0.2 input1.2 input2.2 xs0.2 input0.4 input1.4 input2.4 xs1.2 input0.6 input1.6 input2.6 xs2.2 input0.8 input1.8 input2.33 xs3.2 input0.39 input1.33 650 651 output2.1 spec.1 x.10 20 663 700 input.3 702 input.5 718 input8.1 720 input9.1 input10.1 723 input11.1 725 input12.1 input13.1 728 input14.1 730 xs.5 input.7 747 input0.5 749 input1.5 input2.5 752 input3.5 754 input4.5 input5.5 757 input6.5 759 760 xs.3 input.9 777 input0.7 779 input1.7 input2.7 782 input3.7 784 input4.7 input5.7 787 input6.7 789 790 xs0.1 input0.3 input1.3 794 input.11 810 input0.9 812 input1.9 input2.9 815 input3.9 817 input4.9 input5.9 820 input6.9 822 xs.7 input.13 839 input0.11 841 input1.11 input2.11 844 input3.11 846 input4.11 input5.11 849 input6.11 851 852 xs1.1 input.15 869 input0.13 871 input1.13 input2.13 874 input3.13 876 input4.13 input5.13 879 input6.13 881 882 xs2.1 1351 input2.3 input3.3 886 input.17 902 input0.15 904 input1.15 input2.15 907 input3.15 909 input4.15 input5.15 912 input6.15 914 xs.9 input.19 931 input0.17 933 input1.17 input2.17 936 input3.17 938 input4.17 input5.17 941 input6.17 943 944 xs3.1 input.21 961 input0.19 963 input1.19 input2.19 966 input3.19 968 input4.19 input5.19 971 input6.19 973 974 xs4.1 1376 input4.3 input5.3 978 input.23 994 input0.21 996 input1.21 input2.21 999 input3.21 1001 input4.21 input5.21 1004 input6.21 1006 xs.1 input.25 1023 input0.23 1025 input1.23 input2.23 1028 input3.23 1030 input4.23 input5.23 1033 input6.23 1035 1036 xs5.1 input.1 1053 input0.1 1055 input1.1 input2.1 1058 input3.1 1060 input4.1 input5.1 1063 input6.1 1065 1066 xs6.1 1401 input6.3 input7.1 1070 1071 23
----------------
```

Patchethium commented 2 years ago

The error message is very useful. For the decoder,

terminate called after throwing an instance of 'c10::Error'
  what():  forward() Expected a value of type 'List[Tensor]' for argument 'f0_list' but instead found type 'Tensor'.

It says that you specified the f0_list in forward call to be List[Tensor] but in pnnx you use [-1,1] which means a 2d Tensor. I think you may fix it by stacking the list of f0 into one Tensor. Also, do the same thing to the phoneme list.

I also tried out the yukarin_s and yukarin_sa, got this error from both of them:

RuntimeError: index out of range in self

at the forward call of

self.speaker_embedder

I think this could be fixed by specifying an example_input in the jit export, with a speaker id no larger than the embedding size.

I'd like to fix them myself but I don't have access to the original models so ¯\_(ツ)_/¯

Hiroshiba commented 2 years ago

It's true! I ran the ubuntu version and got an error!!!

I'd like to fix them myself but I don't have access to the original models so ¯\_(ツ)_/¯

I see! The binary data of the models can be found here. https://github.com/Hiroshiba/vv_core_inference/releases/tag/0.0.1

The network structure of the model can be found here. https://github.com/Hiroshiba/yukarin_soso_connector

The conversion to torch script can be done with the following code.

python run_jit.py \
    --yukarin_s_model_dir "model/yukarin_s" \
    --yukarin_sa_model_dir "model/yukarin_sa" \
    --yukarin_sosoa_model_dir "model/yukarin_sosoa" \
    --hifigan_model_dir "model/hifigan" \
    --texts "hello" \
    --speaker_ids 0 1

Hiroshiba commented 2 years ago

I've changed List[Tensor] to Tensor! Working on this branch. https://github.com/Hiroshiba/yukarin_soso_connector/tree/to-ncnn

I ran the above code to get a new .pt file and the level0 optimization passed through 🎉. And I got a wonderful error in level1 optimization. ;->

############# pass_level1
no attribute value
Segmentation fault

2022/06/24: I created the issue.

Patchethium commented 2 years ago

Sorry recently I didn't have time to check it out 🙇

creates the issue

I guess it's better this way, the maintainer of ncnn is actively involved in the community and would give solutions way better than mine. Nevertheless, I'll keep tracking this issue whenever I have the time.

Hiroshiba commented 2 years ago

The ncnn binary for decode is ready! https://github.com/Hiroshiba/vv_core_inference/releases/tag/ncnn

Converting to ncnn via pnnx requires torch.jit.trace, and because of that the autoregression in yukarin_sa can't be used, so I haven't been able to convert sa to ncnn.

Patchethium commented 1 year ago

Have you tried it out? Actually I didn't see any issues with tracing an auto regressive model, see this tutorial.

Hiroshiba commented 1 year ago

Thanks for letting me know! In this example, the autoregression code was written in GreedySearchDecoder, where torch.jit.script was used instead of trace.

Patchethium commented 1 year ago

Sorry I wasn't around for a while. I went off to try other frameworks (ncnn, tvm, openvino, TNN, tract...) and ended up with Alibaba's MNN.

Just as NCNN is (kinda) from Tencent, MNN is made by another Chinese big tech company, Alibaba, the one running AliExpress. That could be either an advantage or a disadvantage. Fortunately it has English docs for non-Chinese speakers.

Anyway, I was able to convert the onnx models here to MNN format with little tweaking. predict-duration and predict-intonation work out of the box, while for the decoder I only needed to change an axes attribute. That's pretty amazing compared to NCNN, which can't even run Unsqueeze.

Compile MNN Convert Tool

```bash
git clone https://github.com/alibaba/MNN.git
cd MNN
mkdir build
cd build  # configure from inside build/ so that `..` points at the MNN root
cmake .. -DMNN_BUILD_CONVERTER=ON
make -j4

# convert
./MNNConvert -f ONNX --modelFile predict-duration.onnx --MNNModel predict-duration.mnn --bizCode biz

# test the inference result
python ../tools/script/fastTestOnnx.py ./onnx/predict-duration.onnx
```

Modify the decoder

```python
import onnx

model = onnx.load("decode-0.onnx")

# Find the Unsqueeze node whose axes attribute needs to change.
node = next(n for n in model.graph.node if n.name == "Unsqueeze_481")

# Replace its first attribute with axes=[0].
node.attribute.remove(node.attribute[0])
axes_attr = onnx.helper.make_attribute("axes", [0])
node.attribute.insert(0, axes_attr)

onnx.save(model, "./onnx/decode-0-modified.onnx")
```

I haven't written any deployment or inference code yet since I don't have Android Studio or Xcode on my laptop.

Edit: n.op_type -> n.name

Hiroshiba commented 1 year ago

That's great !!!!!!!!!!!!! I'm very interested whether it will work on a smart phone or not !!!!!

Patchethium commented 1 year ago

It works, if you go to the docs' about page you'll see

● iOS platform: static library size for armv7+arm64 platforms is about 5MB, size increase of linked executables is about 620KB, and metallib file is about 600KB.
● Android platform: core so size is about 400KB, OpenCL so is about 400KB, Vulkan so is about 400KB.

Originally it was made for mobile platforms, just like NCNN.

sevenc-nanashi commented 1 year ago

I put together a task list based on the Discord conversations over the past few days and what I found by trying things myself.

(Moved to voicevox/voicevox_mobile#28)

sevenc-nanashi commented 1 year ago

It looked like using the newly designed API would reduce the amount of the engine that has to be implemented in JS, so I updated the task list to use it.

VOICEVOX / voicevox_project

Developing a smartphone version of VOICEVOX #10
