"Exception: 無効なmodel_indexです: 0" ("Invalid model_index: 0") is sometimes raised when running the CUDA build

Hiroshiba commented 1 year ago

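Bug description

I have received two reports that "Exception: 無効なmodel_indexです: 0" is raised when running the CUDA build. The first is here, reportedly on the Linux build: https://github.com/VOICEVOX/voicevox_engine/issues/513#issuecomment-1368194354 The second came by DM and was on the Windows build. Reportedly, the DirectML build worked without problems.

I tried launching the Linux CUDA engine myself, but in my case it worked fine.

Symptoms / logs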

```bash
INFO: IPv6 - "POST /audio_query?text=%E3%81%BEl%E3%81%AEE%E3%81%A6evHH%E3%81%93&speaker=5 HTTP/1.1" 500 Internal Server Error
ERROR: Exception in ASGI application
Traceback (most recent call last):
  File "uvicorn/protocols/http/h11_impl.py", line 373, in run_asgi
  File "uvicorn/middleware/proxy_headers.py", line 75, in __call__
  File "fastapi/applications.py", line 208, in __call__
  File "starlette/applications.py", line 112, in __call__
  File "starlette/middleware/errors.py", line 181, in __call__
  File "starlette/middleware/errors.py", line 159, in __call__
  File "starlette/middleware/base.py", line 57, in __call__
  File "anyio/_backends/_asyncio.py", line 567, in __aexit__
  File "starlette/middleware/base.py", line 30, in coro
  File "starlette/middleware/cors.py", line 84, in __call__
  File "starlette/exceptions.py", line 82, in __call__
  File "starlette/exceptions.py", line 71, in __call__
  File "starlette/routing.py", line 656, in __call__
  File "starlette/routing.py", line 259, in handle
  File "starlette/routing.py", line 61, in app
  File "fastapi/routing.py", line 226, in app
  File "fastapi/routing.py", line 161, in run_endpoint_function
  File "starlette/concurrency.py", line 39, in run_in_threadpool
  File "anyio/to_thread.py", line 28, in run_sync
  File "anyio/_backends/_asyncio.py", line 805, in run_sync_in_worker_thread
  File "anyio/_backends/_asyncio.py", line 743, in run
  File "run.py", line 209, in audio_query
  File "voicevox_engine/synthesis_engine/synthesis_engine_base.py", line 177, in create_accent_phrases
  File "voicevox_engine/synthesis_engine/synthesis_engine_base.py", line 162, in replace_mora_data
  File "voicevox_engine/synthesis_engine/synthesis_engine.py", line 237, in replace_phoneme_length
  File "voicevox_engine/synthesis_engine/core_wrapper.py", line 463, in yukarin_s_forward
Exception: 無効なmodel_indexです: 0
```

Steps to reproduce

These commands are for Linux. Start the engine:

set -eux

# rm -rf /tmp/voicevox_engine
mkdir -p /tmp/voicevox_engine
cd /tmp/voicevox_engine

# If linux-nvidia.7z is not present
if [ ! -f linux-nvidia.7z ]; then
    wget https://github.com/VOICEVOX/voicevox_engine/releases/download/0.14.0-preview.5/linux-nvidia.7z.001
    mv linux-nvidia.7z.001 linux-nvidia.7z
fi

# If the linux-nvidia directory is not present
if [ ! -d linux-nvidia ]; then
    7z x linux-nvidia.7z
    chmod +x linux-nvidia/run
fi

# run
./linux-nvidia/run --use_gpu

Send the queries:

echo -n "こんにちは、音声合成の世界へようこそ" >text.txt

curl -s \
    -X POST \
    "localhost:50021/audio_query?speaker=1"\
    --get --data-urlencode text@text.txt \
    > query.json

curl -s \
    -H "Content-Type: application/json" \
    -X POST \
    -d @query.json \
    "localhost:50021/synthesis?speaker=1" \
    > audio.wav
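
For readers who prefer to script the reproduction, the two curl calls above can be sketched in Python with only the standard library. This is a minimal sketch, assuming the engine listens on localhost:50021 as in the commands above; the helper names `build_url` and `synthesize` are illustrative and not part of the engine.

```python
import json
import urllib.parse
import urllib.request

# Address taken from the curl commands above.
BASE_URL = "http://localhost:50021"


def build_url(endpoint: str, **params) -> str:
    """Return BASE_URL/endpoint with URL-encoded query parameters."""
    return f"{BASE_URL}/{endpoint}?{urllib.parse.urlencode(params)}"


def synthesize(text: str, speaker: int = 1) -> bytes:
    """POST /audio_query, then POST the returned query to /synthesis."""
    # Step 1: text and speaker travel in the query string; the body is empty.
    request = urllib.request.Request(
        build_url("audio_query", text=text, speaker=speaker),
        data=b"",
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        query = json.load(response)

    # Step 2: the audio query JSON is sent back as the request body.
    request = urllib.request.Request(
        build_url("synthesis", speaker=speaker),
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling `synthesize("こんにちは、音声合成の世界へようこそ")` against a running engine and writing the returned bytes to audio.wav corresponds to the two commands above. On an affected environment the first request is expected to fail with HTTP 500, which `urllib` surfaces as `urllib.error.HTTPError`.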

Expected behavior

Speech synthesis works normally.

VOICEVOX version

0.14.0-preview

OS type / distribution / version

[x] Windows
[x] Linux

Other

It's a mystery.

okaits commented 1 year ago

I cloned this repository and confirmed that the error also occurs when running it with:

sudo make run-linux-docker-nvidia-ubuntu20.04

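(When run without sudo, it worked normally.)

This may be specific to my environment, but just in case: every time I sent a request, a large number of messages like the following were printed (timestamps removed). These also disappeared when running without sudo.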

INFO onnxruntime::onnxruntime: "Removing initializer \'636\'. It is not used by any node and should be removed from the model."

The Docker image ID is 84549397d23c.

Hiroshiba commented 1 year ago

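@okaits Thank you for the report!

If it changes depending on whether sudo is used... what could that mean? Does the error occur during make, or when docker run is executed after make?

As for the flood of messages, that is expected behavior, and I think it will go away with the latest core! Since this also changes with and without sudo, maybe a different core is being picked up...?

By the way, a docker image is not accessible from my side unless it has been pushed, so knowing the image ID doesn't let me do anything...! Please let me know if you find out anything else!!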

okaits commented 1 year ago

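I just re-pulled the image as both a regular user and root, and these are the results:

Environment	Regular user	root
NVIDIA (Docker)	Success	Failure
CPU (Docker)	Success	Success
NVIDIA (local environment)	Success	Success
CPU (local environment)	Success	Success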

Item	Value
CUDA Toolkit (host)	12.0
NVIDIA Driver (host)	525.78.01
nvidia-docker2	2.11.0-1
OS (host)	Ubuntu 22.10
`uname -r`	6.2.0-rc1
Docker	23.0.0-rc.2, build 257ff41

Error message

INFO:     172.17.0.1:41592 - "POST /audio_query?speaker=3&text=何らかのテキスト HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.8/site-packages/uvicorn/protocols/http/h11_impl.py", line 373, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/home/user/.local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
    return await self.app(scope, receive, send)
  File "/home/user/.local/lib/python3.8/site-packages/fastapi/applications.py", line 208, in __call__
    await super().__call__(scope, receive, send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc
  File "/home/user/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/middleware/base.py", line 57, in __call__
    task_group.cancel_scope.cancel()
  File "/home/user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/home/user/.local/lib/python3.8/site-packages/starlette/middleware/base.py", line 30, in coro
    await self.app(scope, request.receive, send_stream.send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/middleware/cors.py", line 84, in __call__
    await self.app(scope, receive, send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/home/user/.local/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/routing.py", line 259, in handle
    await self.app(scope, receive, send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/routing.py", line 61, in app
    response = await func(request)
  File "/home/user/.local/lib/python3.8/site-packages/fastapi/routing.py", line 226, in app
    raw_response = await run_endpoint_function(
  File "/home/user/.local/lib/python3.8/site-packages/fastapi/routing.py", line 161, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/concurrency.py", line 39, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/home/user/.local/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "./run.py", line 221, in audio_query
    accent_phrases = engine.create_accent_phrases(text, speaker_id=speaker)
  File "/opt/voicevox_engine/voicevox_engine/synthesis_engine/synthesis_engine_base.py", line 183, in create_accent_phrases
    accent_phrases = self.replace_mora_data(
  File "/opt/voicevox_engine/voicevox_engine/synthesis_engine/synthesis_engine_base.py", line 168, in replace_mora_data
    accent_phrases=self.replace_phoneme_length(
  File "/opt/voicevox_engine/voicevox_engine/synthesis_engine/synthesis_engine.py", line 233, in replace_phoneme_length
    phoneme_length = self.core.yukarin_s_forward(
  File "/opt/voicevox_engine/voicevox_engine/synthesis_engine/core_wrapper.py", line 463, in yukarin_s_forward
    raise Exception(
Exception: 無効なmodel_indexです: 0

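> By the way, a docker image is not accessible from my side unless it has been pushed, so knowing the image ID doesn't let me do anything...! Please let me know if you find out anything else!!

...Sorry, I had assumed that Docker image IDs, like git commit hashes, stay the same after a pull.

Addendum ('23/01/30 23:35): added the results for the local environment.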

Hiroshiba commented 1 year ago

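Ohh, I see. It really is a mystery...

This is just a guess, but could it be that nvidia-docker is used or not used depending on whether you run with sudo...? If not, you might learn something by exec-ing into the container while docker run is active and running ldd or similar on the core library inside, or by checking whether the model directory exists.

I have no idea why model_index 0 would not be found, but I suspect causes such as "the model files underlying the voice library are missing (or cannot be found)" or "some shared libraries are missing for some reason". Since it changes with and without sudo, this is probably a problem that only occurs in @okaits's environment, and this issue has otherwise hit a dead end, so it would be a great help if you could investigate 🙇‍♂️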
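
The ldd check suggested above can be automated with a small Python sketch. It assumes only that `ldd` marks unresolved dependencies with `=> not found`, which is its usual output format. The function names are illustrative, and the path of the core library is not given in this thread, so it is left as an argument.

```python
import subprocess
from typing import List


def missing_libraries(ldd_output: str) -> List[str]:
    """Return the names of libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # Unresolved entries look like "<library name> => not found".
        if "=> not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_library(path: str) -> List[str]:
    """Run ldd on a shared library and list its unresolved dependencies."""
    result = subprocess.run(
        ["ldd", path], capture_output=True, text=True, check=False
    )
    return missing_libraries(result.stdout)
```

Running `check_library` on the core library inside the container, once as a regular user and once as root, and comparing the two lists would show whether a GPU-related shared library is missing in only one of them.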

okaits commented 1 year ago

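Running ldd on the core library gave results that differed only slightly, with no significant differences. In both cases the contents of /opt/voicevox_core/model are the same, and the hashes and permissions match.

Since I couldn't tell what the cause was, I copied both containers (regular user and privileged user) out in their entirety with docker cp and compared them with diff. Even though both were started with exactly the commands written in the Makefile, GPU-related commands and other files were missing only in the root case, which showed that the problem was in nvidia-docker2.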

Then, after searching for information on nvidia-docker2, I solved it as follows.

1. Apply the following patch to /etc/nvidia-container-runtime/config.toml


diff -ur a/config.toml b/config.toml
--- a/config.toml   2023-01-31 01:10:33.777739291 +0900
+++ b/config.toml   2023-01-31 01:10:37.870139988 +0900
@@ -10,7 +10,7 @@
 #debug = "/var/log/nvidia-container-toolkit.log"
 #ldcache = "/etc/ld.so.cache"
 load-kmods = true
-no-cgroups = true
+no-cgroups = false
 #user = "root:video"
 ldconfig = "@/sbin/ldconfig.real"


2. Restart the docker service
3. Run docker run with `--runtime=nvidia`

(From [this article](https://qiita.com/k_ikasumipowder/items/5e71208b7c7ae3e4fe7c))

So the condition that triggers `Exception: 無効なmodel_indexです: 0` is probably that the NVIDIA driver is not working.
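
A quick way to test this hypothesis from inside the container is to check whether `nvidia-smi` exists and exits successfully. The following is a minimal sketch under that assumption; the function name `gpu_driver_usable` is illustrative, and a zero exit status from `nvidia-smi` is used here only as a rough proxy for a working driver.

```python
import shutil
import subprocess


def gpu_driver_usable(command: str = "nvidia-smi") -> bool:
    """Return True if the command is on PATH and exits with status 0."""
    if shutil.which(command) is None:
        # The command is absent, e.g. the GPU tools were not mounted
        # into the container.
        return False
    try:
        result = subprocess.run(
            [command], capture_output=True, timeout=30, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0
```

If this returns False in an environment where the engine is started with `--use_gpu`, the GPU setup is a likely suspect before the engine itself.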

Hiroshiba commented 1 year ago

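Thank you for the detailed investigation and report!!

As you say, in @okaits's case I think it was because nvidia-docker was not working (?), which left the driver environment different from what was expected!

By the way, were there any error logs around that point? If there are any logs mentioning "onnxruntime" other than INFO onnxruntime::onnxruntime: "Removing initializer, it would help to note them here so that others facing the same problem can find this issue.

Thank you for the report!!!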

Hiroshiba commented 1 year ago

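(Note for developers) So even without a driver, the error occurs not when the model is loaded but when inference is actually run... Looking back at the source code, I suspect this error would probably also appear in cases such as running out of GPU memory.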
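
Because the error only surfaces at inference time, one option for developers is to wrap the first inference call and attach a diagnostic hint to the opaque message. The sketch below is purely hypothetical: `run_with_gpu_hint` and its `forward` argument are illustrative names, not part of the engine, and the hint text only summarizes the causes reported in this issue.

```python
from typing import Callable, TypeVar

T = TypeVar("T")

# Error text taken from the traceback in this issue.
INVALID_MODEL_INDEX = "無効なmodel_indexです"

# Hypothetical hint summarizing the causes reported in this thread.
HINT = (
    "The core reported an invalid model_index during inference. "
    "In reported cases the GPU environment was unusable "
    "(NVIDIA container runtime not working, leftover CUDA install)."
)


def run_with_gpu_hint(forward: Callable[[], T]) -> T:
    """Call forward() and re-raise the opaque error with a diagnostic hint."""
    try:
        return forward()
    except Exception as exc:
        if INVALID_MODEL_INDEX in str(exc):
            raise RuntimeError(f"{HINT} Original error: {exc}") from exc
        raise
```

Any other exception passes through unchanged, so the wrapper only affects the specific message discussed here.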

Hiroshiba commented 1 year ago

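@kuroneko6423, who had a similar error on Linux, may have the same cause?

As for the person with the same symptom on Windows, perhaps they were using a GPU that does not support CUDA. I'll try asking.

Once the information is a bit better organized, I'd like to close this issue for now.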

okaits commented 1 year ago

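> By the way, were there any error logs around that point? If there are any logs mentioning "onnxruntime" other than INFO onnxruntime::onnxruntime: "Removing initializer, it would help to note them here so that others facing the same problem can find this issue.

As far as I can see, there are none... I'll paste the command I ran and its output as-is.

Command / output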

$ sudo docker run --rm -it -p '127.0.0.1:50021:50021' voicevox/voicevox_engine:nvidia-ubuntu20.04-latest
+ cat /opt/voicevox_engine/README.md
(README.md output omitted)
+ exec gosu user /opt/python/bin/python3 ./run.py --use_gpu --voicelib_dir /opt/voicevox_core/ --runtime_dir /opt/onnxruntime/lib --host 0.0.0.0
Warning: cpu_num_threads is set to 0. ( The library leaves the decision to the synthesis runtime )
INFO:     Started server process [1]
INFO:     Waiting for application startup.
reading /tmp/tmp9x35fues ... 51
emitting double-array: 100% |###########################################| 

done!
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:50021 (Press CTRL+C to quit)
INFO:     172.17.0.1:40770 - "POST /audio_query?speaker=3&text=%E3%81%82%E3%81%84%E3%81%86%E3%81%88%E3%81%8A HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.8/site-packages/uvicorn/protocols/http/h11_impl.py", line 373, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/home/user/.local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 75, in __call__
    return await self.app(scope, receive, send)
  File "/home/user/.local/lib/python3.8/site-packages/fastapi/applications.py", line 208, in __call__
    await super().__call__(scope, receive, send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/applications.py", line 112, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc
  File "/home/user/.local/lib/python3.8/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/middleware/base.py", line 57, in __call__
    task_group.cancel_scope.cancel()
  File "/home/user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 662, in __aexit__
    raise exceptions[0]
  File "/home/user/.local/lib/python3.8/site-packages/starlette/middleware/base.py", line 30, in coro
    await self.app(scope, request.receive, send_stream.send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/middleware/cors.py", line 84, in __call__
    await self.app(scope, receive, send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc
  File "/home/user/.local/lib/python3.8/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/routing.py", line 656, in __call__
    await route.handle(scope, receive, send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/routing.py", line 259, in handle
    await self.app(scope, receive, send)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/routing.py", line 61, in app
    response = await func(request)
  File "/home/user/.local/lib/python3.8/site-packages/fastapi/routing.py", line 226, in app
    raw_response = await run_endpoint_function(
  File "/home/user/.local/lib/python3.8/site-packages/fastapi/routing.py", line 161, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/home/user/.local/lib/python3.8/site-packages/starlette/concurrency.py", line 39, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/home/user/.local/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/user/.local/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "./run.py", line 221, in audio_query
    accent_phrases = engine.create_accent_phrases(text, speaker_id=speaker)
  File "/opt/voicevox_engine/voicevox_engine/synthesis_engine/synthesis_engine_base.py", line 183, in create_accent_phrases
    accent_phrases = self.replace_mora_data(
  File "/opt/voicevox_engine/voicevox_engine/synthesis_engine/synthesis_engine_base.py", line 168, in replace_mora_data
    accent_phrases=self.replace_phoneme_length(
  File "/opt/voicevox_engine/voicevox_engine/synthesis_engine/synthesis_engine.py", line 233, in replace_phoneme_length
    phoneme_length = self.core.yukarin_s_forward(
  File "/opt/voicevox_engine/voicevox_engine/synthesis_engine/core_wrapper.py", line 463, in yukarin_s_forward
    raise Exception(
Exception: 無効なmodel_indexです: 0
^CINFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1]

okaits commented 1 year ago

(Looks like I got caught: I noticed the content was wrong and did git reset --hard HEAD^ and git push --force...)

kuroneko6423 commented 1 year ago

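> @kuroneko6423, who had a similar error on Linux, may have the same cause?

> As for the person with the same symptom on Windows, perhaps they were using a GPU that does not support CUDA. I'll try asking.

> Once the information is a bit better organized, I'd like to close this issue for now.

I have an RTX3060, so CUDA is supported. And of course the driver and CUDA are installed.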

Hiroshiba commented 1 year ago

I see. I think it would be good to check whether the driver is working as expected, for example by trying other GPU software.

maekawatoshiki commented 1 year ago

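I ran into this error (Exception: 無効なmodel_indexです: 0). The cause was that, besides the CUDA version shown by nvidia-smi, another CUDA install (or its remnants) was present. In my case, an old version of CUDA that I had installed manually was left in /usr/local/, and after removing it everything worked correctly.

(I don't know whether anyone else is in the same situation, but I'm leaving this here just in case.) (Separately, it also sometimes worked after unloading the nvidia driver with rmmod -f and reloading it.)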

Hiroshiba commented 1 year ago

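@maekawatoshiki Thank you for the report!!! An environment with CUDA already installed is something people who experiment a lot are likely to run into fairly often, so I think this is very useful information...!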

tarepan commented 7 months ago

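This is a detailed and clear report, and I think the issue will be very helpful to anyone who hits the same problem.

@Hiroshiba
There doesn't seem to be anything VOICEVOX Engine itself can fix here, so I think the issue can be closed (it will still show up in search after closing).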

Hiroshiba commented 7 months ago

Good point, closing!

VOICEVOX / voicevox_engine