bentoml / BentoML

The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and much more!
https://bentoml.com
Apache License 2.0

bug: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Gather node #3201


Matthieu-Tinycoaching commented 1 year ago

Describe the bug

Hi,

While running Locust load tests (100 users, spawn rate 100) against the ONNX model of cross-encoder/ms-marco-minilm-l-2-v2, it failed shortly after the start with the following message:

2022-11-08T12:29:36.521235725Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/server/runner_app.py", line 271, in _request_handler
2022-11-08T12:29:36.521240198Z     payload = await infer(params)
2022-11-08T12:29:36.521244217Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/marshal/dispatcher.py", line 166, in _func
2022-11-08T12:29:36.521248521Z     raise r
2022-11-08T12:29:36.521252449Z   File "/usr/local/lib/python3.8/dist-packages/uvicorn/protocols/http/h11_impl.py", line 407, in run_asgi
2022-11-08T12:29:36.521256722Z     result = await app(  # type: ignore[func-returns-value]
2022-11-08T12:29:36.521260777Z   File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/proxy_headers.py", line 78, in __call__
2022-11-08T12:29:36.521265065Z     return await self.app(scope, receive, send)
2022-11-08T12:29:36.521269072Z   File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/message_logger.py", line 86, in __call__
2022-11-08T12:29:36.521273401Z     raise exc from None
2022-11-08T12:29:36.521277350Z   File "/usr/local/lib/python3.8/dist-packages/uvicorn/middleware/message_logger.py", line 82, in __call__
2022-11-08T12:29:36.521281714Z     await self.app(scope, inner_receive, inner_send)
2022-11-08T12:29:36.521285768Z   File "/usr/local/lib/python3.8/dist-packages/starlette/applications.py", line 124, in __call__
2022-11-08T12:29:36.521290086Z     await self.middleware_stack(scope, receive, send)
2022-11-08T12:29:36.521294453Z   File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 184, in __call__
2022-11-08T12:29:36.521298872Z     raise exc
2022-11-08T12:29:36.521302747Z   File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/errors.py", line 162, in __call__
2022-11-08T12:29:36.521307103Z     await self.app(scope, receive, _send)
2022-11-08T12:29:36.521311166Z   File "/usr/local/lib/python3.8/dist-packages/opentelemetry/instrumentation/asgi/__init__.py", line 482, in __call__
2022-11-08T12:29:36.521315504Z     await self.app(scope, otel_receive, otel_send)
2022-11-08T12:29:36.521319514Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/server/http/instruments.py", line 293, in __call__
2022-11-08T12:29:36.521323867Z     await self.app(scope, receive, wrapped_send)
2022-11-08T12:29:36.521331276Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/server/http/access.py", line 126, in __call__
2022-11-08T12:29:36.521335629Z     await self.app(scope, receive, wrapped_send)
2022-11-08T12:29:36.521339689Z   File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 79, in __call__
2022-11-08T12:29:36.521344028Z     raise exc
2022-11-08T12:29:36.521347955Z   File "/usr/local/lib/python3.8/dist-packages/starlette/middleware/exceptions.py", line 68, in __call__
2022-11-08T12:29:36.521352304Z     await self.app(scope, receive, sender)
2022-11-08T12:29:36.521356401Z   File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 706, in __call__
2022-11-08T12:29:36.521360725Z     await route.handle(scope, receive, send)
2022-11-08T12:29:36.521365010Z   File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 276, in handle
2022-11-08T12:29:36.521369362Z     await self.app(scope, receive, send)
2022-11-08T12:29:36.521373425Z   File "/usr/local/lib/python3.8/dist-packages/starlette/routing.py", line 66, in app
2022-11-08T12:29:36.521377719Z     response = await func(request)
2022-11-08T12:29:36.521381621Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/server/runner_app.py", line 271, in _request_handler
2022-11-08T12:29:36.521385992Z     payload = await infer(params)
2022-11-08T12:29:36.521389971Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/marshal/dispatcher.py", line 166, in _func
2022-11-08T12:29:36.521394285Z     raise r
2022-11-08T12:29:36.521398173Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/marshal/dispatcher.py", line 232, in outbound_call
2022-11-08T12:29:36.521402705Z     outputs = await self.callback(tuple(d for _, d, _ in inputs_info))
2022-11-08T12:29:36.521406909Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/server/runner_app.py", line 239, in infer_batch
2022-11-08T12:29:36.521411213Z     batch_ret = await runner_method.async_run(
2022-11-08T12:29:36.521415893Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner.py", line 51, in async_run
2022-11-08T12:29:36.521420169Z     return await self.runner._runner_handle.async_run_method(  # type: ignore
2022-11-08T12:29:36.521424325Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner_handle/local.py", line 57, in async_run_method
2022-11-08T12:29:36.521428742Z     return await anyio.to_thread.run_sync(
2022-11-08T12:29:36.521432889Z   File "/usr/local/lib/python3.8/dist-packages/anyio/to_thread.py", line 31, in run_sync
2022-11-08T12:29:36.521437169Z     return await get_asynclib().run_sync_in_worker_thread(
2022-11-08T12:29:36.521441226Z   File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
2022-11-08T12:29:36.521445552Z     return await future
2022-11-08T12:29:36.521452925Z   File "/usr/local/lib/python3.8/dist-packages/anyio/_backends/_asyncio.py", line 867, in run
2022-11-08T12:29:36.521457182Z     result = context.run(func, *args)
2022-11-08T12:29:36.521461192Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runnable.py", line 139, in method
2022-11-08T12:29:36.521466675Z     return self.func(obj, *args, **kwargs)
2022-11-08T12:29:36.521470680Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/frameworks/onnx.py", line 421, in _run
2022-11-08T12:29:36.521475041Z     return self.predict_fns[method_name](output_names, input_names)[0]
2022-11-08T12:29:36.521479112Z   File "/usr/local/lib/python3.8/dist-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 200, in run
2022-11-08T12:29:36.521483497Z     return self._sess.run(output_names, input_feed, run_options)
2022-11-08T12:29:36.521487662Z onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Gather node. Name:'/bert/embeddings/word_embeddings/Gather' Status Message: /onnxruntime_src/onnxruntime/core/framework/bfc_arena.cc:342 void* onnxruntime::BFCArena::AllocateRawInternal(size_t, bool) Failed to allocate memory for requested buffer of size 132422400
2022-11-08T12:29:36.521493032Z 
2022-11-08T12:29:36.524426491Z 2022-11-08T12:29:36+0000 [ERROR] [api_server:3] Exception on /minilm_l_2_v2_similarities_async [POST] (trace=1e309291e1fb7257b9664268bd47dcdd,span=22c844b1ba1a20b1,sampled=0)
2022-11-08T12:29:36.524439840Z Traceback (most recent call last):
2022-11-08T12:29:36.524443224Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/server/http_app.py", line 311, in api_func
2022-11-08T12:29:36.524446307Z     output = await api.func(input_data)
2022-11-08T12:29:36.524449023Z   File "/home/bentoml/bento/src/bentoml_gpu_onnx_ct2_service.py", line 230, in minilm_l_2_v2_similarities_async
2022-11-08T12:29:36.524452020Z     return await onnx_ms_marco_minilm_l_2_v2_runner.run.async_run(encoded_input_onnx.get('input_ids'), encoded_input_onnx.get('token_type_ids'), encoded_input_onnx.get('attention_mask'))
2022-11-08T12:29:36.524454946Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner.py", line 51, in async_run
2022-11-08T12:29:36.524457768Z     return await self.runner._runner_handle.async_run_method(  # type: ignore
2022-11-08T12:29:36.524460452Z   File "/usr/local/lib/python3.8/dist-packages/bentoml/_internal/runner/runner_handle/remote.py", line 163, in async_run_method
2022-11-08T12:29:36.524463339Z     raise RemoteException(
2022-11-08T12:29:36.524465911Z bentoml.exceptions.RemoteException: An exception occurred in remote runner ms-marco-minilm-l-2-v2: [500] Internal Server Error
2022-11-08T12:29:36.525132624Z 2022-11-08T12:29:36+0000 [INFO] [api_server:3] 109.205.64.66:9066 (scheme=http,method=POST,path=/minilm_l_2_v2_similarities_async,type=application/json,length=9321) (status=500,type=application/json,length=2) 6564.692ms (trace=1e309291e1fb7257b9664268bd47dcdd,span=22c844b1ba1a20b1,sampled=0)

To reproduce

No response

Expected behavior

No response

Environment

bentoml: 1.0.7
python: 3.8.13
platform: Linux-5.4.0-65-generic-x86_64-with-glibc2.17
uid:gid: 1000:1000
conda: 22.9.0
in_conda_env: True

aarnphm commented 1 year ago

cc @larme

larme commented 1 year ago

@Matthieu-Tinycoaching This seems like a memory allocation error. Do you serve the model on CPU or GPU?
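For context, the failed allocation (132,422,400 bytes, ~126 MB) is the output buffer of the word-embeddings Gather, which grows linearly with the merged batch size. A minimal sketch of that scaling, with purely illustrative numbers (batch size, sequence length, and hidden size here are assumptions, not values taken from the failing request):

```python
def embedding_buffer_bytes(batch_size: int, seq_len: int,
                           hidden: int, dtype_bytes: int = 4) -> int:
    """Size of the word-embedding Gather output: one hidden vector per token."""
    return batch_size * seq_len * hidden * dtype_bytes

# Illustrative: a large batch merged by adaptive batching under 100 concurrent
# users can easily reach the same order of magnitude as the failed allocation.
print(embedding_buffer_bytes(200, 512, 384))  # 157286400 bytes, ~150 MB
```

This is why a small, fast model can still OOM under high concurrency: the dispatcher merges many concurrent requests into one large batch, and the activation buffers scale with that batch, not with the model's parameter count.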

Matthieu-Tinycoaching commented 1 year ago

Hi @larme I serve it on GPU. This seems weird, since this model is lighter and faster than other models that run fine under the same conditions.
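One thing worth trying is capping ONNX Runtime's CUDA memory arena so a burst of large batches fails fast (or falls back to CPU) rather than exhausting the GPU. A minimal sketch using the CUDAExecutionProvider options; the 2 GiB limit is an illustrative value, not a recommendation, and the commented-out session construction assumes a locally available model file:

```python
# Hypothetical configuration sketch for onnxruntime's CUDA execution provider.
cuda_provider_options = {
    "device_id": 0,
    "gpu_mem_limit": 2 * 1024**3,  # bytes; the arena will not grow past this
    "arena_extend_strategy": "kSameAsRequested",  # avoid power-of-two over-allocation
}
providers = [
    ("CUDAExecutionProvider", cuda_provider_options),
    "CPUExecutionProvider",  # fallback when the CUDA provider cannot allocate
]
# sess = onnxruntime.InferenceSession("model.onnx", providers=providers)
```

Separately, lowering `max_batch_size` in the BentoML runner's batching configuration would bound the size of the merged batch directly.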