SeldonIO / MLServer

An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more
https://mlserver.readthedocs.io/en/latest/
Apache License 2.0

gRPC async server loses track of `_futures` #1652

Open peter-resnick opened 6 months ago

peter-resnick commented 6 months ago

Hi MLServer -

To start off, this is an awesome tool and the team has done impressive work to get to this point.

I'm currently using MLServer in a high-throughput, low-latency system where we use gRPC to perform inferences. We have added an asynchronous capability to our inference client, which sends many requests to the gRPC server at once (typically about 25). We have a timeout set on our client, and we first started seeing a number of DEADLINE_EXCEEDED responses. When I looked into the model servers to figure out why they had started exceeding deadlines (we hadn't experienced this very often in the past), it turned out that the response processing loop is being restarted due to messages being lost.

We see the following traceback:

2024-03-28 19:56:42,015 [mlserver.parallel] ERROR - Response processing loop crashed. Restarting the loop...
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 186, in _process_responses_cb
    process_responses.result()
  File "/usr/local/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 207, in _process_responses
    self._async_responses.resolve(response)
  File "/usr/local/lib/python3.10/site-packages/mlserver/parallel/dispatcher.py", line 102, in resolve
    future = self._futures[message_id]
KeyError: 'cea95af0-859f-413a-a033-dfbe51e96c05'

Here the dispatcher is trying to resolve a given message, but that message has been lost. Once this error occurs, all of the rest of our parallel inference requests fail with the same exception (with a different message_id, obviously).

I took a look at the source code, and it looks like when process_responses.result() is called, the logic has a blanket exception handler: anything that isn't an asyncio.CancelledError is assumed to mean the processing loop has crashed, so the loop is restarted by scheduling a new task. It's not immediately clear (to me, at least) that this is really what should be happening. I don't see any signal from the server that the processing loop actually crashed; it just seems to be confused about which message it's supposed to be getting. A rough sketch of the pattern I'm describing is below.
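To make the failure mode easier to follow, here is a minimal sketch of the pattern described above, not MLServer's actual implementation: a registry of futures keyed by message id, a response-processing loop, and a done-callback that treats any exception other than asyncio.CancelledError as a crashed loop and reschedules it. The class and method names mirror the traceback, but the bodies are illustrative assumptions.

import asyncio


class AsyncResponses:
    def __init__(self):
        # Futures keyed by message id; resolve() raises KeyError if the id is missing.
        self._futures: dict[str, asyncio.Future] = {}

    def schedule(self, message_id: str) -> asyncio.Future:
        future = asyncio.get_running_loop().create_future()
        self._futures[message_id] = future
        return future

    def resolve(self, response):
        # This lookup is where the traceback's KeyError originates: the
        # response's id is no longer (or was never) in the registry.
        future = self._futures.pop(response.id)
        future.set_result(response)


class Dispatcher:
    def __init__(self, responses_queue: asyncio.Queue, async_responses: AsyncResponses):
        self._responses_queue = responses_queue
        self._async_responses = async_responses

    def start(self):
        loop = asyncio.get_running_loop()
        self._process_responses_task = loop.create_task(self._process_responses())
        self._process_responses_task.add_done_callback(self._process_responses_cb)

    def _process_responses_cb(self, process_responses: asyncio.Task):
        try:
            process_responses.result()
        except asyncio.CancelledError:
            # Normal shutdown: nothing to do.
            pass
        except Exception:
            # Blanket handler: any other error (including a plain KeyError
            # bubbling up from resolve()) is treated as a crashed loop, and
            # the loop is restarted.
            self.start()

    async def _process_responses(self):
        while True:
            response = await self._responses_queue.get()
            self._async_responses.resolve(response)

With this structure, a single unresolvable message id doesn't just fail that one request; the exception escapes the loop, the callback restarts it, and any in-flight futures can end up orphaned, which matches the cascade of failures we see.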

As a note about our system setup, both the model servers and our client app are deployed into Kubernetes; the servers run as a Deployment with between 10 and 15 pods at any given time, with the environment variable MLSERVER_PARALLEL_WORKERS=16.

We are also using a grpc.aio.insecure_channel(server) pattern to manage the gRPC interactions on the client side.
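For reference, here is a minimal sketch of that client pattern: roughly 25 concurrent ModelInfer calls over a single grpc.aio channel, each with a per-call deadline. This is an illustrative reproduction under assumptions, not our production client; the host/port, model name, input tensor name, shape, and deadline are placeholders to adjust for your model.

import asyncio

import grpc
import mlserver.grpc.dataplane_pb2 as dataplane
import mlserver.grpc.dataplane_pb2_grpc as dataplane_grpc


def build_request(model_name: str) -> dataplane.ModelInferRequest:
    # Single fp32 input tensor; adjust name/datatype/shape/contents to your model.
    return dataplane.ModelInferRequest(
        model_name=model_name,
        inputs=[
            dataplane.ModelInferRequest.InferInputTensor(
                name="input-0",
                datatype="FP32",
                shape=[1, 4],
                contents=dataplane.InferTensorContents(
                    fp32_contents=[0.1, 0.2, 0.3, 0.4]
                ),
            )
        ],
    )


async def infer_batch(server: str, model_name: str, n_requests: int = 25):
    async with grpc.aio.insecure_channel(server) as channel:
        stub = dataplane_grpc.GRPCInferenceServiceStub(channel)
        calls = [
            # 1s deadline per call; DEADLINE_EXCEEDED surfaces as grpc.aio.AioRpcError.
            stub.ModelInfer(build_request(model_name), timeout=1.0)
            for _ in range(n_requests)
        ]
        return await asyncio.gather(*calls, return_exceptions=True)


if __name__ == "__main__":
    # Placeholder endpoint (MLServer's default gRPC port) and model name.
    results = asyncio.run(infer_batch("localhost:8081", "my-model"))
    for result in results:
        print(type(result).__name__)

When the server-side loop gets into the state described above, runs like this go from returning responses for every call to returning deadline errors for effectively all of them.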

ShengJinhao commented 3 months ago

I'd like to ask whether you've solved this problem yet. Thanks.

ramonpzg commented 2 months ago

Hi @ShengJinhao -- Thanks for bringing this up. I will assign this issue to myself and have a look at the underlying problem.

Do you have a reproducible example you can share with me regarding this behavior?

ShengJinhao commented 1 month ago

> Hi @ShengJinhao -- Thanks for bringing this up. I will assign this issue to myself and have a look at the underlying problem.
>
> Do you have a reproducible example you can share with me regarding this behavior?

Yes. But how should I share it with you? The problem I encountered is this: after I sent a large number of requests, I received responses at first, but then stopped receiving them entirely. When I checked the logs, I found the same problem as above.