SeldonIO / MLServer

An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more
https://mlserver.readthedocs.io/en/latest/
Apache License 2.0

Asyncio Key Error Under Load #1312

Open · edfincham opened this issue 1 year ago

edfincham commented 1 year ago

When load testing an MLServer instance (deployed on AWS EKS with Seldon Core v2) using the setup below, I get the following error whenever the batch size in my load tests exceeds roughly 2-3:

mlserver 2023-07-24 10:28:25,275 [mlserver.parallel] ERROR - Response processing loop crashed. Restarting the loop...
mlserver Traceback (most recent call last):
mlserver   File "/opt/conda/lib/python3.8/site-packages/mlserver/parallel/dispatcher.py", line 55, in _process_responses_cb
mlserver     process_responses.result()
mlserver   File "/opt/conda/lib/python3.8/site-packages/mlserver/parallel/dispatcher.py", line 76, in _process_responses
mlserver     await self._process_response(response)
mlserver   File "/opt/conda/lib/python3.8/site-packages/mlserver/parallel/dispatcher.py", line 81, in _process_response
mlserver     async_response = self._async_responses[internal_id]
mlserver KeyError: '93821e47-8589-48d2-a1c1-79a145b5ccf2'
mlserver 2023-07-24 10:28:25,276 [mlserver.parallel] DEBUG - Starting response processing loop...
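
For anyone unfamiliar with the dispatcher internals, the failing lookup boils down to a future-per-request pattern roughly like the sketch below. This is an illustrative reconstruction, not the actual MLServer code; only the names _async_responses and internal_id are taken from the traceback.

import asyncio
import uuid
from typing import Dict


class ToyDispatcher:
    """Simplified sketch of the request/response bookkeeping shown in the traceback."""

    def __init__(self) -> None:
        # Maps each in-flight request's internal ID to the future awaiting its response
        self._async_responses: Dict[str, asyncio.Future] = {}

    async def dispatch(self, payload):
        internal_id = str(uuid.uuid4())
        future = asyncio.get_running_loop().create_future()
        self._async_responses[internal_id] = future
        # ...the request would be handed off to a worker process here...
        return internal_id, future

    def _process_response(self, internal_id: str, result) -> None:
        # If the entry was already removed (e.g. timeout, cancellation, or a
        # worker restart), this lookup raises KeyError and crashes the
        # response-processing loop, matching the log above.
        future = self._async_responses.pop(internal_id)
        future.set_result(result)


async def main() -> None:
    dispatcher = ToyDispatcher()
    internal_id, future = await dispatcher.dispatch({"inputs": [1, 2, 3]})

    # Simulate the failure mode: the bookkeeping entry disappears before the
    # worker's response arrives.
    dispatcher._async_responses.pop(internal_id)
    dispatcher._process_response(internal_id, "late response")  # raises KeyError


asyncio.run(main())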

For context, I'm using a server configured as follows:

apiVersion: mlops.seldon.io/v1alpha1
kind: Server
metadata:
  name: mlserver-test
  namespace: seldon-v2
spec:
  serverConfig: mlserver
  replicas: 5
  podSpec:
    containers:
    - name: mlserver
      env:
      - name: MLSERVER_PARALLEL_WORKERS
        value: "1"
      - name: SELDON_LOG_LEVEL
        value: DEBUG
      resources:
        requests:
          memory: "1000Mi"
          cpu: "500m"
        limits:
          memory: "4000Mi"
          cpu: "1000m"

And I do have adaptive batching enabled:

{
    "name": "test-model",
    "implementation": "wrapper.Model",
    "parameters": {
        "uri": "./model.onnx",
        "environment_tarball": "./environment.tar.gz"
    },
    "max_batch_size": 10,
    "max_batch_time": 0.5
}
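
The wrapper.Model implementation referenced above isn't included in the report. For context, a minimal custom runtime wrapping an ONNX model might look roughly like the sketch below; the class and method names follow MLServer's custom runtime interface, but the body, input/output names, and onnxruntime usage are assumptions rather than code from the issue.

import onnxruntime as ort
from mlserver import MLModel
from mlserver.codecs import NumpyCodec
from mlserver.types import InferenceRequest, InferenceResponse
from mlserver.utils import get_model_uri


class Model(MLModel):
    async def load(self) -> bool:
        # Resolve the model URI (e.g. ./model.onnx) from the model settings
        model_uri = await get_model_uri(self._settings)
        self._session = ort.InferenceSession(model_uri)
        self._input_name = self._session.get_inputs()[0].name
        return True

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the (adaptively batched) request into a numpy array
        input_array = self.decode(payload.inputs[0], default_codec=NumpyCodec)
        [output] = self._session.run(None, {self._input_name: input_array})
        return InferenceResponse(
            model_name=self.name,
            outputs=[NumpyCodec.encode_output(name="output", payload=output)],
        )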

The payloads are not particularly large (float32, shape [1, 192, 256]), and monitoring the pods' memory consumption shows they stay well within the specified resource limits. I've also tried setting MLSERVER_PARALLEL_WORKERS to 0, which does resolve the issue, but only by virtue of disabling parallel workers entirely.

adriangonz commented 1 year ago

Hey @edfincham,

It could be that the parallel workers are crashing for some unknown reason. Is there any other stack trace you can see in the logs?