SeldonIO / MLServer

An inference server for your machine learning models, including support for multiple frameworks, multi-model serving and more
https://mlserver.readthedocs.io/en/latest/
Apache License 2.0

Performance difference between MLServer and the base library FastAPI #854

Open · saeid93 opened this issue 1 year ago

saeid93 commented 1 year ago

As per our Slack discussion with @adriangonz, MLServer adds a latency overhead compared to a plain FastAPI server. As discussed in the same thread, some overhead is expected and negligible, but under heavy load it becomes significant enough that it might need to be addressed in future releases.

In the following benchmark, an MLServer instance serving a dummy model (an async sleep of 1 second) is compared against an equivalent FastAPI app. The effect of the parallel_workers and metrics_endpoint settings has also been evaluated. In total, the following configurations were evaluated (see the settings sketch after this list):

  1. MLServer default settings
  2. MLServer with both options disabled
  3. MLServer with only the metrics endpoint disabled
  4. MLServer with only parallel workers disabled
  5. Plain FastAPI
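
For reference, the non-default configurations above come from toggling MLServer's settings. A minimal sketch of a settings.json that disables both features (the exact file used in the benchmark isn't shown here, so treat the concrete values as an assumption):

{
    "parallel_workers": 0,
    "metrics_endpoint": null
}

Setting parallel_workers to 0 turns off the parallel inference pool, and setting metrics_endpoint to null disables the metrics endpoint; configurations 3 and 4 toggle only one of the two.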

The code of the FastAPI app:

import asyncio

from fastapi import FastAPI

app = FastAPI()

@app.post('/')
async def sumer():
    # Simulate a model that takes 1 second of async work
    await asyncio.sleep(1)
    return {'1': "1"}

The code of the MLServer app:

import json
import asyncio

from mlserver import MLModel
from mlserver.logging import logger
from mlserver.types import InferenceRequest, InferenceResponse
from mlserver.codecs import StringCodec


async def model(data, sleep=1):
    # Dummy "model" that just waits asynchronously for `sleep` seconds
    await asyncio.sleep(sleep)
    return 1


class MockOne(MLModel):
    async def load(self) -> bool:
        self.model = model
        self.loaded = True
        return self.loaded

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        if not self.loaded:
            await self.load()
        await self.model(1)
        str_out = [json.dumps({'1': '1'})]
        prediction_encoded = StringCodec.encode_output(
            payload=str_out, name="output")
        logger.error(f"Output:\n{prediction_encoded}\nwas sent!")
        return InferenceResponse(
            id=payload.id,
            model_name=self.name,
            model_version=self.version,
            outputs=[prediction_encoded],
        )
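
For completeness, a custom runtime like MockOne is normally wired up through a model-settings.json next to the code. A minimal sketch, assuming the class above lives in a module called models.py (the module and model names here are assumptions, not taken from the benchmark setup):

{
    "name": "mock-one",
    "implementation": "models.MockOne"
}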

The following are the results of sending a load of 100 RPS for 10 seconds. They show both the total experiment time and the per-request latencies. The x-axis is the chronological order of the sent requests, and as you can see there is an increasing trend in the request latencies.
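
The exact load-generation harness isn't included in this issue; as a rough sketch of the kind of client that produces this pattern (100 requests per second for 10 seconds against MLServer's V2 REST endpoint), assuming httpx as the HTTP client and the hypothetical model name "mock-one" from the sketch above:

import asyncio
import time

import httpx

URL = "http://localhost:8080/v2/models/mock-one/infer"  # MLServer's default HTTP port
PAYLOAD = {
    "inputs": [
        {"name": "input", "shape": [1], "datatype": "BYTES", "data": ["hello"]}
    ]
}

async def send(client, latencies):
    # Time a single inference request end to end
    start = time.perf_counter()
    await client.post(URL, json=PAYLOAD)
    latencies.append(time.perf_counter() - start)

async def main(rps=100, duration=10):
    latencies = []
    async with httpx.AsyncClient(timeout=30) as client:
        tasks = []
        for _ in range(duration):
            # Fire `rps` requests for this second, then sleep until the next batch
            tasks += [asyncio.create_task(send(client, latencies)) for _ in range(rps)]
            await asyncio.sleep(1)
        await asyncio.gather(*tasks)
    print(f"sent {len(latencies)} requests, "
          f"mean latency {sum(latencies) / len(latencies):.3f}s")

asyncio.run(main())

The same client can be pointed at the plain FastAPI app by swapping the URL and payload for a bare POST to '/'.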

[Figure: total experiment time and per-request latency for each of the five configurations]

According to the above benchmark, request latencies trend upwards over time, and the parallel workers appear to be the main bottleneck. Even with both options disabled there is still some gap relative to plain FastAPI, which might become noticeable at high loads, and the parallel-worker overhead seems rather high (it almost doubles everything). This confirms the conclusion from the same Slack thread that there are things we could improve in the parallel inference feature.

saeid93 commented 1 year ago

These are also the results for Seldon with svc and without svc; they might be of interest for further performance optimizations.

[Figures: seldon-with-svc, seldon-without-svc]