As per our Slack discussion with @adriangonz, MLServer shows a performance overhead in terms of observed latency compared to plain FastAPI. As discussed in the same thread, some overhead is negligible, but under heavy load there are cases that may need to be addressed in future releases.
In the following benchmark, MLServer serves a dummy model that simply awaits an async sleep of 1 second, and it is compared against an equivalent FastAPI app. The effect of the parallel_workers and metrics_endpoint settings has also been evaluated.
In total, the following configurations have been evaluated:
- MLServer with default settings
- MLServer with both options disabled
- MLServer with the metrics endpoint disabled
- MLServer with parallel workers disabled
- plain FastAPI
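For reference, both options can be toggled in MLServer's settings.json; a minimal sketch of the "both options disabled" configuration (this assumes the documented behaviour where parallel_workers set to 0 disables the inference worker pool and a null metrics_endpoint disables the metrics endpoint):

```json
{
    "parallel_workers": 0,
    "metrics_endpoint": null
}
```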
The code of the FastAPI app:
import asyncio

from fastapi import FastAPI

app = FastAPI()


@app.post('/')
async def predict():
    # Simulate a model that performs 1 second of async work
    await asyncio.sleep(1)
    return {'1': '1'}
The code of the MLServer app:
import asyncio
import json

from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.logging import logger
from mlserver.types import InferenceRequest, InferenceResponse


async def model(input, sleep=1):
    # Dummy model: all it does is wait asynchronously
    await asyncio.sleep(sleep)
    return 1


class MockOne(MLModel):
    async def load(self) -> bool:
        self.model = model
        self.loaded = True
        return self.loaded

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        if not self.loaded:
            await self.load()
        await self.model(payload)
        str_out = [json.dumps({'1': '1'})]
        prediction_encoded = StringCodec.encode_output(
            payload=str_out, name="output")
        logger.info(f"Output:\n{prediction_encoded}\nwas sent!")
        return InferenceResponse(
            id=payload.id,
            model_name=self.name,
            model_version=self.version,
            outputs=[prediction_encoded],
        )
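To load MockOne under MLServer, a model-settings.json along these lines is needed (the model name and module path here are assumptions, not taken from the benchmark setup):

```json
{
    "name": "mock-one",
    "implementation": "models.MockOne"
}
```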
The following are the results of sending a load of 100 RPS for 10 seconds. They show the total experiment time as well as per-request latencies. The x-axis is the chronological order of the sent requests; as you can see, there is an increasing trend in the request latencies.
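The load-generation tool used for the benchmark isn't shown above; a minimal sketch of an open-loop generator that fires requests at a fixed rate and records per-request latencies in send order (the send_request coroutine, e.g. an HTTP POST to either server, is left to the caller and is an assumption here):

```python
import asyncio
import time


async def run_load(send_request, rps=100, duration_s=10):
    """Fire `rps` requests per second for `duration_s` seconds,
    without waiting for earlier responses (open-loop), and return
    per-request latencies in the order the requests were sent."""
    total = rps * duration_s
    latencies = [0.0] * total

    async def timed(i):
        start = time.perf_counter()
        await send_request()
        latencies[i] = time.perf_counter() - start

    tasks = []
    for i in range(total):
        tasks.append(asyncio.create_task(timed(i)))
        await asyncio.sleep(1 / rps)  # pace sends at the target RPS
    await asyncio.gather(*tasks)
    return latencies
```

Plotting `latencies` against its index reproduces the chronological-order x-axis used in the figures above.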
According to the benchmark above, request latency trends upward over time, and the parallel workers appear to be the main bottleneck. However, there is still a gap between the all-options-disabled configuration and plain FastAPI, which might become significant under high load, and the parallel-worker overhead seems quite high (it almost doubles every latency). This confirms, as discussed in the same Slack thread, that there are things we could improve in the parallel inference feature.