claws / aioprometheus

A Prometheus Python client library for asyncio-based applications

Using the timer decorator with FastAPI causes high CPU usage and slows down the metrics endpoint #76

Closed MannerMan closed 1 year ago

MannerMan commented 2 years ago

Hi,

I'm using aioprometheus with FastAPI and uvicorn. I'm attempting to add request-time metrics using the @timer decorator, as shown in the examples.

This works, but if the endpoint with the timer decorator is called frequently for a while, the metrics endpoint provided by aioprometheus gets really slow, and the service consumes a lot of CPU whenever the metrics endpoint is called. I'm guessing that as more and more requests get counted, the metric calculations become heavier and heavier. Should the metrics registry be reset at some interval to prevent this?

Minimal code to reproduce:

import asyncio
import uvicorn
from fastapi import FastAPI, Response, status
from aioprometheus import timer, Summary, MetricsMiddleware
from aioprometheus.asgi.starlette import metrics

app = FastAPI()

app.add_middleware(MetricsMiddleware)
app.add_route("/metrics", metrics)

# Create prometheus metrics to track time spent and requests made etc.
app.state.metric_request_time = Summary(
    "request_processing_seconds",
    "Time spent processing request")

# Primary routes
@app.get("/cluster/primary-v1")
@timer(app.state.metric_request_time, {"route": "/cluster/primary-v1"})
async def primary_v1(response: Response):
    return {'state': 'testing'}

# Create a new event loop to run uvicorn on
loop = asyncio.new_event_loop()

server = uvicorn.Server(uvicorn.Config(app=app, loop=loop, log_config=None,
                                       access_log=None, host='0.0.0.0',
                                       port=7006))

future_uvicorn = loop.create_task(server.serve())

try:
    loop.run_until_complete(future_uvicorn)
except asyncio.exceptions.CancelledError:
    print("uvicorn was cancelled, exiting!")

Using Python 3.10.4, with this requirements.txt:

fastapi==0.75.1
uvicorn[standard]==0.17.6
aioprometheus[starlette]==22.3.0

I reproduce the issue by running ali for 1 hour: ali --rate=500 --duration=60m http://127.0.0.1:7006/cluster/primary-v1

When ali finishes, a GET to the /metrics endpoint takes around 500 ms and uses 100% CPU. Maybe I'm doing something wrong, or could it be a bug?

binary0111 commented 2 years ago

The following are the CPU hogs:

https://github.com/claws/aioprometheus/blob/41f09424f8ddb0d35e4b0cf1fb7e3fb3f8a8e4ae/src/aioprometheus/collectors.py#L350

Quantile estimation is slow; we could look into improving the estimation algorithm or using a different library. The official (sync) Python client does not support quantile estimation yet.

It's probably best to switch to another metric type for high-use paths, perhaps a histogram, which does less computation on the application side. That will need a custom decorator, though, or updating the current decorator code to support other metric types.
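
Something along these lines could work as a stop-gap until the decorator supports other metric types (a rough sketch only, assuming aioprometheus's Histogram.observe(labels, value) API; histogram_timer is just an illustrative name):

import time
from functools import wraps

from aioprometheus import Histogram

def histogram_timer(metric: Histogram, labels: dict = None):
    """Observe the run time of a coroutine into a Histogram."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                return await func(*args, **kwargs)
            finally:
                # Record elapsed seconds into the histogram buckets
                metric.observe(labels or {}, time.monotonic() - start)
        return wrapper
    return decorator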

https://github.com/claws/aioprometheus/blob/41f09424f8ddb0d35e4b0cf1fb7e3fb3f8a8e4ae/src/aioprometheus/metricdict.py#L52

https://github.com/claws/aioprometheus/blob/41f09424f8ddb0d35e4b0cf1fb7e3fb3f8a8e4ae/src/aioprometheus/collectors.py#L178

These can be sped up by using orjson. The official (sync) Python client follows a somewhat different model for labels.
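
For reference, the idea is roughly this: build the per-label key with orjson instead of the stdlib json module (a sketch only, assuming flat str -> str label dicts; labels_to_key is just an illustrative name):

import orjson

def labels_to_key(labels: dict) -> str:
    # OPT_SORT_KEYS keeps the key stable regardless of label insertion order
    return orjson.dumps(labels, option=orjson.OPT_SORT_KEYS).decode()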

binary0111 commented 2 years ago

https://github.com/claws/aioprometheus/pull/77

MannerMan commented 2 years ago

Cool, this looks like a very good performance improvement! I suppose that even with orjson the metrics endpoint will get slow over time, though. I think I'll take your suggestion and switch metric types, or call REGISTRY.clear() at some interval, like every 24 h. Do you think that would be a valid approach (resetting the registry at an interval)?

binary0111 commented 2 years ago

No, I don't think clearing the registry is a good approach, and I doubt it would work anyway; at best it would mask the problem. It might also be a good idea to profile your code with https://github.com/rkern/line_profiler or https://github.com/benfred/py-spy

binary0111 commented 2 years ago

https://github.com/claws/aioprometheus/pull/82

claws commented 1 year ago

The latest release, 23.3.0, includes the updates that allow a Histogram to be used with the timer decorator (which seems to be the likely way forward for this issue). Can you test whether this approach resolves the issue?

MannerMan commented 1 year ago

> The latest release, 23.3.0, includes the updates that allow a Histogram to be used with the timer decorator (which seems to be the likely way forward for this issue). Can you test whether this approach resolves the issue?

Yup!

I changed the code in the example to use Histogram instead:

from aioprometheus import Histogram

# Create prometheus metrics to track time spent and requests made etc.
app.state.metric_request_time = Histogram(
    "request_processing_seconds",
    "Time spent processing request")

Then I reran the same test on 23.3.0 that is described in the bug report, i.e. ran ali for 1 hour and then fetched the metrics endpoint.

[screenshot attached]

Closing this issue now! :tada: :100: