emmett-framework / granian

A Rust HTTP server for Python applications
BSD 3-Clause "New" or "Revised" License

Expose some generalized metrics (prometheus, etc?) #275

Closed · nemith closed this 5 months ago

nemith commented 5 months ago

I'm looking for an alternative to gunicorn, and this project looks very promising. It would be nice if there were some general high-level metrics exposed, similar to what gunicorn has via statsd: https://docs.gunicorn.org/en/stable/instrumentation.html

Perhaps something based on https://metrics.rs/?
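
For context, the gunicorn instrumentation linked above emits statsd counters and timers per request (gunicorn.requests, gunicorn.request.duration, etc.). A minimal sketch of that style of metric, emitted here from the application side with the statsd PyPI package; the granian.* metric names are made up for illustration, not an existing Granian feature:

```python
import time
from statsd import StatsClient  # pip install statsd

# Hypothetical metric names, loosely mirroring gunicorn's statsd instrumentation.
client = StatsClient("localhost", 8125, prefix="granian")

def statsd_middleware(app):
    """Wrap a WSGI app and emit a request counter plus a duration timer."""
    def wrapper(environ, start_response):
        start = time.perf_counter()
        try:
            return app(environ, start_response)
        finally:
            client.incr("requests")                               # cf. gunicorn.requests
            client.timing("request.duration",
                          (time.perf_counter() - start) * 1000)   # milliseconds
    return wrapper
```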

gi0baro commented 5 months ago

@nemith what would be the use-case for this? Or, in other words, what high level metrics would you expect to see exported by Granian?

Given the current context, it is quite hard to see any advantage compared to an implementation in the inner framework/application, which would surely have a better chance of reporting all the needed metrics and would also allow for customisation. For instance, this is what the prometheus extension for Emmett does, and maybe this is somewhat naive of me, but I think that's a better way to do something like this.
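
(For illustration, a minimal sketch of that application-level approach, using the generic prometheus_client package here rather than the Emmett extension, whose API is not shown in this thread; the metric names and port are arbitrary. The application counts and times its own requests and serves /metrics itself, with no support needed from the server.)

```python
from prometheus_client import Counter, Histogram, start_http_server  # pip install prometheus-client

REQUESTS = Counter("app_requests_total", "Total handled requests", ["method"])
LATENCY = Histogram("app_request_duration_seconds", "Request duration in seconds")

def metrics_middleware(app):
    """Application-side instrumentation: works the same under any server."""
    def wrapper(environ, start_response):
        REQUESTS.labels(environ.get("REQUEST_METHOD", "GET")).inc()
        with LATENCY.time():
            return app(environ, start_response)
    return wrapper

# Serve /metrics on a side port so Prometheus can scrape this process.
start_http_server(9100)
```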

nemith commented 5 months ago

Nevermind.

Object905 commented 1 week ago

@gi0baro Hi. I'm using WSGI workers and since they're blocking, it would be nice to see workers/backlog utilization. Also, the Rust prometheus library comes with a process feature that reports some extra statistics; process_open_fds and process_max_fds might be useful. Other metrics about CPU/memory usage I'm able to get from Kubernetes, but they might be useful for someone else.

Would you accept a PR for this?
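
(Side note: the two process_* gauges mentioned above can already be obtained from the Python side today, since prometheus_client registers a procfs-based process collector in its default registry on Linux. A minimal sketch with an arbitrary exporter port; with multiple workers, each process would need its own port or prometheus_client's multiprocess mode.)

```python
from prometheus_client import start_http_server  # pip install prometheus-client

# On Linux the default registry already contains a process collector exporting
# process_open_fds, process_max_fds, process_cpu_seconds_total,
# process_resident_memory_bytes, etc. for this worker process.
start_http_server(9101)  # scrape http://localhost:9101/metrics
```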

gi0baro commented 1 week ago

> @gi0baro Hi. I'm using WSGI workers and since they're blocking

@Object905 they're not. Workers are non-blocking for every interface, WSGI included. What you'll have in WSGI are blocking threads, which just run the Python code.

> it would be nice to see workers/backlog utilization.

When you say utilization, what do you mean exactly? Also, the backlog would be hard to monitor given it's handled directly by the kernel at the shared-socket level; maybe you mean you'd like to see when backpressure kicks in?

> Also, the Rust prometheus library comes with a process feature that reports some extra statistics; process_open_fds and process_max_fds might be useful. Other metrics about CPU/memory usage I'm able to get from Kubernetes, but they might be useful for someone else.

If I'm not mistaken, the prometheus implementations in Rust rely on the tracing crate, which usually decreases performance quite a lot. I'm not sure I would happily exchange a lot of performance for metrics.

> Would you accept a PR for this?

As soon as we have well-defined metrics and a clear use-case, sure thing.

Object905 commented 1 week ago

Yes, blocking threads mostly. Backpressure would also be nice to see, for completeness.

I had a problem where one of my apps has some really long requests that usually come in bursts, so there are times when the k8s healthcheck (which is itself a request) starts to fail due to timeout because Granian is busy handling other requests, and the pod gets terminated. I know it would be better to solve that with asyncio/shorter requests etc., but that's too much work with the legacy code 🙂 So I was thinking of catching such situations and "scaling early" based on a rough "% of time blocking threads are busy over the last 30 sec" metric. Requests are handled with least-connections load balancing, so by catching this early, total blockage can be avoided. I could achieve that by other means, but Granian seems to be "closest to the source".
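
(A rough application-side approximation of that signal, sketched with prometheus_client; the metric name is arbitrary, and the "% busy over the last 30 sec" part would be computed at query time, e.g. with avg_over_time in PromQL.)

```python
from prometheus_client import Gauge  # pip install prometheus-client

BUSY = Gauge("app_busy_blocking_threads", "Requests currently executing Python code")

def busy_middleware(app):
    """Track how many blocking threads are currently occupied by requests.
    Simplified: time spent iterating the response body is not counted."""
    def wrapper(environ, start_response):
        BUSY.inc()
        try:
            return app(environ, start_response)
        finally:
            BUSY.dec()
    return wrapper

# Example scaling/alerting expression (PromQL), evaluated by Prometheus,
# dividing by the configured number of blocking threads (4 assumed here):
#   avg_over_time(app_busy_blocking_threads[30s]) / 4 > 0.8
```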

For other, unrelated apps: I've played with different configurations of workers/backpressure/backlog and found that the default heuristic works nicely, but I feel there is still room for improvement if I could see such metrics during traffic spikes to do some fine-tuning.

Regarding the process feature: it doesn't do any tracing. It's just a wrapper around procfs.
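
(Concretely, on Linux those two gauges come down to simple procfs/rlimit lookups; a rough Python equivalent, accurate only to a first approximation:)

```python
import resource
from pathlib import Path

# Approximate process_open_fds: count entries in /proc/self/fd
# (includes the descriptor opened to scan the directory itself).
open_fds = len(list(Path("/proc/self/fd").iterdir()))

# Approximate process_max_fds: the soft limit on open file descriptors.
max_fds, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)

print(f"process_open_fds={open_fds} process_max_fds={max_fds}")
```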