Summaries are getting slower and using more and more memory in long term

pelt commented 1 year ago

The underlying quantile package seems to get slower and use more memory when configured with more than one invariant, which is the default in both aioprometheus and quantile. This becomes an issue for long-running services that gather millions of measurements for one summary metric. The response time for Prometheus can increase to over one second.

A current workaround is to use exactly one invariant/quantile (if it's feasible for your use cases) so that this issue is not triggered within the quantile package.

JacobHenner commented 1 year ago

> The underlying quantile package seems to get slower and use more memory when configured with more than one invariant, which is the default in both aioprometheus and quantile.

How was this assessed? Do you have a reproducer?

Is this an issue with how aioprometheus uses the quantile library? Or is there a bug in the quantile library itself?

Why is the use of one invariant vs > 1 invariant relevant?

JacobHenner commented 1 year ago

I've been experimenting with this. I think there's either a bug or a deliberate difference in the quantile library compared to the quantile libraries used by Prometheus client libraries in other languages.

To test this theory, I calculated the (0.5, 0.005), (0.90, 0.001), (0.99, 0.0001) quantiles for one million random integers in the range [0, 10) using both https://github.com/matttproud/python_quantile_estimation (Python, used here) and github.com/beorn7/perks/ (Go, used in the official Go Prometheus client library).

The Go implementation took ~1.5 seconds to run and maintained ~1250 samples. Using the same class of input, the Python implementation maintained 104659 samples and took ~192 seconds. Both libraries claim to use the same algorithm from the paper "Effective Computation of Biased Quantiles over Data Streams".
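For anyone who wants to repeat this kind of measurement, a minimal harness might look like the sketch below. It is not the script used for the numbers above. `KeepEverything` is a hypothetical stand-in estimator and `run_benchmark` is a hypothetical helper; a real library would need a small adapter exposing the same `observe()` and `retained()` methods, and how to count its retained samples depends on that library's internals.

```python
import random
import time


class KeepEverything:
    """Hypothetical stand-in estimator that stores every observation."""

    def __init__(self):
        self.samples = []

    def observe(self, value):
        self.samples.append(value)

    def retained(self):
        # Number of samples the estimator currently holds in memory.
        return len(self.samples)


def run_benchmark(estimator, n=1_000_000, seed=0):
    """Feed n random integers in [0, 10) to the estimator.

    Returns (elapsed_seconds, retained_sample_count).
    """
    rng = random.Random(seed)
    start = time.perf_counter()
    for _ in range(n):
        estimator.observe(rng.randrange(10))
    elapsed = time.perf_counter() - start
    return elapsed, estimator.retained()


if __name__ == "__main__":
    elapsed, kept = run_benchmark(KeepEverything(), n=100_000)
    print(f"{elapsed:.3f}s, {kept} samples retained")
```

Running the same harness against two estimators with the same seed gives comparable timings and retained-sample counts.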

I don't believe this is an issue specific to aioprometheus. However, one thing to note is that other prometheus client libraries (including Java, Go) implement sliding windows for Summaries. If I understand correctly, having sliding windows in aioprometheus's Summary implementation would provide an upper limit on how many samples would be retained (the maximum number of observations logged within the window). Perhaps supporting sliding windows should be considered. It looks like someone has tried: https://github.com/RefaceAI/aioprometheus-summary/blob/main/aioprometheus_summary/__init__.py
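To illustrate why a window puts an upper limit on retained samples, here is a minimal sketch. It is not how aioprometheus, the Go client, or the Java client implement it. The class name `SlidingWindowSamples`, the `max_age` parameter, and the injectable `clock` are all assumptions made for the example, and it computes exact quantiles by sorting the window rather than using the streaming algorithm from the paper.

```python
import time
from collections import deque


class SlidingWindowSamples:
    """Keeps only observations made within the last max_age seconds."""

    def __init__(self, max_age=600.0, clock=time.monotonic):
        self.max_age = max_age
        self.clock = clock  # injectable so the window can be tested
        self._samples = deque()  # (timestamp, value) pairs, oldest first

    def _evict(self, now):
        # Drop observations that have aged out of the window.
        while self._samples and now - self._samples[0][0] > self.max_age:
            self._samples.popleft()

    def observe(self, value):
        now = self.clock()
        self._evict(now)
        self._samples.append((now, value))

    def query(self, q):
        """Exact q-quantile over the current window, or None if it is empty."""
        self._evict(self.clock())
        if not self._samples:
            return None
        ordered = sorted(value for _, value in self._samples)
        index = min(len(ordered) - 1, int(q * len(ordered)))
        return ordered[index]

    def __len__(self):
        return len(self._samples)
```

Memory use is then bounded by the number of observations that arrive within `max_age`, regardless of how long the service has been running.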

JacobHenner commented 1 year ago

> The Go implementation took ~1.5 seconds to run and maintained ~1250 samples. Using the same class of input, the Python implementation maintained 104659 samples and took ~192 seconds. Both libraries claim to use the same algorithm from the paper "Effective Computation of Biased Quantiles over Data Streams".

I've written my own implementation, inspired by the Go implementation. Its performance and memory utilization are much better, and it passes the Go implementation's tests. I hope to release it publicly sometime soon.

alfiedotwtf commented 4 months ago

Hey Jacob. Did you manage to release the faster implementation? Having a look at the source, it seems that it's still using quantile-python, which looks like it's from 2015 on PyPI.

JacobHenner commented 4 months ago

> Hey Jacob. Did you manage to release the faster implementation? Having a look at the source, it seems that it's still using quantile-python, which looks like it's from 2015 on PyPI.

Not yet, but I did get approval to do so. I'll try to share it as soon as I have a chance.

alfiedotwtf commented 4 months ago

No problem, thanks for the update.

claws / aioprometheus

Summaries are getting slower and using more and more memory in long term #88