GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0

How to resource scale the default GMP rules-evaluator? #971

Closed: clearclaw closed this issue 5 months ago

clearclaw commented 5 months ago

Since I added a few dozen rewrite rules, the GMP-default rules-evaluator is frequently being OOMKilled. Editing the deployment to raise the resources gets reset to the defaults by the nanny, and I'm not seeing a ConfigMap or other way to override them. What should I do?

I'm currently running a few dozen rewrite rules (some admittedly busy) and expect to be in the middle-to-upper hundreds of rewrite rules (e.g. for SLOs) as we go to production. I'm trying to figure out how to get this to scale.


pintohutch commented 5 months ago

Hi @clearclaw,

Thanks for reaching out, and apologies you are hitting this.

I imagine you are using the built-in rule-evaluator as part of the managed collection stack.

This component is hardcoded to have a 1G memory limit, which is what you are likely hitting.

Out of curiosity, how many Rules resources are you using? And are your rule queries typically consolidated into a few groups, or spread among many?

clearclaw commented 5 months ago

Hi @pintohutch !

Yep, GMP and built-in evaluator. Current breaking load is a ~dozen groups each of 10-25 rules.

I'm guessing from your question that rule groups are executed in a single transactional context and are thus an indivisible resource unit? I can look at breaking up some of the larger groups, but as this rolls out we'll "naturally" have 10^2 groups distributed along a Zipf curve, from mostly ~5-rule groups up to a handful of 20-30-rule groups at the upper end. Am I looking at unhappiness?

I'm also guessing that the current GMP-default replicaset of 2 evaluators doesn't scale...?

clearclaw commented 5 months ago

@pintohutch What are the primary factors in memory consumption in the default rule-evaluator?

pintohutch commented 5 months ago

Hey @clearclaw,

I'm guessing from your question that rule groups are executed in a single transactional context and are thus an indivisible resource unit?

Sort of, per the docs:

Rules within a group are run sequentially at a regular interval, with the same evaluation time.

Essentially, the more distinct rule groups you have, the more parallel executions there are. If you have a long evaluation interval (e.g. 5m+), you may be able to group more of your rules together so you have fewer concurrent evaluations, and presumably a smaller resource hit.
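
For illustration, a minimal sketch of what consolidating several recording rules into a single group of one Rules resource could look like (the metric names, labels, and 5-minute interval here are assumptions, not taken from this thread):

```yaml
# Sketch only: one group means one sequential evaluation stream instead of
# many parallel ones. Rule and metric names below are illustrative placeholders.
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: consolidated-slo-rules
  namespace: default
spec:
  groups:
    - name: slo-recording-rules     # single group -> rules run sequentially
      interval: 300s                # longer interval eases concurrent load
      rules:
        - record: job:http_requests_errors:rate5m
          expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
        - record: job:http_requests:rate5m
          expr: sum by (job) (rate(http_requests_total[5m]))
```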

I can look at breaking up some of the larger groups, but as this rolls out we'll "naturally" have 10^2 groups distributed along a Zipf curve, from mostly ~5-rule groups up to a handful of 20-30-rule groups at the upper end. Am I looking at unhappiness?

Manually breaking up rule groups does not sound fun :), alternatively...

The rule-evaluator could be a good candidate for a VPA. If you're using a GKE Standard cluster, you need to ensure vertical Pod autoscaling is enabled first. You can adjust the example to suit your needs. Note: we have not extensively tested this ourselves, and we'd love any feedback you have from using it.
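
Something along these lines might work as a starting point (a sketch only; the Deployment name, namespace, and resource bounds are assumptions you'd want to verify and tune for your cluster):

```yaml
# Sketch of a VerticalPodAutoscaler targeting the managed rule-evaluator.
# Assumes the Deployment is named "rule-evaluator" in the "gmp-system"
# namespace; check the actual name/namespace in your cluster first.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: rule-evaluator-vpa
  namespace: gmp-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: rule-evaluator
  updatePolicy:
    updateMode: "Auto"        # let the VPA apply its recommendations
  resourcePolicy:
    containerPolicies:
      - containerName: "*"
        minAllowed:
          memory: 1Gi         # illustrative floor
        maxAllowed:
          memory: 4Gi         # illustrative ceiling
```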

What are the primary factors in memory consumption in the default rule-evaluator?

Empirically, we've seen memory usage increase with the number of rule groups, as well as with the complexity of the queries (i.e. how long the rule-evaluator's gRPC client has to hold the connection open).

Hope that helps.

lyanco commented 5 months ago

Could you possibly paste a smattering of your rules? Long-horizon queries, especially those that look back further than 25 hours, can be much slower than those within the 25-hour horizon. This slowness could be causing your rule-evaluator to wait longer, consuming more resources and causing this issue.

clearclaw commented 5 months ago

(progress report)

Thanks, Daniel. The VPA is great. Memory is consistently riding between 1.4GB and 1.8GB, but it is solid as a rock (and I'm amused that the nanny is not resetting it). Which is cool, as I'm in frantic prep for a company demo right now.

Lee, I'll see about getting you a set of sample rules, but I'm distracted by the impending demo. Meanwhile, the primary offender appears to have been a ruleset attempting to recast Istio's istio_request_duration_milliseconds into seconds (for use with tools like Pyrra that insist on a timebase of seconds). Much has changed in the last 24 hours because of the demo, but our primary load is going to be the GMP Rules versions of Pyrra's generated PrometheusRules for SLOs across a hundredish microservices, gateways, and related bits. Apologies for the delay.
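
For a rough idea of the shape of that recast, a simplified sketch (not the actual ruleset from this thread; the grouping labels and interval are illustrative, and note that the histogram _bucket series' le boundaries can't be rescaled arithmetically in PromQL, which is part of why this is awkward):

```yaml
# Simplified milliseconds-to-seconds recast sketch -- illustrative only.
apiVersion: monitoring.googleapis.com/v1
kind: Rules
metadata:
  name: istio-duration-seconds
spec:
  groups:
    - name: istio-duration-seconds
      interval: 60s
      rules:
        - record: istio_request_duration_seconds_sum
          expr: >
            sum by (destination_workload, destination_workload_namespace)
            (istio_request_duration_milliseconds_sum) / 1000
        - record: istio_request_duration_seconds_count
          expr: >
            sum by (destination_workload, destination_workload_namespace)
            (istio_request_duration_milliseconds_count)
```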

pintohutch commented 5 months ago

Thanks, Daniel. The VPA is great. Memory is consistently riding between 1.4GB and 1.8GB, but it is solid as a rock (and I'm amused that the nanny is not resetting it). Which is cool, as I'm in frantic prep for a company demo right now.

Fantastic! Maybe we should support built-in autoscaling then. Created https://github.com/GoogleCloudPlatform/prometheus-engine/issues/975.

clearclaw commented 5 months ago

Coolness. In the next ~week I'll be dropping in the middle hundreds of Pyrra SLOs and their supporting recording rules. I'll report back on how/if that flies.

Dunno if this is your bag, so I'm just mentioning it in case it is, as the issue I raised with GCP got submarined: https://github.com/pyrra-dev/pyrra/discussions/1062. Short version: upstream Prometheus is generous in what it accepts and GMP/Monarch isn't (I assume from a protobuf reduction), and that breaks stuff. The current result is that I'm running an NGINX/Lua proxy to rewrite queries to Prometheus, and that's, umm, LessGood.

-- JCL then mutters something about please-pretty-please patching the CRDs for OperatorConfig for GitOps deploys.

bwplotka commented 5 months ago

Thanks! We can also check what we can do about the extra request/query parameters; validating all parameters might be a better UX on our side. Let us know (ideally in a separate issue) if this becomes a blocker. Thanks!