GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0
195 stars 93 forks source link

OperatorConfig compression setting is now also applied for rule-eval. configMap from *Rules CRs [0.9] #763

Closed bwplotka closed 10 months ago

bwplotka commented 10 months ago

Fixes https://github.com/GoogleCloudPlatform/prometheus-engine/issues/549

bwplotka commented 10 months ago

Done, also tried to extend the e2e test to validate this. Added https://github.com/GoogleCloudPlatform/prometheus-engine/issues/765 idea too.

Testing using SKIP_GCM=false locally...

bwplotka commented 10 months ago

I didn't test it yet successfully against GCM, so technically this change is untested. Don't have time to manually test it.

Tagged v0.9.1-rc.3 without this change so far, up to us how we can continue here @pintohutch @TheSpiritXIII

pintohutch commented 10 months ago

I'm ok merging as well @bwplotka.

Note this should become much easier once #755 is merged and the e2e refactor is readily available. There you can just add a test function or step to the ruler_test.go file and run GOOGLE_APPLICATION_CREDENTIALS=<path-to-file> TEST_RUN=TestRuleEvaluator make e2e and it should run in a few minutes against GCM.

bwplotka commented 10 months ago

Ok, I am going to continue my e2e tests, but starting the tag + Louhi process to not waste time. Thanks for reviews!

bwplotka commented 10 months ago

... wait, I see a total disaster, blocking this and investigating.

image

bwplotka commented 10 months ago

Ok, we would just break & block another release of 0.9 accidently with this PR...

After hardcode debugging (I love kubectl debug now, but it's cryptic):

kubectl debug -it -n gmp-test-testruleevaluation-gzip-20240112-112556 rule-evaluator-6f9c9d85f7-qc9gt --image=busybox:1.28 --target evaluator:

image

Yamls are still compressed, because config-reloader does the magic for config files only. Other files are only "watched". We need to add a feature to config-reloader first and make sure we update images once this operator mode is enabled.

Closing for now, let's fix it in 0.10

pintohutch commented 10 months ago

Ah good catch - E2E tests would've caught that as well

shinebayar-g commented 8 months ago

It looks we're past 0.10 as https://github.com/GoogleCloudPlatform/prometheus-engine/releases/tag/v0.11.0-rc.2 is created? I don't know how you're versioning your repo. Is 0.10 skipped? I see bunch of 0.10-rc then jump to 0.11-rc.. Is there any chance this getting prioritized? Our GMP has been broken for months.

TheSpiritXIII commented 8 months ago

Hi @shinebayar-g we're still in the process of validating the 0.10 release. Our release process involves letting the release candidate soak in the GKE Rapid release channel first before stabilization. We're trying to release 0.10 soon and then put 0.11-rc into the GKE Rapid release channel next.

This project has been deprioritized now that Cloud Alerting supports PromQL alerts. We're recommending users move rules into Cloud Alerting where possible. You may still want some application rules and recording rules (which Cloud Alerting does not support) on the rule-evaluator.

I investigated the effort and it is sizable. Our biggest blocker right now is Thanos' reloader does not support decompressing directories. This would involve tracking files being added, renamed and removed. To get this merged upstream, this would also have to support symlinks. Since Thanos and we are open source, we're happy to accept contributions and review code in either repository.

Hope that helps!

shinebayar-g commented 8 months ago

Thanks for the reply @TheSpiritXIII

This project has been deprioritized now that Cloud Alerting supports PromQL alerts.

By this project you mean Google Managed Prometheus as a whole? Or specific component of it?

TheSpiritXIII commented 8 months ago

Only the rule-evaluator component has been deprioritized for the time being. Google Managed Service for Prometheus is very much alive and we're actively working on it!

shinebayar-g commented 8 months ago

We're recommending users move rules into Cloud Alerting where possible.

By this are you referring to this? https://cloud.google.com/monitoring/promql/promql-in-alerting & https://cloud.google.com/monitoring/promql/promql-migrate?

We already made a huge migration when we onboarded to GMP from Prometheus Operator. PrometheusRule -> Rules etc..

and it looks like this new one is configurable through Terraform which is completely different deployment model and api than what we currently have.. 😞

shinebayar-g commented 8 months ago

Okay, we'll follow https://cloud.google.com/stackdriver/docs/managed-prometheus/rules-unmanaged#eval-rules-unmanaged document to deploy self managed rule-evaluator.

Our biggest blocker right now is Thanos' reloader does not support decompressing directories.

This is assuming rules-generated config map is already compressed. But this isn't possible today. It looks like this PR already contains the necessary code to compress the generated configmap. If this PR is merged I can work on decompressing code on sidecar container side.

TheSpiritXIII commented 8 months ago

@shinebayar-g whipped up https://github.com/thanos-io/thanos/pull/7199 real quick. Haven't done much testing on it but hypothetically it should work with this change. If we get that change reviewed and merged upstream, we can try and revive this PR!

In the meanwhile you could try and build this PR yourself. Sadly, we can't currently merge this PR right now because it would be broken.

Thanks for your interest in GMP!

shinebayar-g commented 8 months ago

Sadly, we can't currently merge this PR right now because it would be broken.

Yeah I agree, thanks for the PR. Hope we'll get it soon! 🚀

TheSpiritXIII commented 7 months ago

@shinebayar-g fyi I've reopened this PR as https://github.com/GoogleCloudPlatform/prometheus-engine/pull/897 -- there's some dependencies we need released first so please track progress there! Thanks!

shinebayar-g commented 7 months ago

Thanks for the update @TheSpiritXIII