GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0

feat: add /api/v1/alerts endpoint #1161

Closed yama6a closed 1 month ago

yama6a commented 1 month ago

Adding a prometheus-compatible API endpoint to rule-evaluator: /api/v1/alerts
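
For readers who want to try the endpoint once it is released, here is a minimal sketch of a client. It assumes the response follows the JSON envelope of upstream Prometheus' `/api/v1/alerts` (which is what "prometheus-compatible" suggests). The names `parseAlerts` and `fetchAlerts` and the sample payload are made up for illustration and are not part of this PR.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// alert holds the fields of one entry in a Prometheus-style /api/v1/alerts response.
type alert struct {
	Labels      map[string]string `json:"labels"`
	Annotations map[string]string `json:"annotations"`
	State       string            `json:"state"`
	ActiveAt    string            `json:"activeAt"` // RFC 3339 timestamp, kept as a string here
	Value       string            `json:"value"`
}

// alertsResponse is the outer envelope: {"status": ..., "data": {"alerts": [...]}}.
type alertsResponse struct {
	Status string `json:"status"`
	Data   struct {
		Alerts []alert `json:"alerts"`
	} `json:"data"`
}

// parseAlerts decodes a response body and returns the alerts it contains.
func parseAlerts(body []byte) ([]alert, error) {
	var resp alertsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, fmt.Errorf("decode alerts response: %w", err)
	}
	if resp.Status != "success" {
		return nil, fmt.Errorf("unexpected API status %q", resp.Status)
	}
	return resp.Data.Alerts, nil
}

// fetchAlerts calls GET <baseURL>/api/v1/alerts and parses the result.
// baseURL is whatever address your rule-evaluator is reachable at (an assumption).
func fetchAlerts(baseURL string) ([]alert, error) {
	resp, err := http.Get(baseURL + "/api/v1/alerts")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected HTTP status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return nil, err
	}
	return parseAlerts(body)
}

func main() {
	// Hypothetical sample payload, so the sketch runs without a cluster.
	sample := []byte(`{"status":"success","data":{"alerts":[
		{"labels":{"alertname":"HighLatency","severity":"page"},
		 "annotations":{"summary":"latency is high"},
		 "state":"firing","activeAt":"2024-01-01T00:00:00Z","value":"1e+00"}]}}`)

	alerts, err := parseAlerts(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, a := range alerts {
		fmt.Printf("%s state=%s value=%s\n", a.Labels["alertname"], a.State, a.Value)
	}
}
```

Against a real deployment, `fetchAlerts` would be called with the address of the rule-evaluator pod or a port-forward to it.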

penekk commented 1 month ago

Excellent @yama6a

Would this finally be enough to pull the rules into Grafana (from the evaluator), like it is possible with vanilla prometheus?

yama6a commented 1 month ago

> Excellent @yama6a
>
> Would this finally be enough to pull the rules into Grafana (from the evaluator), like it is possible with vanilla prometheus?

@penekk Probably yes-ish.

For now, the endpoints will only be available on the rule-evaluator pod, not the prometheus-frontend pod, the latter of which you will probably use as a Prometheus-Datasource at the moment (running PromQL queries, and fetching metrics/labels, etc).

A temporary hack could be to add the rule-evaluator pod as a separate Prometheus-Datasource in Grafana (once this is released), but I hope to make a PR for a proper solution soon.

Currently, the frontend will blindly proxy all calls to /api/* to Google-Monitoring's prometheus endpoint (https://monitoring.googleapis.com/v1/projects/PROJECT_ID/location/global/prometheus), which doesn't work for these two new endpoints. I'm working on fixing that, and hope to have a solution soon. Hopefully before the next release.
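
To illustrate the routing idea, here is a rough sketch using Go's standard `httputil.ReverseProxy`. It is not the actual frontend code. It assumes the two endpoints are `/api/v1/rules` and `/api/v1/alerts` (only the latter is named explicitly in this thread), and the rule-evaluator address is a hypothetical placeholder. Authentication and Host-header handling for the Google endpoint are deliberately left out.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// ruleEvaluatorPaths lists the API paths that should go to the rule-evaluator
// instead of the upstream query endpoint. The exact set is an assumption.
var ruleEvaluatorPaths = map[string]bool{
	"/api/v1/rules":  true,
	"/api/v1/alerts": true,
}

// servedByRuleEvaluator reports whether a request path belongs to the rule-evaluator.
func servedByRuleEvaluator(path string) bool {
	return ruleEvaluatorPaths[strings.TrimSuffix(path, "/")]
}

// newRouter returns a handler that forwards rule and alert requests to the
// rule-evaluator and everything else to the upstream endpoint.
func newRouter(upstream, ruleEvaluator *url.URL) http.Handler {
	upstreamProxy := httputil.NewSingleHostReverseProxy(upstream)
	ruleProxy := httputil.NewSingleHostReverseProxy(ruleEvaluator)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if servedByRuleEvaluator(r.URL.Path) {
			ruleProxy.ServeHTTP(w, r)
			return
		}
		upstreamProxy.ServeHTTP(w, r)
	})
}

func main() {
	upstream, err := url.Parse("https://monitoring.googleapis.com/v1/projects/PROJECT_ID/location/global/prometheus")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Hypothetical in-cluster address; adjust to your deployment.
	ruleEvaluator, err := url.Parse("http://rule-evaluator.example.svc:19092")
	if err != nil {
		fmt.Println("error:", err)
		return
	}

	handler := newRouter(upstream, ruleEvaluator)
	_ = handler // e.g. http.ListenAndServe(":8080", handler)

	// Show where a few paths would be routed, without starting a server.
	for _, p := range []string{"/api/v1/query", "/api/v1/rules", "/api/v1/alerts"} {
		fmt.Printf("%s routed to rule-evaluator: %v\n", p, servedByRuleEvaluator(p))
	}
}
```

An explicit allowlist of paths keeps the default behaviour (proxy to the upstream endpoint) unchanged for every other `/api/*` call.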

penekk commented 1 month ago

> > Excellent @yama6a Would this finally be enough to pull the rules into Grafana (from the evaluator), like it is possible with vanilla prometheus?
>
> @penekk Probably yes-ish.
>
> For now, the endpoints will only be available on the rule-evaluator pod, not the prometheus-frontend pod, the latter of which you will probably use as a Prometheus-Datasource at the moment (running PromQL queries, and fetching metrics/labels, etc).

@yama6a Actually the frontend is sort of 'out of the picture' because of datasyncer setup but I get your point.

> A temporary hack could be to add the rule-evaluator pod as a separate Prometheus-Datasource in Grafana (once this is released), but I hope to make a PR for a proper solution soon.

That's what I was planning to do anyway :) (after seeing your previous contribution from a while back)

> Currently, the frontend will blindly proxy all calls to /api/* to Google-Monitoring's prometheus endpoint (https://monitoring.googleapis.com/v1/projects/PROJECT_ID/location/global/prometheus), which doesn't work for these two new endpoints. I'm working on fixing that, and hope to have a solution soon. Hopefully before the next release.

Good to know. If there is any way I can help with the whole thing (testing and whatnot; I'm not that versed in Go yet, although this might be a good opportunity to change that :grin:), just let me know.

And thank you a lot for working on this.

bwplotka commented 1 month ago

> Currently, the frontend will blindly proxy all calls to /api/* to Google-Monitoring's prometheus endpoint (https://monitoring.googleapis.com/v1/projects/PROJECT_ID/location/global/prometheus), which doesn't work for these two new endpoints. I'm working on fixing that, and hope to have a solution soon. Hopefully before the next release.

Nice! Yea, perhaps it's a good feature for frontend proxy or some separate proxy like thanos-querier or promxy. One day in Cloud Alerting too? 🤞🏽

bwplotka commented 1 month ago

Presubmit error is cryptic but it's detecting files without license:

#30 DONE 21.4s
/home/runner/go/bin/addlicense-v1.1.1 -check -ignore 'third_party/**' -ignore 'vendor/**' .
cmd/rule-evaluator/internal/alerts.go
cmd/rule-evaluator/internal/alerts_test.go
cmd/rule-evaluator/internal/api_test.go
make: *** [Makefile:144: regen] Error 1

yama6a commented 1 month ago

> Presubmit error is cryptic but it's detecting files without license:

@bwplotka Done.

I also opened a PR (https://github.com/GoogleCloudPlatform/prometheus-engine/pull/1162) that adds a potential way to proxy these calls. It has not been tested in an actual cluster though, so it would be good to be extra cautious with the added k8s manifests.

yama6a commented 1 month ago

@penekk

> @yama6a Actually the frontend is sort of 'out of the picture' because of datasyncer setup but I get your point.
>
> > A temporary hack could be to add the rule-evaluator pod as a separate Prometheus-Datasource in Grafana (once this is released), but I hope to make a PR for a proper solution soon.
>
> That's what I was planning to do anyway :) (after seeing your previous contribution from a while back)

If you add the rule-evaluator as a datasource, it will probably work, but I would expect a bunch of UI and log errors, because Grafana is going to try to use it as a query datasource (which of course won't work). Grafana users will also be able to pick it as a datasource for their dashboards, in the Metrics-Explorer, and so on. Not the end of the world of course, and perhaps the lesser evil in your case, just a fair warning.

lyanco commented 1 month ago

Thank you for the contribution!! Should be a nice addition for debugging and general visibility.

Question for you - given that managed collection deploys one rule evaluator per cluster, there's no way to get a global view of your rules with this endpoint. You could only get a global view if there was only one rule evaluator, such as if you are using self-deployed evaluation or if you only use GlobalRules.

Cloud Alerting's promql alerts work the same way as global rules but they can give you a better view in a centralized place. Have you looked into that? Is there some sort of feature deficit there that is stopping you from using it?

yama6a commented 1 month ago

@lyanco

> Question for you - given that managed collection deploys one rule evaluator per cluster, there's no way to get a global view of your rules with this endpoint. You could only get a global view if there was only one rule evaluator, such as if you are using self-deployed evaluation or if you only use GlobalRules.

I see what you mean; I suppose it depends on the individual setup and use-case.

In our case, we have one cluster (and hence one rule-evaluator) with one Grafana, per project. So, the global view is no problem if we pipe the rules via rule-evaluator into Grafana as a datasource. If there were multiple clusters, each with its own rule-evaluator, we could still opt for adding all of those as datasources to a central Grafana, and use that as a place of consolidation to achieve a global view. (Or something else instead of Grafana that ingests multiple rule-evaluators' endpoints).
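
As a rough sketch of what such a consolidation step could look like outside Grafana: the function below merges alert lists collected from several rule-evaluators into one sorted list and tags each alert with where it came from. The `source_evaluator` label name, the cluster names, and the sample alerts are all hypothetical. Fetching the alerts from each evaluator's `/api/v1/alerts` is left out to keep the example self-contained.

```go
package main

import (
	"fmt"
	"sort"
)

// alert is a trimmed-down view of one alert returned by a rule-evaluator.
type alert struct {
	Labels map[string]string
	State  string
}

// mergeAlerts combines per-evaluator alert lists into a single list.
// Each alert gets a "source_evaluator" label (a made-up name) so its origin
// stays visible; input maps are copied rather than modified.
func mergeAlerts(byEvaluator map[string][]alert) []alert {
	var merged []alert
	for source, alerts := range byEvaluator {
		for _, a := range alerts {
			labels := make(map[string]string, len(a.Labels)+1)
			for k, v := range a.Labels {
				labels[k] = v
			}
			labels["source_evaluator"] = source
			merged = append(merged, alert{Labels: labels, State: a.State})
		}
	}
	// Sort by source, then alert name, so the output order is deterministic
	// despite Go's randomized map iteration.
	sort.Slice(merged, func(i, j int) bool {
		si, sj := merged[i].Labels["source_evaluator"], merged[j].Labels["source_evaluator"]
		if si != sj {
			return si < sj
		}
		return merged[i].Labels["alertname"] < merged[j].Labels["alertname"]
	})
	return merged
}

func main() {
	// Hypothetical data standing in for responses from two rule-evaluators.
	byEvaluator := map[string][]alert{
		"cluster-b": {{Labels: map[string]string{"alertname": "DiskFull"}, State: "firing"}},
		"cluster-a": {
			{Labels: map[string]string{"alertname": "HighLatency"}, State: "pending"},
			{Labels: map[string]string{"alertname": "CrashLoop"}, State: "firing"},
		},
	}
	for _, a := range mergeAlerts(byEvaluator) {
		fmt.Printf("%s %s %s\n", a.Labels["source_evaluator"], a.Labels["alertname"], a.State)
	}
}
```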

I'm not sure how using only GlobalRules would help, as you mentioned. Could you elaborate how this would affect a setup with multiple clusters? Wouldn't we still have multiple rule-evaluators which ingest the GlobalRules from the different clusters?

> Cloud Alerting's promql alerts work the same way as global rules but they can give you a better view in a centralized place. Have you looked into that? Is there some sort of feature deficit there that is stopping you from using it?

We did look into it, and it essentially boils down to developer experience, with GlobalRules adhering to the de facto industry standard of PrometheusRule resources (=> familiarity + ease of migration), and an IaC decision of Terraform vs. Kubernetes or Atlantis vs. ArgoCD. The full answer here is a bit more involved.

Do you have a Slack or a similar communication channel where I could run it by you, if you want? We could also use it to see what architectural solution could be beneficial to everyone using GMP (with reference to https://github.com/GoogleCloudPlatform/prometheus-engine/pull/1162). I would be happy to contribute more to this project.

lyanco commented 1 month ago

@yama6a

Got it, thanks for the explanation. If you have 1:1:1 cluster:project:grafana then yes this should work for you. If you have more than one cluster, you'd have to have multiple datasources in Grafana connecting to multiple rule-evaluators to show you all your rules, and it wouldn't be a single pane of glass (you'd have to tab between datasources).

In this scenario, designating one cluster as your Rules cluster and only using GlobalRules would allow you to connect Grafana to that rule-evaluator only and get a single pane of glass that way. But then you lose the tenancy benefits of Rules/ClusterRules resources. It wouldn't be recommended. FWIW if you use GlobalRules we recommend only putting them in one rule-evaluator, otherwise you end up getting collisions.

Noted on the cloud alerting experience, that makes sense. We're not getting rid of the cluster-local rule path, although we are investing in it less.

We made a chat room in Google Chat for you... I think bartek might have invited you?