GoogleCloudPlatform / prometheus-engine

Google Cloud Managed Service for Prometheus libraries and manifests.
https://g.co/cloud/managedprometheus
Apache License 2.0

secrets: Improve debuggability & reliability of misconfigured *Monitoring CRs with secrets. #917

Open bwplotka opened 5 months ago

bwplotka commented 5 months ago

(This relates to the unreleased feature from PR https://github.com/GoogleCloudPlatform/prometheus-engine/pull/776)

When the secret is configured in, e.g., a PodMonitoring but is not found by Prometheus, we get a nice Target Page error:

image

Hopefully this works with the Target Status feature too. I think it does not fail the Prometheus config apply, but I didn't check.

However, when the user forgets to add permissions for an existing, correctly referenced secret, Prometheus scrape config parsing (and reloading) fails: we get a cryptic "unknown" error and the status page shows 401 Unauthorized.

Full log:

{"caller":"main.go:1326","err":"unable to watch secret default/go-synthetic-basic-auth: unknown (get secrets)","level":"error","msg":"Failed to apply configuration","ts":"2024-03-26T21:24:20.265Z"}
{"caller":"main.go:1043","err":"one or more errors occurred while applying the new configuration (--config.file=\"/prometheus/config_out/config.yaml\")","level":"error","msg":"Error reloading config","ts":"2024-03-26T21:24:20.266Z"}
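
To make the first log line easier to act on, here is a minimal Go sketch that pulls the secret namespace, name, and failure reason out of such a line. It only assumes the JSON fields and the "unable to watch secret <namespace>/<name>: <reason>" wording visible in the log above; `logEntry`, `secretWatchErr`, and `ParseSecretWatchFailure` are hypothetical names for illustration, not part of Prometheus or prometheus-engine.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// logEntry mirrors only the fields visible in the log lines quoted in this issue.
type logEntry struct {
	Caller string `json:"caller"`
	Err    string `json:"err"`
	Level  string `json:"level"`
	Msg    string `json:"msg"`
	TS     string `json:"ts"`
}

// secretWatchErr matches "unable to watch secret <namespace>/<name>: <reason>".
// The wording is taken from the log above and may differ in other versions.
var secretWatchErr = regexp.MustCompile(`^unable to watch secret ([^/\s]+)/([^:\s]+): (.*)$`)

// ParseSecretWatchFailure is a hypothetical helper: it decodes one JSON log line
// and reports which secret could not be watched, and the reason given.
func ParseSecretWatchFailure(line string) (namespace, name, reason string, ok bool) {
	var e logEntry
	if err := json.Unmarshal([]byte(line), &e); err != nil {
		return "", "", "", false
	}
	m := secretWatchErr.FindStringSubmatch(e.Err)
	if m == nil {
		return "", "", "", false
	}
	return m[1], m[2], m[3], true
}

func main() {
	line := `{"caller":"main.go:1326","err":"unable to watch secret default/go-synthetic-basic-auth: unknown (get secrets)","level":"error","msg":"Failed to apply configuration","ts":"2024-03-26T21:24:20.265Z"}`
	if ns, name, reason, ok := ParseSecretWatchFailure(line); ok {
		fmt.Printf("secret %s/%s could not be watched: %s\n", ns, name, reason)
	}
}
```

A check like this could sit in a debugging script to point the user at the exact secret whose permissions are worth inspecting.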

The consequences of a failed config reload are not as bad as I initially thought: only the per-reloader, per-job functionality gets stopped in some state. Still, perhaps there is a way to show a consistent status page error instead of failing the apply.

I have a GKE cluster ready with your changes applied (I will keep it running for some time) if you want to check, e.g. @TheSpiritXIII.

AC

Nice to have: