CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Prometheus "up" metric shows repo-host and pgadmin pods as down #3650

Open mzwettler2 opened 1 year ago

mzwettler2 commented 1 year ago

Questions

We have deployed PGO into our existing K8s environment, where Prometheus and Grafana were already installed.

Everything is fine except that the Prometheus "up" metric shows all "repo-host" and "pgadmin" pods as down, even though they are running properly.

For example, the query "up{pod="cmp-pgcluster-repo-host-0"}" returns 0.

This happens for each "repo-host" and "pgadmin" pod.

Question: Any idea?

Environment

  • Platform: Anthos
  • Platform Version: 1.10 (afaik)
  • PGO Image Tag: postgres-operator:ubi8-5.3.0-0
  • Postgres Version: 14
  • Storage: VMware CSI
  • Number of Postgres clusters: several

jmckulk commented 1 year ago

I installed Prometheus using the Crunchy monitoring stack and do not see values for the up metric for either the repo-host or pgadmin pods. In our Prometheus configuration, we drop pods that we are not interested in. If I remove our configuration and view all pods, I see the issue that you mention.

The up metric is one of the time series that Prometheus generates automatically for every scrape target: https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series

For each instance scrape, Prometheus stores a sample in the following time series:

  • up{job="<job-name>", instance="<instance-id>"}: 1 if the instance is healthy, i.e. reachable, or 0 if the scrape failed.

This means that an up value is generated for everything that gets scraped, and we have limited control over the value that is returned.
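To make the rule above concrete, here is a minimal Python sketch of how an up sample is derived from a scrape attempt. It is only an illustration of the documented semantics, not Prometheus' actual implementation; up_sample, healthy_target and unreachable_target are hypothetical names, and the idea that a running pod may simply not serve metrics on the scraped port is an assumption.

```python
from typing import Callable


def up_sample(scrape: Callable[[], str]) -> int:
    """Return 1 if the scrape attempt succeeds, 0 if it fails for any reason."""
    try:
        scrape()
    except Exception:
        # A refused connection, timeout, or bad response all count as a failed scrape.
        return 0
    return 1


def healthy_target() -> str:
    # Hypothetical target that answers the scrape with a metrics payload.
    return "demo_metric 1\n"


def unreachable_target() -> str:
    # Hypothetical target: the pod is running, but (by assumption) nothing
    # serves metrics on the port that is being scraped.
    raise ConnectionRefusedError("no metrics endpoint on this port")


print(up_sample(healthy_target))
print(up_sample(unreachable_target))
```

The point of the sketch is that up reflects whether the scrape succeeded, not whether the pod's containers are running.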

I think you should be able to update your Prometheus configuration to ignore these pods. The relabel_configs section of a scrape job can be used to drop targets based on their labels. It might look something like this:

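      relabel_configs:
      # Drop the dedicated pgBackRest repo-host pods (matched on label presence).
      - source_labels: [__meta_kubernetes_pod_labelpresent_postgres_operator_crunchydata_com_pgbackrest_dedicated]
        regex: "true"
        action: drop
      # Drop the pgAdmin pods (matched on the value of the role label).
      - source_labels: [__meta_kubernetes_pod_label_postgres_operator_crunchydata_com_role]
        regex: "pgadmin"
        action: drop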
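As a rough way to reason about which pods these two rules would remove, the following Python sketch imitates the relevant relabeling behaviour: Kubernetes label names are sanitized into __meta_kubernetes_pod_label_* and __meta_kubernetes_pod_labelpresent_* labels, source label values are joined with ";", and the regex has to match the whole value. The helper names and the example pod labels (including the "master" role value) are assumptions for illustration, not taken from a real cluster.

```python
import re

PREFIX = "postgres-operator.crunchydata.com/"

# The two drop rules from the configuration above: (source_labels, regex).
DROP_RULES = [
    (["__meta_kubernetes_pod_labelpresent_postgres_operator_crunchydata_com_pgbackrest_dedicated"], "true"),
    (["__meta_kubernetes_pod_label_postgres_operator_crunchydata_com_role"], "pgadmin"),
]


def sanitize(label_name: str) -> str:
    """Replace characters that are not valid in a Prometheus label name with '_'."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", label_name)


def meta_labels(pod_labels: dict) -> dict:
    """Build the discovery meta labels for a pod's Kubernetes labels."""
    out = {}
    for name, value in pod_labels.items():
        key = sanitize(name)
        out["__meta_kubernetes_pod_label_" + key] = value
        out["__meta_kubernetes_pod_labelpresent_" + key] = "true"
    return out


def is_dropped(pod_labels: dict) -> bool:
    """Return True if any drop rule fully matches the joined source label values."""
    labels = meta_labels(pod_labels)
    for source_labels, regex in DROP_RULES:
        value = ";".join(labels.get(name, "") for name in source_labels)
        if re.fullmatch(regex, value):
            return True
    return False


# Hypothetical label sets for the three kinds of pods discussed in this thread.
repo_host = {PREFIX + "pgbackrest-dedicated": ""}
pgadmin = {PREFIX + "role": "pgadmin"}
postgres_instance = {PREFIX + "role": "master"}

for pod in (repo_host, pgadmin, postgres_instance):
    print(is_dropped(pod))
```

With these assumed labels, the repo-host and pgAdmin pods are dropped before they are ever scraped, so no up series is produced for them, while the Postgres instance pod is kept.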

If you update your Prometheus configuration, do you still see the pods as down?

mzwettler2 commented 1 year ago

Ignoring these pods would also hide any real problems on them.

That does not seem like a good idea, at least for the repo host.

jmckulk commented 1 year ago

The pgAdmin and repo-host pods are outside the scope of the current monitoring solution. However, we understand why you would want some visibility into these components. We have added a feature enhancement to our backlog to consider these pods.