Open mzwettler2 opened 1 year ago
I installed Prometheus using the Crunchy monitoring stack and do not see values for the up metric for either the repo-host or pgadmin pods. In our Prometheus configuration, we drop pods that we are not interested in. If I remove our configuration and view all pods, I see the issue that you mention.
The up metric is one of Prometheus' automatically generated time series: https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series
For each instance scrape, Prometheus stores a sample in the following time series:
- up{job="<job-name>", instance="<instance-id>"}: 1 if the instance is healthy, i.e. reachable, or 0 if the scrape failed.
This means that an up value is generated for every target that gets scraped, and we have limited control over the value that is returned.
I think you should be able to update your Prometheus configuration to ignore these pods. The relabel_configs section of a scrape config can be used to drop targets based on their labels. It might look something like this:
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_labelpresent_postgres_operator_crunchydata_com_pgbackrest_dedicated]
        regex: "true"
        action: drop
      - source_labels: [__meta_kubernetes_pod_label_postgres_operator_crunchydata_com_role]
        regex: "pgadmin"
        action: drop
If you update your Prometheus configuration, do you still see the pods as down?
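As a rough illustration of what those two drop rules do, the following Python sketch mimics how a drop action is evaluated: the values of the source labels are joined with ";" and the target is discarded when the regex matches the whole joined string. This is a simplified model, not Prometheus' implementation. The function `is_scraped` and the example label values for the Postgres pod are hypothetical.

```python
import re
from typing import Dict, List

# The two rules from the suggested configuration, as plain dictionaries.
RULES: List[dict] = [
    {
        "source_labels": [
            "__meta_kubernetes_pod_labelpresent_postgres_operator_crunchydata_com_pgbackrest_dedicated"
        ],
        "regex": "true",
        "action": "drop",
    },
    {
        "source_labels": [
            "__meta_kubernetes_pod_label_postgres_operator_crunchydata_com_role"
        ],
        "regex": "pgadmin",
        "action": "drop",
    },
]


def is_scraped(labels: Dict[str, str], rules: List[dict] = RULES) -> bool:
    """Return False if any drop rule matches the target's labels."""
    for rule in rules:
        # Missing labels are treated as empty strings.
        value = ";".join(labels.get(name, "") for name in rule["source_labels"])
        # The regex must match the entire joined value, not a substring.
        if rule["action"] == "drop" and re.fullmatch(rule["regex"], value):
            return False
    return True


repo_host = {
    "__meta_kubernetes_pod_labelpresent_postgres_operator_crunchydata_com_pgbackrest_dedicated": "true"
}
pgadmin = {"__meta_kubernetes_pod_label_postgres_operator_crunchydata_com_role": "pgadmin"}
# Hypothetical label value for a regular Postgres instance pod.
postgres = {"__meta_kubernetes_pod_label_postgres_operator_crunchydata_com_role": "replica"}

print(is_scraped(repo_host))  # False
print(is_scraped(pgadmin))    # False
print(is_scraped(postgres))   # True
```

Dropped targets are never scraped, so no up series is produced for them at all, which is also why real failures on those pods would go unnoticed.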
Ignoring these pods would also hide any real problems on them.
That does not seem like a good idea, at least for the repo host.
pgAdmin and the repo-host pods are out of the scope of the current monitoring solution. However, we understand why you would want some visibility into these components. We have submitted a feature enhancement in our backlog to consider these pods.
Questions
We have deployed PGO into our existing K8s environment with Prometheus/Grafana already pre-installed.
Everything is fine except that the Prometheus "up" metric shows all "repo-host" and "pgadmin" pods as down, even though they are running properly.
For example, the query up{pod="cmp-pgcluster-repo-host-0"} returns 0.
This happens for each "repo-host" and "pgadmin" pod.
Question: Any idea?
Environment
Platform: Anthos
Platform Version: 1.10 (afaik)
PGO Image Tag: postgres-operator:ubi8-5.3.0-0
Postgres Version: 14
Storage: VMware CSI
Number of Postgres clusters: several