CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Option to expose metrics endpoint of Patroni on the container #3813

Open PaulVerhoeven1 opened 6 months ago

PaulVerhoeven1 commented 6 months ago

Overview

Patroni listens on port 8008 on localhost inside the container. On this port, the /metrics endpoint serves metrics and the /cluster endpoint reports the status of the cluster (I just tested this inside the pod with curl -k https://localhost:8008/metrics). I don't see a way to expose this port on the container.

Use Case

The endpoint https://127.0.0.1:8008/cluster gives an overview of the current status of the cluster, including the replication lag. We can use this endpoint to monitor whether the cluster is healthy: if the lag is too high and the master node goes down, Patroni can't switch over to the replica with the high lag, which is a possible scenario where the whole cluster goes down. Based on that endpoint we can also create alerting rules for when the lag is too high.
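As a rough illustration of the kind of check described above, here is a minimal Python sketch that reads the /cluster document and lists replicas whose lag is above a threshold. It assumes the response is JSON with a `members` list whose entries carry `name`, `role`, and (for non-leaders) `lag`, as described in the Patroni REST API documentation; the threshold, the member names, and the sample payload are hypothetical.

```python
import json
import ssl
import urllib.request

# URL taken from the issue text; it is only reachable from inside the pod.
CLUSTER_URL = "https://127.0.0.1:8008/cluster"


def fetch_cluster(url: str = CLUSTER_URL) -> dict:
    """Fetch the /cluster document, skipping certificate checks like `curl -k`."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    with urllib.request.urlopen(url, context=ctx, timeout=5) as resp:
        return json.load(resp)


def lagging_replicas(cluster: dict, max_lag_bytes: int) -> list:
    """Return names of non-leader members whose lag exceeds the threshold.

    Members whose lag is missing or not an integer (assumed possible, e.g.
    "unknown") are reported too, since their state cannot be confirmed.
    """
    lagging = []
    for member in cluster.get("members", []):
        if member.get("role") == "leader":
            continue
        lag = member.get("lag")
        if not isinstance(lag, int) or lag > max_lag_bytes:
            lagging.append(member.get("name", "<unnamed>"))
    return lagging


# Hypothetical payload for illustration only; not captured from a real cluster.
SAMPLE_CLUSTER = {
    "members": [
        {"name": "pg-0", "role": "leader", "state": "running"},
        {"name": "pg-1", "role": "replica", "state": "streaming", "lag": 0},
        {"name": "pg-2", "role": "replica", "state": "streaming", "lag": 5000000},
        {"name": "pg-3", "role": "replica", "state": "running", "lag": "unknown"},
    ]
}

if __name__ == "__main__":
    print(lagging_replicas(SAMPLE_CLUSTER, max_lag_bytes=1024 * 1024))
```

Inside the pod, `lagging_replicas(fetch_cluster(), max_lag_bytes=...)` would run the same check against the live endpoint. Running it from outside the pod is exactly what needs the port to be exposed.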

Besides that, we can scrape the metrics endpoint and use a Grafana dashboard to give clear insight into the status of Patroni. For example, this dashboard: https://grafana.com/grafana/dashboards/18870-postgresql-patroni/.
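To show what a scraper would see on /metrics, here is a small Python sketch that parses Prometheus-style text into a dictionary. It assumes each sample line has the form `name{labels} value` with no trailing timestamp. The metric names and labels in the sample are illustrative only and are not copied from a real Patroni response.

```python
def parse_metrics(text: str) -> dict:
    """Map each 'name{labels}' series to its float value.

    Comment lines (# HELP / # TYPE) and lines whose last field is not a
    number are skipped.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is assumed to be the last space-separated field.
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue
    return samples


# Hypothetical sample for illustration only.
SAMPLE_METRICS = """\
# HELP patroni_postgres_running Value is 1 if Postgres is running, 0 otherwise.
# TYPE patroni_postgres_running gauge
patroni_postgres_running{scope="demo",name="pg-0"} 1
patroni_xlog_location{scope="demo",name="pg-0"} 22320
"""

if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE_METRICS).items():
        print(series, value)
```

In practice Prometheus would do this parsing itself. The sketch is only meant to show the shape of the data the dashboard would consume.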

Desired Behavior

An option to expose this port on the PostgresCluster.

Something like this (or something else): postgrescluster.spec.patroni.enableport: true

Or expose this port by default in every PostgresCluster.

dsessler7 commented 6 months ago

Hello @PaulVerhoeven1!

I can create a feature request in our backlog to have this port exposed. Can you tell me a bit more about your use case?

We actually have alerting and trending of replication lag in our monitoring stack, so you are already covered there. Are there other metrics that these endpoints provide that you are particularly interested in?

PaulVerhoeven1 commented 6 months ago

I want to monitor the total current lag of Patroni. As far as I can see, the replication lag in the monitoring stack doesn't show the total lag; in that dashboard I only saw spikes at times when there was a bit of lag.