Open mausch opened 8 months ago
Hey @mausch, I was able to replicate the issue you are seeing with a full disk. After the disk filled up, Postgres stopped working, but the pod was still ready.
We use the Patroni `GET /readiness` endpoint to determine whether the pod is ready or not. If you take a look at the Patroni API docs for that endpoint, it returns 200 when "the Patroni node is running as the leader OR when PostgreSQL is up and running." With a single instance, when the database goes down there is no other instance to take over leadership if the current leader becomes unhealthy. As you found, this leads to a state where the database is inaccessible but the pods are still ready.
As it stands now, if Patroni thinks the cluster is ready, then the pods will be ready. There is likely some work we could do to augment the Patroni readiness endpoint, and I would be happy to get a story into our backlog for it.
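The semantics described above can be sketched as a tiny model (hypothetical function, not Patroni's actual code) showing why a single-instance cluster stays "ready" even when the database is down:

```python
# Model of the described /readiness semantics: 200 when the node is the
# leader OR when PostgreSQL is up and running, 503 otherwise.
def readiness_status(is_leader: bool, postgres_running: bool) -> int:
    return 200 if (is_leader or postgres_running) else 503

# A single-instance cluster keeps leadership even when Postgres is down,
# so the pod remains "ready" despite the database being inaccessible:
print(readiness_status(is_leader=True, postgres_running=False))  # 200
```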
Hi, thanks for looking into this.
Shouldn't the liveness probe (rather than readiness) apply here?
The Patroni docs say:
- `GET /health`: returns HTTP status code 200 only when PostgreSQL is up and running.
- `GET /liveness`: returns HTTP status code 200 if the Patroni heartbeat loop is properly running and 503 if the last run was more than `ttl` seconds ago on the primary or `2*ttl` on the replica. Could be used for `livenessProbe`.
- `GET /readiness`: returns HTTP status code 200 when the Patroni node is running as the leader or when PostgreSQL is up and running. The endpoint could be used for `readinessProbe` when it is not possible to use Kubernetes endpoints for leader elections (OpenShift).
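Mapping those endpoints onto Kubernetes probes might look roughly like this (a sketch, not the operator's actual pod spec; 8008 is Patroni's default REST API port, and the timing values are illustrative):

```yaml
# Hypothetical probe configuration for a Patroni-managed container.
livenessProbe:
  httpGet:
    path: /liveness   # 200 while the Patroni heartbeat loop is healthy
    port: 8008
  periodSeconds: 10
  failureThreshold: 6
readinessProbe:
  httpGet:
    path: /readiness  # 200 when leader OR PostgreSQL is up
    port: 8008
  periodSeconds: 10
```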
I don't have a pgo cluster at hand to check, but presumably Patroni's `/liveness` is already mapped to the pod's liveness probe?
Has this issue been resolved? In my case, after restarting the physical machine, PostgreSQL restarts but the database container fails to self-heal and continually shows a "no response" status, and the probes never restart the container. Your situation seems very similar to mine. @mausch
Overview
I'm running a trivial CrunchyData instance with one primary. It ran out of disk space, possibly due to https://github.com/CrunchyData/postgres-operator/issues/2531, though that is not relevant to this issue. Because of this, the postgres pod is stuck in a loop displaying this:
i.e. Postgres isn't running at all and I can't connect to it. The problems are:
In short, Postgres is broken, but the control plane (or whatever you want to call it) is not aware of it.
Environment
ubi8-16.0-3.4-0