CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/

No disk space crashloop but pod healthy #3788

Status: Open · mausch opened this issue 8 months ago

mausch commented 8 months ago

Overview

I'm running a trivial CrunchyData instance with 1 primary. It ran out of disk space, possibly due to https://github.com/CrunchyData/postgres-operator/issues/2531, but that is not relevant to this issue. Because of this, the postgres pod is stuck in a loop, repeatedly logging:

2023-11-27 16:43:12,175 INFO: Lock owner: ; I am postgres-instance1-mqhs-0
2023-11-27 16:43:12,175 INFO: not healthy enough for leader race
2023-11-27 16:43:12,176 INFO: doing crash recovery in a single user mode in progress

i.e. Postgres isn't running at all and I can't connect to it.

In short, Postgres is broken, but the control plane (or whatever you want to call it) is not aware of it: the pod still reports as healthy, so nothing restarts it.
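For what it's worth, the mismatch is easy to observe by querying Patroni's REST API directly. A minimal sketch (Python), assuming Patroni's API is on its default port 8008 and reachable on localhost, e.g. via a port-forward to the instance pod from my logs above:

```python
# Sketch: compare Patroni's readiness/liveness/health endpoints for a
# single-instance cluster whose Postgres is down. Assumes the Patroni REST API
# is reachable on localhost:8008, e.g. via
#   kubectl port-forward pod/postgres-instance1-mqhs-0 8008:8008
import urllib.error
import urllib.request

BASE = "http://localhost:8008"

def status(path: str) -> int:
    """Return the HTTP status code Patroni answers for the given endpoint."""
    try:
        with urllib.request.urlopen(f"{BASE}{path}", timeout=5) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # Patroni signals "not ok" with 503, which urllib raises

for endpoint in ("/readiness", "/liveness", "/health"):
    print(f"GET {endpoint} -> {status(endpoint)}")
```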


jmckulk commented 7 months ago

Hey @mausch, I was able to replicate the issue you are seeing with a full disk. After the disk filled up, Postgres stopped working, but the pod was still ready.

We use the Patroni GET /readiness endpoint to determine whether the pod is ready. If you look at the Patroni API docs for that endpoint, it returns 200 when "the Patroni node is running as the leader OR when PostgreSQL is up and running." In a single-instance cluster, when the database goes down there is no other instance to take over leadership from the unhealthy leader. As you have found, this gets into a case where the database is inaccessible but the pod is still ready.

As it stands now, if Patroni thinks the cluster is ready, then the pods will be ready. There is likely some work we could do to augment the Patroni readiness endpoint, and I'll be happy to add a story to our backlog.
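To spell out why a single-instance cluster gets stuck here, the documented readiness rule reduces to roughly the following. This is only a sketch of the rule as described above, not the actual Patroni or operator code:

```python
# Sketch of the documented /readiness rule: ready when the node is the leader
# OR PostgreSQL is up. With only one instance there is nothing to take over
# leadership, so readiness can keep passing while the database itself is down,
# which is why the pod's readinessProbe never fails in this scenario.
def patroni_ready(running_as_leader: bool, postgres_running: bool) -> bool:
    return running_as_leader or postgres_running

# Single primary after the disk filled up: still the leader, Postgres down.
print(patroni_ready(running_as_leader=True, postgres_running=False))  # True
```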

mausch commented 7 months ago

Hi, thanks for looking into this.

Shouldn't the liveness probe (rather than readiness) apply here?

The Patroni docs say:

GET /health: returns HTTP status code 200 only when PostgreSQL is up and running.

GET /liveness: returns HTTP status code 200 if Patroni heartbeat loop is properly running and 503 if the last run was more than ttl seconds ago on the primary or 2*ttl on the replica. Could be used for livenessProbe.

GET /readiness: returns HTTP status code 200 when the Patroni node is running as the leader or when PostgreSQL is up and running. The endpoint could be used for readinessProbe when it is not possible to use Kubernetes endpoints for leader elections (OpenShift).

I don't have a PGO cluster at hand to check, but presumably Patroni's /liveness is already mapped to the pod's liveness probe?
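If someone with a running cluster wants to confirm, something like the following would show which probes the instance pod's containers actually carry and which Patroni paths they hit. This is a hypothetical check using the official Kubernetes Python client; the pod name and namespace are placeholders for your own cluster:

```python
# Hypothetical probe inspection using the official `kubernetes` Python client.
# Pod name and namespace below are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

pod = v1.read_namespaced_pod(name="postgres-instance1-mqhs-0",
                             namespace="postgres-operator")

for container in pod.spec.containers:
    for kind, probe in (("liveness", container.liveness_probe),
                        ("readiness", container.readiness_probe)):
        if probe and probe.http_get:
            print(f"{container.name} {kind}: "
                  f"{probe.http_get.path} on port {probe.http_get.port}")
        else:
            print(f"{container.name} {kind}: {probe}")
```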

HuangQAQ commented 1 month ago

Has this issue been resolved? In my case, after a physical machine is restarted and PostgreSQL restarts, the database container fails to self-heal and continually shows a "no response" status, and the probes never restart the container. Your situation seems very similar to mine. @mausch