
CloudNativePG is a comprehensive platform designed to seamlessly manage PostgreSQL databases within Kubernetes environments, covering the entire operational lifecycle from initial deployment to ongoing maintenance.
https://cloudnative-pg.io
Apache License 2.0

[Bug]: "Cluster is in healthy state" despite 0 running pods #5150

Open usernamedt opened 1 month ago

usernamedt commented 1 month ago

Is there an existing issue already for this bug?

I have read the troubleshooting guide

I am running a supported version of CloudNativePG

Contact Details

No response

Version

1.23.2

What version of Kubernetes are you using?

1.27

What is your Kubernetes environment?

Cloud: Other

How did you install the operator?

YAML manifest

What happened?

Hi! I am trying to check how well the operator detects different Postgres malfunctions. Recently I tried to simulate a scenario where Postgres fails to start:

➜  ~ kubectl exec --stdin --tty postgresql-cluster-1 -- /bin/bash

Defaulted container "postgres" out of: postgres, bootstrap-controller (init)
postgres@postgresql-cluster-1:/$ mv /var/lib/postgresql/data/pgdata/global/pg_control /var/lib/postgresql/data/pgdata/global/pg_control_bck
postgres@postgresql-cluster-1:/$ ps aux | grep postgres
postgres       1  0.1  0.3 1280364 59440 ?       Ssl  12:38   0:04 /controller/manager instance run --log-level=info
postgres      26  0.0  0.2 235368 41892 ?        S    12:38   0:01 postgres -D /var/lib/postgresql/data/pgdata
postgres      27  0.0  0.0  73816  6348 ?        Ss   12:38   0:00 postgres: postgresql-cluster: logger
postgres      28  0.0  0.1 235624 17588 ?        Ss   12:38   0:00 postgres: postgresql-cluster: checkpointer
postgres      29  0.0  0.0 235592  6732 ?        Ss   12:38   0:00 postgres: postgresql-cluster: background writer
postgres      32  0.0  0.0 235368 11480 ?        Ss   12:38   0:00 postgres: postgresql-cluster: walwriter
postgres      33  0.0  0.0 237244  9392 ?        Ss   12:38   0:00 postgres: postgresql-cluster: autovacuum launcher
postgres      34  0.0  0.0 235472  8188 ?        Ss   12:38   0:00 postgres: postgresql-cluster: archiver last was 000000010000000000000005
postgres      35  0.0  0.0 237176  9472 ?        Ss   12:38   0:00 postgres: postgresql-cluster: logical replication launcher
postgres    2942  0.0  0.0   7636  4324 pts/0    Ss   13:27   0:00 /bin/bash
postgres    2980  0.0  0.0  10072  1592 pts/0    R+   13:28   0:00 ps aux
postgres    2981  0.0  0.0   6480  2340 pts/0    S+   13:28   0:00 grep postgres
postgres@postgresql-cluster-1:/$ kill -9 26
postgres@postgresql-cluster-1:/$ command terminated with exit code 137
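
For the record, the same fault injection can be scripted instead of run interactively. A minimal sketch, assuming the pod and container names above and the default CNPG PGDATA layout; the postmaster PID should be looked up at run time rather than hard-coded:

kubectl exec postgresql-cluster-1 -c postgres -- \
  mv /var/lib/postgresql/data/pgdata/global/pg_control \
     /var/lib/postgresql/data/pgdata/global/pg_control_bck
# kill the postmaster (PID 26 in the listing above; the first line of
# postmaster.pid holds the current postmaster PID)
kubectl exec postgresql-cluster-1 -c postgres -- \
  bash -c 'kill -9 $(head -1 /var/lib/postgresql/data/pgdata/postmaster.pid)'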

Even though I now have a fully broken cluster, the cluster phase does not reflect that:

➜  ~ kubectl cnpg  status postgresql-cluster
Cluster Summary
Name:                postgresql-cluster
Namespace:           ***
PostgreSQL Image:    ***
Primary instance:    postgresql-cluster-1
Primary start time:  2024-07-23 15:32:04 +0000 UTC (uptime 21h58m59s)
Status:              Cluster in healthy state
Instances:           1
Ready instances:     0

Certificates Status
Certificate Name                Expiration Date                Days Left Until Expiration
----------------                ---------------                --------------------------
cafile                          2030-12-31 09:37:37 +0000 UTC  2350.84
cert-postgres-secret            2025-07-23 10:00:00 +0000 UTC  363.85
postgresql-cluster-ca           2024-10-21 15:26:18 +0000 UTC  89.08
postgresql-cluster-replication  2024-10-21 15:26:18 +0000 UTC  89.08

Continuous Backup status
First Point of Recoverability:  2024-07-23T15:32:24Z
No Primary instance found
Physical backups
Primary instance not found

Streaming Replication status
Not configured

Unmanaged Replication Slot Status
No unmanaged replication slots found

Managed roles status
No roles managed

Tablespaces status
No managed tablespaces

Pod Disruption Budgets status
Name                        Role     Expected Pods  Current Healthy  Minimum Desired Healthy  Disruptions Allowed
----                        ----     -------------  ---------------  -----------------------  -------------------
postgresql-cluster-primary  primary  1              0                1                        0

Instances status
Name                  Database Size  Current LSN  Replication role  Status             QoS         Manager Version  Node
----                  -------------  -----------  ----------------  ------             ---         ---------------  ----
postgresql-cluster-1  -              -            -                 pod not available  Guaranteed  -                xxx
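
The same contradiction is visible on the Cluster resource itself. A minimal sketch for comparing the reported phase with the ready-instance count (field names taken from the CNPG Cluster status; verify them against your CRD version):

kubectl get cluster postgresql-cluster \
  -o jsonpath='{.status.phase}{" / ready: "}{.status.readyInstances}{"\n"}'
# expected to print the healthy phase next to a 0 (or empty, if the
# zero-valued field is omitted from the serialized status)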

The operator logs contain an endless stream of pod connection errors:

{"level":"info","ts":"2024-07-24T13:32:51Z","msg":"Cannot extract Pod status","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"postgresql-cluster","namespace":"***"},"namespace":"***","name":"postgresql-cluster","reconcileID":"87e29fe9-4ffe-412d-a96b-b40108c4f1c3","name":"postgresql-cluster-1","error":"Get \"http://10.113.49.217:8000/pg/status\": dial tcp 10.113.49.217:8000: connect: connection refused"}

From my point of view, the expected behavior in such a case is for the cluster phase/conditions to reflect the actual degraded state of the cluster.
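
For reference, the conditions can already be dumped directly; in this scenario they stay green even though no instance is ready. A sketch (condition type names vary by operator version, so treat the output as illustrative):

kubectl get cluster postgresql-cluster \
  -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'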

Cluster resource

No response

Relevant log output

No response

Code of Conduct

gbartolini commented 1 month ago

Can you please try with the latest snapshot? See https://cloudnative-pg.io/documentation/current/installation_upgrade/#testing-the-latest-development-snapshot
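
For anyone else reproducing this, trying the snapshot amounts to applying its manifest and waiting for the operator rollout. A sketch with the manifest URL left as a placeholder (take the real one from the page linked above); the deployment and namespace names are the defaults of the YAML-manifest install:

kubectl apply --server-side -f <development-snapshot-manifest-url>
kubectl rollout status deployment -n cnpg-system cnpg-controller-manager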