PGO no primary instance with error Readiness probe failed: HTTP probe failed with statuscode: 503

modigithub commented 1 year ago

Hello, at the moment I'm not sure if it's a bug. I installed pgo in my cluster exactly as instructed. I also modified the postgres.yml a bit. the last part why i'm posting a bug report here is that i can't connect to the svc hippo primary.

Overview

I have pgo according to instructions: https://access.crunchydata.com/documentation/postgres-operator/latest/quickstart/

installed in my cluster. Essentially everything worked. I had to provide some PV. I implemented pgAdmin to access the cluster. That's when I first noticed that postgresql instance didn't work

At the beginning I thought that the services didn't work here. However, everything is fine.

After that I checked the pods. Then I noticed that the instances were not completed.

Environment

Platform: (Kubernetes (Kubespray), Debian, Bare Metal,)
Kubernetes Version: 1.25.4
PGO Image Tag: ubi8-5.3.0-0
Postgres Version: 14
Storage: hostpath

Steps to Reproduce

Installation according to the instructions: Step 1: Download the examples Step 2: Install PGO, the Postgres operator Adaptation of the postgres.yaml: apiVersion: postgres-operator.crunchydata.com/v1beta1 kind: PostgresCluster metadata: name: hippo spec: image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-14.6-2 postgresVersion: 14 monitoring: pgmonitor: exporter: image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-5.3.0-0 instances:

name: instance1 replicas: 2 dataVolumeClaimSpec: accessModes:
- "ReadWriteOnce" resources: requests: storage: 1Gi backups: pgbackrest: image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.41-2 repos:
  - name: repo1 volume: volumeClaimSpec: accessModes:
    - "ReadWriteOnce" resources: requests: storage: 1Gi userInterface: pgAdmin: image: registry.developers.crunchydata.com/crunchydata/crunchy-pgadmin4:ubi8-4.30-4 dataVolumeClaimSpec: accessModes:
- "ReadWriteOnce" resources: requests: storage: 1Gi

REPRO

An error does not appear. (apart from not being able to connect) A hint lies in the pod event describtion: kubectl describe pod -n postgres-operator hippo-instance1-hd2m-0

EXPECTED

I want to be able to connect to the Postgres cluster.

ACTUAL

I can't connect due to the error message above.

Logs

I found out that something is wrong with the log. He obviously can't connect to anything here. I have neither any firewall rules nor blockades. So I can't say exactly why he's blocking something 2023-01-05 09:41:44,512 INFO: Lock owner: None; I am hippo-instance1-hd2m-0 2023-01-05 09:41:44,512 INFO: not healthy enough for leader race 2023-01-05 09:41:44,512 INFO: restarting after failure in progress /tmp/postgres:5432 - no response 2023-01-05 09:41:54,512 WARNING: Postgresql is not running. 2023-01-05 09:41:54,512 INFO: Lock owner: None; I am hippo-instance1-hd2m-0 2023-01-05 09:41:54,517 INFO: pg_controldata: pg_control version number: 1300 Catalog version number: 202107181 Database system identifier: 7184294301444526168 Database cluster state: shut down in recovery pg_control last modified: Thu Jan 5 09:18:38 2023 Latest checkpoint location: 0/C000180 Latest checkpoint's REDO location: 0/C000180 Latest checkpoint's REDO WAL file: 0000000C000000000000000C Latest checkpoint's TimeLineID: 12 Latest checkpoint's PrevTimeLineID: 12 Latest checkpoint's full_page_writes: on Latest checkpoint's NextXID: 0:749 Latest checkpoint's NextOID: 32768 Latest checkpoint's NextMultiXactId: 1 Latest checkpoint's NextMultiOffset: 0 Latest checkpoint's oldestXID: 726 Latest checkpoint's oldestXID's DB: 1 Latest checkpoint's oldestActiveXID: 0 Latest checkpoint's oldestMultiXid: 1 Latest checkpoint's oldestMulti's DB: 1 Latest checkpoint's oldestCommitTsXid: 0 Latest checkpoint's newestCommitTsXid: 0 Time of latest checkpoint: Tue Jan 3 10:34:34 2023 Fake LSN counter for unlogged rels: 0/3E8 Minimum recovery ending location: 0/C0020D0 Min recovery ending loc's timeline: 12 Backup start location: 0/0 Backup end location: 0/0 End-of-backup record required: no wal_level setting: logical wal_log_hints setting: on max_connections setting: 100 max_worker_processes setting: 8 max_wal_senders setting: 10 max_prepared_xacts setting: 0 max_locks_per_xact setting: 64 track_commit_timestamp setting: off Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Size of a large-object chunk: 2048 Date/time type storage: 64-bit integers Float8 argument passing: by value Data page checksum version: 1 Mock authentication nonce: f7bbed266ef770855f400b49b8a10707c7397a9ef3cb7ee24f56f6f1f286c4e5

2023-01-05 09:41:54,529 INFO: Lock owner: None; I am hippo-instance1-hd2m-0 2023-01-05 09:41:54,623 INFO: starting as a secondary 2023-01-05 09:41:54.819 UTC [4988] LOG: pgaudit extension initialized 2023-01-05 09:41:54,824 INFO: postmaster pid=4988 2023-01-05 09:41:54.831 UTC [4988] LOG: redirecting log output to logging collector process 2023-01-05 09:41:54.831 UTC [4988] HINT: Future log output will appear in directory "log". /tmp/postgres:5432 - no response /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections 2023-01-05 09:42:04,513 INFO: Lock owner: None; I am hippo-instance1-hd2m-0 2023-01-05 09:42:04,513 INFO: not healthy enough for leader race 2023-01-05 09:42:04,566 INFO: restarting after failure in progress /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections /tmp/postgres:5432 - rejecting connections

I only copied a small part here, as the areas are repeated.

Additional Information

I've been trying to solve the problem for 2 days now. Unfortunately without success. I've restarted the nodes several times and also deleted the pods quite often. I have also deleted and reinstalled pgo several times.

As already mentioned, I'm missing the primary instance. I think it has to do with that:

I hope that someone can help me. Unfortunately I'm at the end of my ideas now

In addition, the log writes me the following message: 2023-01-05 09:41:54,512 WARNING: Postgresql is not running.

I didn't find anything about it in the manual. I think it's not intended that I have to install and start postgres manually?

benjaminjb commented 1 year ago

I want to do a little experimenting over here, but also: any idea why your pgo and pgo-upgrade restarted 17 times in the last 25 hours?

benjaminjb commented 1 year ago

Also, first thought: I am curious to hear more about "I had to provide some PV" -- I don't see anything in the postgres.yaml that you provided about that, so my first thought there is: "if you're reusing storage, is the storage blank?"

modigithub commented 1 year ago

I have no idea why it starts so often. Someone else would have to tell us. As for the PV issue. I can say the following about this: At the beginning my instance stayed in a "Pending" status. I found the following in the documentation: https://access.crunchydata.com/documentation/postgres-operator/latest/tutorial/create-cluster/ So I expanded the configuration a bit. I extended the Kustomization.yaml configuration file to pv.yaml. Kustomization.yaml: pv.yaml:

The PV configuration was also only created for test purposes. Of course, this must be adapted for the productive environment. But I didn't care at the moment. I just wanted it to work.

However, that has nothing to do with my error/problem. So I would ask for ideas on my problem. Otherwise, please open your own issue.

benjaminjb commented 1 year ago

Hi @modigithub, to clear up the confusion, I'm part of the team at Crunchy Data responsible for the operator, so my questions were meant to debug this problem that you're experiencing, not to report my own issue.

When I install pgo and create a basic postgrescluster locally (in my dev env),

the pgo pod starts up and doesn't keep restarting;
the pods for the postgrecluster start up.

So as you identified, looks like postgres isn't starting up correctly. If I were you in your position, I would start with a radically simpler postgres.yaml just to eliminate other factors: say, replicas: 1, no monitor, no pgadmin.

But also: I would be curious to see what the logs of the pgo- and pgo-upgrade- pod look like -- those are the managers for the postgrescluster CRD (and the pgupgrade CRD), so I'm curious what's making them restart. Might be nothing, but just curious.

modigithub commented 1 year ago

I understood. Many thanks for the help. After another 2 whole days, I removed and uninstalled the entire cluster piece by piece. In the end I found the error. But I can't say exactly what the problem was.

I tried a lot back and forth at the very beginning. Something went wrong when saving the data in the pv. the solution was to delete the physical folders on the worker. After pgo recreated the folder, everything worked.

While it wasn't a very cool experience, I understand the architecture of pgo very well now. So the problem is solved.

Another question arose during this test. I have the option of specifying replicas for the instances. in my case i have 2 replicas. Is the second instance a standalone readonly replicated database managed by pgo. Or is it a placeholder in case the primary database fails. (for example if the server with the primary instance goes down)

It would be cool if you could answer that for me. You can then set the issue to closed.

benjaminjb commented 1 year ago

I think this page from the docs may help answer questions about the primary/secondary instances: https://access.crunchydata.com/documentation/postgres-operator/v5/architecture/high-availability/

Though you might also be interested in the test #2 on this page: https://access.crunchydata.com/documentation/postgres-operator/v5/tutorial/high-availability/#testing-your-ha-cluster

Hope that answers your questions!

CrunchyData / postgres-operator