PGO v5.3.1, shutdown master DB server, new master in recovery state for 20 min

Problem description

PGO v5.3.1, three worker nodes run with 3 postgres DB pods + 1 local backup repository so the local backup repository and one of the 3 DB pods must be running on the same worker node

When I shutdown the worker node where the master Postgres DB is running (to simulate power outage or network issue), if the backup repo pod also runs on this worker node, then the backup repo will be out of service also. In this situation, the new promoted mater will stay in recovery mode for 20 min.

My question is whether this 20 min can be reduced?

I understand that the best solution is to have one more worker node so postgres DB pods and backup repo pod can be deployed to independent worker nodes then shutting down the worker node with master DB won't impact backup repo. However, this is not doable in my environment. So the alternative could be: whether I can control the timer or threshold to let the new promoted master DB pod quit earlier from the state of waiting for reconnecting to backup repo?

Environment

Platform: OpenShift
Platform Version: 4.12 (I understand it's not officially supported yet but this problem doesn't look like platform related)
PGO Image Tag: ubi8-5.3.1-0
Postgres Version: 15
Storage: ocs-storagecluster-ceph-rbd

Steps to reproduce the issue

PostgreSQL DB has 3 replicas + 1 local backup repo running on 3 worker nodes

# oc get pods -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME                        READY   STATUS      RESTARTS   AGE     IP            NODE             NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-backup-h6nw-rqqxz   0/1     Completed   0          7m4s    10.130.2.92   w02.xxxxxxxxxx   <none>           <none>
vdmfpgo-instance-848h-0     4/4     Running     0          7m20s   10.128.2.33   w01.xxxxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-jltz-0     4/4     Running     0          7m20s   10.131.0.21   w03.xxxxxxxxxx   <none>           <none>            master
vdmfpgo-instance-kpmd-0     4/4     Running     0          7m20s   10.130.2.91   w02.xxxxxxxxxx   <none>           <none>            replica
vdmfpgo-repo-host-0         2/2     Running     0          7m20s   10.131.0.22   w03.xxxxxxxxxx   <none>           <none>

The master DB (jltz) is running on w03, which has backup repo (vdmfpgo-repo-host-0) running on as well

Shut down the w03 worker node at 14:11:31 UTC, wait a bit and check pods status again:

# oc get pods -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME                        READY   STATUS      RESTARTS   AGE     IP            NODE             NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-backup-h6nw-rqqxz   0/1     Completed   0          8m49s   10.130.2.92   w02.xxxxxxxxxx   <none>           <none>
vdmfpgo-instance-848h-0     4/4     Running     0          9m5s    10.128.2.33   w01.xxxxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-jltz-0     4/4     Running     0          9m5s    10.131.0.21   w03.xxxxxxxxxx   <none>           <none>            master
vdmfpgo-instance-kpmd-0     4/4     Running     0          9m5s    10.130.2.91   w02.xxxxxxxxxx   <none>           <none>            master
vdmfpgo-repo-host-0         2/2     Running     0          9m5s    10.131.0.22   w03.xxxxxxxxxx   <none>           <none>

PODs "STATUS" has not been changed yet (it will take about 5 min, which is normal) but the old DB master pod (jltz) and backup repo (vdmfpgo-repo-host-0) are all unavailable at this point. The new promoted master is vdmfpgo-instance-kpmd-0

# oc exec -it vdmfpgo-instance-kpmd-0 -- bash
Defaulted container "database" out of: database, replication-cert-copy, pgbackrest, pgbackrest-config, postgres-startup (init), nss-wrapper-init (init)
bash-4.4$ psql -h localhost -U hippo vdmfpgo
Password for user hippo:
psql (15.2)
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off)
Type "help" for help.

vdmfpgo=> SELECT pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)

Kept checking the state of the new prompted DB, "pg_is_in_recovery()" turns to "f" after about 20 min.

Check the logs on the new master DB pod (kpmd): 14:11:45 UTC: w03 worker node was shut down, so the connection to primary DB was terminated

FATAL:  could not receive data from WAL stream: FATAL:  terminating connection due to administrator command

Then this pod received promote request

2023-06-16 14:19:37.646 UTC [99] LOG:  received promote request

However, from DB log, I can see that the DB on this pod actually was trying to find a valid backup repository

WARN: repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
ERROR: [103]: unable to find a valid repository

Until 14:31:39, the new master pod completed the archive recovery (looks like a timeout and no more waiting for the valid backup repo) and then DB on this pod was changed to "read-write" mode

2023-06-16 14:31:39.408 UTC [99] LOG:  archive recovery complete
2023-06-16 14:31:39.430 UTC [97] LOG:  checkpoint starting: force
2023-06-16 14:31:39.435 UTC [94] LOG:  database system is ready to accept connections

If the backup repo is online (i.e. backup repo run on different worker node from the old master db), then db failover is fairly quick.

Hello @stanleyyzhu,

Thank you for the detailed description of your scenario. One other thing that might be helpful would be a copy of your PostgresCluster manifest, particularly if you are using any customized Postgres settings, as shown in our documentation.

Thanks a lot tjmoore4 for your guide. Here is the manifest that I'm using:

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: vdmfpgo
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-15.2-0
  postgresVersion: 15
  instances:
    - name: instance
      replicas: 3
      resources:
        requests:
          cpu: 4.0
          memory: 10Gi
        limits:
          cpu: 8.0
          memory: 16Gi
      dataVolumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 50Gi
      walVolumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 5Gi
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: kubernetes.io/hostname
            labelSelector:
              matchLabels:
                postgres-operator.crunchydata.com/cluster: vdmfpgo
                postgres-operator.crunchydata.com/instance-set: instance
  users:
    - name: hippo
      databases:
        - vdmfpgo
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.41-4
      repos:
      - name: repo1
        schedules:
          full: "7 1 * * *"
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 50Gi
      global:
        repo1-retention-full: "3"
      manual:
        repoName: repo1
        options:
         - --type=full
      restore:
        enabled: false
        repoName: repo1
        options:
        - --type=time
        - --target="2021-06-09 14:15:11-04"
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          max_connections: 500
          shared_buffers: 3GB
          wal_buffers: 16MB
    switchover:
      enabled: false
      # targetInstance: vdmfpgo-instance-example

@stanleyyzhu Thank you for providing your manifest. Two separate options might be worth exploring in more detail.

Certain Postgres configuration adjustments would likely minimize the cluster's wait time. For example, setting checkpoint_timeout might minimize the first wait you mentioned.

That said, a more robust option would be to define another pgBackRest repo in addition to your local repo-host. As described in the "Set Up Multiple Backup Repositories" section of the documentation, this would allow you to store backups in cloud based storage in addition to your local backups which I would expect to minimize your recovery time as well.

Hope this helps! Feel free to let us know how any further testing goes.

Many thanks @tjmoore4 for your suggestions

I'll test checkpoint_timeout later today. Currently the value of checkpoint_timeout in my test environment is the default one (5 min)

Regarding multiple repos, I actually thought about this but it might not work. Reason is that the 2 repos will be managed by the same repo-host if my understanding is correct. (please correct me if I'm wrong)

We are not allowed to use public clouds but I did a test with 2 local repos. IMHO one local PV repo + one remote repo would have the same thing:

Two repos are all managed by the same repo-host (i.e. one pod), and the local repo is a block device with "RWO", this local PVC cannot be mounted on to multiple instances.

When the worker node is down, K8S/OpenShift has no time to unmount the PVC from the old repo-host so the old repo-host will be stuck at status "TERMINATING". Consequently, K8S/OpenShift won't be able to create a new one to take over the resources for repo-host (doesn't matter the 2nd repo is local or remote).

Correct me if I'm wrong.

In the meantime, speaking of remote backup, I actually also have a side question: Based on pgbackrest, looks like SFTP is an option. However, this option is not mentioned in PGO v5 doc. Refer to STORAGE TYPE.

I'm wondering if there is any plan to have SFTP added in as an option? or if there is any reason PGO doesn't want to use SFTP? I think this could be a moderate supplementary for the environment where public cloud is not allowed.

Thanks a lot for your help

This is a quick update that I have changed checkpoint_timeout to 1 min but the wait time is still 20 min.

The difficulty here is that log doesn't show me what is running during the 20 min so it's hard to find out the right parameter to adjust :(

vdmfpgo=> show checkpoint_timeout;
 checkpoint_timeout
--------------------
 1min
(1 row)

vdmfpgo=>

Logs from the new master:

2023-06-22 17:08:18.216 UTC [377] FATAL:  could not receive data from WAL stream: FATAL:  terminating connection due to administrator command
2023-06-22 17:08:35.978 UTC [107] LOG:  restartpoint starting: time
2023-06-22 17:08:45.998 UTC [107] LOG:  restartpoint complete: wrote 101 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=9.973 s, sync=0.014 s, total=10.020 s; sync files=38, longest=0.011 s, average=0.001 s; distance=784 kB, estimate=24416 kB
2023-06-22 17:08:45.998 UTC [107] LOG:  recovery restart point at 20/AF0C77F8
2023-06-22 17:08:45.998 UTC [107] DETAIL:  Last completed transaction was at log time 2023-06-22 17:08:17.984396+00.
2023-06-22 17:08:46.697 UTC [965] ERROR:  cannot execute nextval() in a read-only transaction
2023-06-22 17:08:46.697 UTC [965] STATEMENT:  select nextval ('hibernate_sequence')

...
...

2023-06-22 17:27:43.548 UTC [109] LOG:  archive recovery complete
2023-06-22 17:27:43.561 UTC [803] ERROR:  cannot execute INSERT in a read-only transaction
2023-06-22 17:27:43.561 UTC [803] STATEMENT:  INSERT INTO syncLoopPrevent (hashcode, accepted_at) VALUES($1, $2)
2023-06-22 17:27:43.597 UTC [107] LOG:  checkpoint starting: force
2023-06-22 17:27:43.603 UTC [103] LOG:  database system is ready to accept connections
2023-06-22 17:27:44.483 UTC [107] LOG:  checkpoint complete: wrote 10 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.827 s, sync=0.017 s, total=0.887 s; sync files=9, longest=0.011 s, average=0.002 s; distance=16222 kB, estimate=23596 kB
2023-06-22 17:28:43.499 UTC [107] LOG:  checkpoint starting: time
2023-06-22 17:28:55.921 UTC [107] LOG:  checkpoint complete: wrote 124 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=12.323 s, sync=0.043 s, total=12.422 s; sync files=43, longest=0.021 s, average=0.001 s; distance=865 kB, estimate=21323 kB

Hello @stanleyyzhu. Sorry to hear that your adjustments so far haven't resulted in any improvement. A couple of thoughts to consider. You mentioned that public cloud solutions are not an option, but would a self hosted S3-compatible system (like MinIO) be an option for testing? That should also allow for backups without requiring an additional PVC.

Also, we were wondering, in your original scenario, what your Pods and Nodes looked like when things started working again and your primary was back up. Was the w03 Node back online? If so, was the repo host running on w03 again or did it migrate to another Node? If w03 did not come back online, does the "TERMINATING" status ever resolve? Is the repo-host able to migrate to another Node?

Hi @tjmoore4 , sorry for the delay. I was tied up.

I'll deploy a MinIO server and share the test result ASAP (please bear with me as I have no experience with MinIO, not sure if it would be easy to bring up one).

And thanks a lot for introducing MinIO, I didn't know we could deploy S3-compatible system locally.

About pods/nodes allocation/status change:

what your Pods and Nodes looked like when things started working again and your primary was back up When things started working again, the repo-host pod was still not reachable and the primary DB was still unable to connect to the backup repo. The old master pod as well as the repo-host status were still "Terminating"

Was the w03 Node back online No not yet. When new master DB became read-write (service recovered) after 20 min, the w03 worker node was still down, I was still able to see unable to find a valid repository in the new master DB log. And the repo-host STATUS remained "Terminating" and no any other new repo-host pod was created (because the PVC used by this repo-host couldn't be released, OpenShift couldn't bring up another pod on another worker node in this case)

Below is the test result:

Pod allocation before shutting down a work node:

# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME                         READY   STATUS      RESTARTS   AGE     IP             NODE            NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-instance-hx2t-0      4/4     Running     0          5d4h    10.131.1.39    w03.xxxxxxxx    <none>           <none>            master
vdmfpgo-instance-lcsd-0      4/4     Running     0          15m     10.130.2.73    w02.xxxxxxxx    <none>           <none>            replica
vdmfpgo-instance-wwq8-0      4/4     Running     0          14m     10.128.2.170   w01.xxxxxxxx    <none>           <none>            replica
vdmfpgo-repo-host-0          2/2     Running     0          5d3h    10.131.1.66    w03.xxxxxxxx    <none>           <none>

The master DB instance and local repo-host were assigned to w03.

Then I shut down the w03 worker node at 21:53:36 UTC (w03 work node "STATUS" is "NotReady")

# oc get node
NAME            STATUS     ROLES                  AGE   VERSION
is01.xxxxxxxx   Ready      infra                  25d   v1.25.7+eab9cc9
is02.xxxxxxxx   Ready      infra                  25d   v1.25.7+eab9cc9
is03.xxxxxxxx   Ready      infra                  25d   v1.25.7+eab9cc9
m01.xxxxxxxx    Ready      control-plane,master   25d   v1.25.7+eab9cc9
m02.xxxxxxxx    Ready      control-plane,master   25d   v1.25.7+eab9cc9
m03.xxxxxxxx    Ready      control-plane,master   25d   v1.25.7+eab9cc9
w01.xxxxxxxx    Ready      worker                 25d   v1.25.7+eab9cc9
w02.xxxxxxxx    Ready      worker                 25d   v1.25.7+eab9cc9
w03.xxxxxxxx    NotReady   worker                 25d   v1.25.7+eab9cc9

PGO detected the unreachable of master DB, instance "vdmfpgo-instance-wwq8-0" was promoted as a new master

# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME                         READY   STATUS      RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-instance-hx2t-0      4/4     Running     0          5d4h    10.131.1.39    w03.xxxxxxxx   <none>           <none>            master
vdmfpgo-instance-lcsd-0      4/4     Running     0          33m     10.130.2.73    w02.xxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-wwq8-0      4/4     Running     0          33m     10.128.2.170   w01.xxxxxxxx   <none>           <none>            master
vdmfpgo-repo-host-0          2/2     Running     0          5d4h    10.131.1.66    w03.xxxxxxxx   <none>           <none>

At this point, the new master DB was read-only because it tried to connect to the repo-host which was also not reachable

vdmfpgo=> SELECT pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)

About 5-6 min later (~ 21:59 UTC), OpenShift started to try to terminate the pods running on w03 worker node

# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME                         READY   STATUS        RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-instance-hx2t-0      4/4     Terminating   0          5d4h    10.131.1.39    w03.xxxxxxxx   <none>           <none>            master
vdmfpgo-instance-lcsd-0      4/4     Running       0          35m     10.130.2.73    w02.xxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-wwq8-0      4/4     Running       0          35m     10.128.2.170   w01.xxxxxxxx   <none>           <none>            master
vdmfpgo-repo-host-0          2/2     Terminating   0          5d4h    10.131.1.66    w03.xxxxxxxx   <none>           <none>

This status will remain forever (below I will post the status after waiting for 1 hour).

About 20 min later since shutting down w03 worker node (~22:14 UTC), the new master DB became read-write (i.e. recovered)

vdmfpgo=> SELECT CURRENT_TIMESTAMP;SELECT pg_is_in_recovery();
       current_timestamp
-------------------------------
 2023-06-27 22:14:02.735572+00
(1 row)

 pg_is_in_recovery
-------------------
 f
(1 row)

The pods with "STATUS=Terminating" remained. And I double checked the new master DB logs, confirmed that "unable to find a valid repository" error still popped up (to confirm that local repo-host was still down)

2023-06-27 22:17:35.413 UTC [108] LOG:  checkpoint starting: time
ERROR: [103]: unable to find a valid repository:
       repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
2023-06-27 22:17:35.799 UTC [3398] LOG:  archive command failed with exit code 103
2023-06-27 22:17:35.799 UTC [3398] DETAIL:  The failed archive command was: pgbackrest --stanza=db archive-push "pg_wal/00000008.history"
2023-06-27 22:17:45.314 UTC [108] LOG:  checkpoint complete: wrote 96 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=9.824 s, sync=0.023 s, total=9.902 s; sync files=39, longest=0.020 s, average=0.001 s; distance=16446 kB, estimate=16446 kB

Left system there for about 1 hour, checked again at about 23:17 UTC, pods with "STATUS=Terminating" were still there, and local repo-host was still down

# date; oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
Tue 27 Jun 2023 07:18:23 PM EDT
NAME                         READY   STATUS        RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-instance-hx2t-0      4/4     Terminating   0          5d6h    10.131.1.39    w03.xxxxxxxx   <none>           <none>            master
vdmfpgo-instance-lcsd-0      4/4     Running       0          113m    10.130.2.73    w02.xxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-wwq8-0      4/4     Running       0          113m    10.128.2.170   w01.xxxxxxxx   <none>           <none>            master
vdmfpgo-repo-host-0          2/2     Terminating   0          5d5h    10.131.1.66    w03.xxxxxxxx   <none>           <none>

2023-06-27 23:20:35.105 UTC [108] LOG:  checkpoint starting: time
2023-06-27 23:20:46.205 UTC [108] LOG:  checkpoint complete: wrote 110 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=11.037 s, sync=0.018 s, total=11.101 s; sync files=42, longest=0.010 s, average=0.001 s; distance=16114 kB, estimate=16560 kB
2023-06-27 23:21:35.226 UTC [108] LOG:  checkpoint starting: time
2023-06-27 23:21:45.655 UTC [108] LOG:  checkpoint complete: wrote 102 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=10.336 s, sync=0.009 s, total=10.430 s; sync files=39, longest=0.008 s, average=0.001 s; distance=16354 kB, estimate=16539 kB
ERROR: [103]: unable to find a valid repository:
       repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
2023-06-27 23:21:54.620 UTC [3398] LOG:  archive command failed with exit code 103
2023-06-27 23:21:54.620 UTC [3398] DETAIL:  The failed archive command was: pgbackrest --stanza=db archive-push "pg_wal/00000008.history"

vdmfpgo=> SELECT CURRENT_TIMESTAMP;SELECT pg_is_in_recovery();
      current_timestamp
------------------------------
 2023-06-27 23:23:14.64303+00
(1 row)

 pg_is_in_recovery
-------------------
 f
(1 row)

vdmfpgo=> create table test (ok bool);
CREATE TABLE
vdmfpgo=> select * from test;
 ok
----
(0 rows)

vdmfpgo=> drop table test;
DROP TABLE
vdmfpgo=>

Then I started up the worker node w03, waited less than 1 min, those "STATUS=Terminating" pods were all recreated: The old master DB was recreated on the same worker node because of "podAntiAffinity=requiredDuringSchedulingIgnoredDuringExecution" policy (I have only 3 worker nodes, each for one DB replica), The local repo-host was re-deployed on another worker node (the new one was w02)

# date; oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
Tue 27 Jun 2023 07:28:22 PM EDT
NAME                        READY   STATUS      RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-instance-hx2t-0     3/4     Running     0          51s     10.131.1.80    w03.xxxxxxxx   <none>           <none>
vdmfpgo-instance-lcsd-0     4/4     Running     0          123m    10.130.2.73    w02.xxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-wwq8-0     4/4     Running     0          123m    10.128.2.170   w01.xxxxxxxx   <none>           <none>            master
vdmfpgo-repo-host-0         2/2     Running     0          50s     10.130.2.149   w02.xxxxxxxx   <none>           <none>

And then all went back to normal.

I successfully made MinIO work with PGO today, which is not as difficult as I thought if only for testing purpose :)

Test result aligns with my previous guess

Summary

Because w03 shuts down, OpenShift doesn't have chance to unmount the local backup-repo PVC gracefully, so the repo-host-0 is stuck at status "Terminating". Therefore, OpenShift can't/won't schedule another repo-host pod to take over the job. Thus, the repo-host is not reachable.

Since both "local repo" and "MinIO repo" are all mounted to and managed by the same repo-host, the MinIO repo is also unusable though it's a remote backup-repo.

Seems that valid solutions could be

The best solution could be to have DB replicas and backup-repo-host distributed to different worker nodes. If this is not possible (which is my case), either leave it as-is (at least service will still be back after 20 min), or find the proper configure parameters to reduce the time for waiting DB recovery in the case when repo-host is not reachable.

Maybe a potentially possible solution could be: let each backup-repo be managed by individual repo-host (e.g. one repo-host pod manages local repo and another repo-host pod manages MinIO repo, and so on). But very likely this will break the whole current backup management logic, or even not doable at all.

Test procedures and test result

The "backups" portion of the new manifest is attached. MinIO is added in as repo2:

  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.41-4
      configuration:
      - secret:
          name: pgo-multi-repo-creds
      repos:
      - name: repo1
        schedules:
          full: "7 1 * * *"
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 50Gi
      - name: repo2
        s3:
          bucket: vdmf-minio
          endpoint: 172.21.103.21:9000
          region: srbhpp_minio
      global:
        repo1-retention-full: "3"
        repo2-s3-uri-style: path
        repo2-path: /pgbackrest/vdmfpgo/hippo-multi-repo/repo2
        repo2-storage-verify-tls: n
      manual:
        repoName: repo2
        options:
         - --type=full
      restore:
        enabled: false
        repoName: repo1
        options:
        - --type=time
        - --target="2021-06-09 14:15:11-04"

oc describe postgrescluster/vdmfpgo showed me that "Stanza Created = true" for both of the 2 repos (local + MinIO)

# oc describe postgrescluster/vdmfpgo
Name:         vdmfpgo
...
...
Status:
  Conditions:
    Last Transition Time:  2023-06-28T20:36:01Z
    Message:               pgBackRest dedicated repository host is ready
    Observed Generation:   1
    Reason:                RepoHostReady
    Status:                True
    Type:                  PGBackRestRepoHostReady
    Last Transition Time:  2023-06-28T20:35:57Z
    Message:               pgBackRest replica create repo is ready for backups
    Observed Generation:   1
    Reason:                StanzaCreated
    Status:                True
    Type:                  PGBackRestReplicaRepoReady
    Last Transition Time:  2023-06-28T20:39:07Z
    Message:               pgBackRest replica creation is now possible
    Observed Generation:   1
    Reason:                RepoBackupComplete
    Status:                True
    Type:                  PGBackRestReplicaCreate
    Last Transition Time:  2023-06-28T20:40:39Z
    Message:               Manual backup completed successfully
    Observed Generation:   1
    Reason:                ManualBackupComplete
    Status:                True
    Type:                  PGBackRestManualBackupSuccessful
  Database Revision:       69998879ff
  Instances:
    Name:              instance
    Ready Replicas:    3
    Replicas:          3
    Updated Replicas:  3
  Monitoring:
    Exporter Configuration:  559c4c97d6
  Observed Generation:       1
  Patroni:
    System Identifier:  7249838429033578591
  Pgbackrest:
    Manual Backup:
      Completion Time:  2023-06-28T20:40:37Z
      Finished:         true
      Id:               Wed 28 Jun 2023 04:40:17 PM EDT
      Start Time:       2023-06-28T20:40:18Z
      Succeeded:        1
    Repo Host:
      API Version:  apps/v1
      Kind:         StatefulSet
      Ready:        true
    Repos:
      Bound:                           true
      Name:                            repo1
      Replica Create Backup Complete:  true
      Stanza Created:                  true
      Volume:                          pvc-c08a3c08-779f-4d48-b258-ae050d8470c2
      Name:                            repo2
      Repo Options Hash:               586ff7cb44
      Stanza Created:                  true
  Proxy:
    Pg Bouncer:
      Postgres Revision:  5c9966f6bc
  Users Revision:         7645b7fc86
Events:
  Type    Reason           Age                  From                        Message
  ----    ------           ----                 ----                        -------
  Normal  RepoHostCreated  5m24s                postgrescluster-controller  created pgBackRest repository host StatefulSet/vdmfpgo-repo-host
  Normal  StanzasCreated   5m4s (x2 over 5m7s)  postgrescluster-controller  pgBackRest stanza creation completed successfully

Ran a manual backup then checked MinIO, the backup was there. i.e. integration was good.

OK. Now the environment is ready for testing. Both "local repo" and "MinIO repo" are all managed by the same repo-host-0 pod.

pods allocation and status before shutting down worker node:

# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-instance-m55r-0   4/4     Running   0          7m22s   10.128.2.213   w01.xxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-tgk8-0   4/4     Running   0          7m22s   10.130.3.169   w02.xxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-z28k-0   4/4     Running   0          7m22s   10.131.0.83    w03.xxxxxxxx   <none>           <none>            master
vdmfpgo-repo-host-0       2/2     Running   0          7m21s   10.131.0.84    w03.xxxxxxxx   <none>           <none>

The master DB and repo-host-0 were all running on worker node "w03"

Shut down worker node "w03" at 20:44:39 UTC

-sh-4.2$ openstack server stop w03 && date
Wed Jun 28 16:44:39 EDT 2023
-sh-4.2$

PGO detected the unreachable status and promoted "vdmfpgo-instance-tgk8-0" as the new master

# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-instance-m55r-0   4/4     Running   0          9m38s   10.128.2.213   w01.xxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-tgk8-0   4/4     Running   0          9m38s   10.130.3.169   w02.xxxxxxxx   <none>           <none>            master
vdmfpgo-instance-z28k-0   4/4     Running   0          9m38s   10.131.0.83    w03.xxxxxxxx   <none>           <none>            master
vdmfpgo-repo-host-0       2/2     Running   0          9m37s   10.131.0.84    w03.xxxxxxxx   <none>           <none>

Checked the new primary DB status. It was still in "recovery" mode

vdmfpgo=> SELECT CURRENT_TIMESTAMP;SELECT pg_is_in_recovery();
       current_timestamp
-------------------------------
 2023-06-28 20:47:10.530067+00
(1 row)

 pg_is_in_recovery
-------------------
 t
(1 row)

Checked the logs on the new primary server, connecting to repo-host-0 timeout (as w03 was down)

2023-06-28 20:42:48.546 UTC [200] LOG:  restartpoint complete: wrote 85 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=8.392 s, sync=0.015 s, total=8.455 s; sync files=37, longest=0.011 s, average=0.001 s; distance=98304 kB, estimate=98304 kB
2023-06-28 20:42:48.546 UTC [200] LOG:  recovery restart point at 0/A000028
2023-06-28 20:42:48.546 UTC [200] DETAIL:  Last completed transaction was at log time 2023-06-28 20:37:30.045233+00.
2023-06-28 20:44:39.208 UTC [215] FATAL:  could not receive data from WAL stream: FATAL:  terminating connection due to administrator command
WARN: repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'

About 5 min later, OpenShift started to terminate those pods on w03, including the old primary DB and repo-host-0

# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide; date
NAME                      READY   STATUS        RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-instance-m55r-0   4/4     Running       0          14m   10.128.2.213   w01.xxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-tgk8-0   4/4     Running       0          14m   10.130.3.169   w02.xxxxxxxx   <none>           <none>            master
vdmfpgo-instance-z28k-0   4/4     Terminating   0          14m   10.131.0.83    w03.xxxxxxxx   <none>           <none>            master
vdmfpgo-repo-host-0       2/2     Terminating   0          14m   10.131.0.84    w03.xxxxxxxx   <none>           <none>

Because the repo-host-0 cannot be terminated smoothly (as the local PVC cannot be unmounted successfully), there was no any new repo-host pod created.

So the repos are still unreachable (both "local repo" and "MinIO repo" are all managed by the same repo-host. When repo-host is not reachable, the 2 repos are not usable)

Check pod status, DB status and primary DB logs at around 21:00 UTC (about 15 min after w03 down), which shows that the DB is still under read-only (recovery) status

# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide; date
NAME                      READY   STATUS        RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-instance-m55r-0   4/4     Running       0          24m   10.128.2.213   w01.xxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-tgk8-0   4/4     Running       0          24m   10.130.3.169   w02.xxxxxxxx   <none>           <none>            master
vdmfpgo-instance-z28k-0   4/4     Terminating   0          24m   10.131.0.83    w03.xxxxxxxx   <none>           <none>            master
vdmfpgo-repo-host-0       2/2     Terminating   0          24m   10.131.0.84    w03.xxxxxxxx   <none>           <none>

vdmfpgo=> SELECT CURRENT_TIMESTAMP;SELECT pg_is_in_recovery();
       current_timestamp
-------------------------------
 2023-06-28 21:00:41.222543+00
(1 row)

 pg_is_in_recovery
-------------------
 t
(1 row)

bash-4.4$ tail -f /pgdata/pg15/log/postgresql-Wed.log
2023-06-28 20:37:40.922 UTC [202] LOG:  restored log file "000000010000000000000005" from archive
2023-06-28 20:37:41.187 UTC [202] LOG:  restored log file "000000010000000000000006" from archive
2023-06-28 20:37:41.400 UTC [202] LOG:  consistent recovery state reached at 0/605E828
2023-06-28 20:37:41.400 UTC [197] LOG:  database system is ready to accept read-only connections
2023-06-28 20:37:41.442 UTC [215] LOG:  started streaming WAL from primary at 0/7000000 on timeline 1
2023-06-28 20:42:40.092 UTC [200] LOG:  restartpoint starting: time
2023-06-28 20:42:48.546 UTC [200] LOG:  restartpoint complete: wrote 85 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=8.392 s, sync=0.015 s, total=8.455 s; sync files=37, longest=0.011 s, average=0.001 s; distance=98304 kB, estimate=98304 kB
2023-06-28 20:42:48.546 UTC [200] LOG:  recovery restart point at 0/A000028
2023-06-28 20:42:48.546 UTC [200] DETAIL:  Last completed transaction was at log time 2023-06-28 20:37:30.045233+00.
2023-06-28 20:44:39.208 UTC [215] FATAL:  could not receive data from WAL stream: FATAL:  terminating connection due to administrator command
WARN: repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
WARN: repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
2023-06-28 20:52:31.334 UTC [202] LOG:  invalid record length at 0/B000060: wanted 24, got 0
2023-06-28 20:52:31.334 UTC [202] LOG:  received promote request
2023-06-28 20:52:31.334 UTC [202] LOG:  redo done at 0/B000028 system usage: CPU: user: 0.00 s, system: 0.02 s, elapsed: 890.52 s
2023-06-28 20:52:31.334 UTC [202] LOG:  last completed transaction was at log time 2023-06-28 20:37:30.045233+00
WARN: repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
WARN: repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
2023-06-28 21:00:32.362 UTC [202] LOG:  selected new timeline ID: 2

About 20 min later, checked pod status, DB status and new primary DB logs again. The primary DB forced back to read-write mode, DB writing function became normal. PODs of the old primary DB and repo-host-0 still remained in "Terminating" status, and the new primary DB log still showed "timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'"

# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide; date
NAME                      READY   STATUS        RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-instance-m55r-0   4/4     Running       0          33m   10.128.2.213   w01.xxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-tgk8-0   4/4     Running       0          33m   10.130.3.169   w02.xxxxxxxx   <none>           <none>            master
vdmfpgo-instance-z28k-0   4/4     Terminating   0          33m   10.131.0.83    w03.xxxxxxxx   <none>           <none>            master
vdmfpgo-repo-host-0       2/2     Terminating   0          33m   10.131.0.84    w03.xxxxxxxx   <none>           <none>

vdmfpgo=> SELECT CURRENT_TIMESTAMP;SELECT pg_is_in_recovery();
       current_timestamp
-------------------------------
 2023-06-28 21:09:25.320753+00
(1 row)

 pg_is_in_recovery
-------------------
 f
(1 row)

vdmfpgo=> create table test (f1 bool);
CREATE TABLE

2023-06-28 21:00:32.362 UTC [202] LOG:  selected new timeline ID: 2
WARN: repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
2023-06-28 21:04:33.434 UTC [202] LOG:  archive recovery complete
2023-06-28 21:04:33.499 UTC [200] LOG:  checkpoint starting: force
2023-06-28 21:04:33.505 UTC [197] LOG:  database system is ready to accept connections
2023-06-28 21:04:33.595 UTC [200] LOG:  checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.023 s, sync=0.008 s, total=0.096 s; sync files=2, longest=0.004 s, average=0.004 s; distance=16384 kB, estimate=90112 kB

ERROR: [104]: archive-push command encountered error(s):
       repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
2023-06-28 21:08:34.028 UTC [1875] LOG:  archive command failed with exit code 104
2023-06-28 21:08:34.028 UTC [1875] DETAIL:  The failed archive command was: pgbackrest --stanza=db archive-push "pg_wal/00000002.history"
2023-06-28 21:09:33.895 UTC [200] LOG:  checkpoint starting: time
2023-06-28 21:09:34.255 UTC [200] LOG:  checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.107 s, sync=0.029 s, total=0.360 s; sync files=1, longest=0.029 s, average=0.029 s; distance=16383 kB, estimate=82739 kB
2023-06-28 21:09:40.980 UTC [899] STATEMENT:  create table test (f1 bool);

Then I started w03 worker node and service were all recovered. repo-host-0 was re-scheduled to w01 worker node

@stanleyyzhu Thank you for the detailed information, this has been very helpful in understanding your scenario. We've done a bit of testing, but we haven't been able to replicate your exact scenario. One thing we were curious about was the describe output from your repo-host Pod after you shut-down your w03 node, i.e. kubectl describe pod <repo host pod name>.

Thank you very much @tjmoore4 and PGO team,

Below are the describe outputs of repo-host-0 at different stages.

As you mentioned you couldn't replicate the scenario, wondering if your repo-host was stuck at "Terminating" status? Basically my OpenShift environment uses "ocs-storagecluster-ceph-rbd" as storageclass and the accessmode is "ReadWriteOnce". The "AccessMode" in the PGO manifest is "ReadWriteOnce" by default. If your team couldn't see "Terminating" status hanging there, not sure if your environment used filesystem or block device as storage (I kinda recall cephfs behavior is quite different from ceph-rbd)

BTW, it might be unfair to let you and your team spends too much time on this (as this is the community version). I'll be more than happy to continue to test and return info if your team is interested in this scenario. But if your team doesn't have much time, It's totally understandable and I'm OK as well (at least service can still be back after about 20 min)

# oc get pvc | grep vdmfpgo
vdmfpgo-instance-4dqs-pgdata           Bound    pvc-03743f0b-ad81-4663-b51e-e62607158b94   50Gi       RWO            ocs-storagecluster-ceph-rbd   21h
vdmfpgo-instance-4dqs-pgwal            Bound    pvc-0247f287-5cbc-4955-af02-d331897e3cbe   5Gi        RWO            ocs-storagecluster-ceph-rbd   21h
vdmfpgo-instance-64t9-pgdata           Bound    pvc-d5d6c501-cf46-4b35-a852-8fb589660c88   50Gi       RWO            ocs-storagecluster-ceph-rbd   21h
vdmfpgo-instance-64t9-pgwal            Bound    pvc-e5e64d31-ac33-4315-a462-a249fcd5d910   5Gi        RWO            ocs-storagecluster-ceph-rbd   21h
vdmfpgo-instance-c5k8-pgdata           Bound    pvc-d08d3f3a-63d8-41de-aa1d-a496ff300e14   50Gi       RWO            ocs-storagecluster-ceph-rbd   21h
vdmfpgo-instance-c5k8-pgwal            Bound    pvc-6a715a9b-8461-4e5e-95dc-a905ffa7d838   5Gi        RWO            ocs-storagecluster-ceph-rbd   21h
vdmfpgo-repo1                          Bound    pvc-a956c736-3e24-4de0-b118-d601bfa4bfbc   50Gi       RWO            ocs-storagecluster-ceph-rbd   21h

# oc get pvc/vdmfpgo-repo1 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2023-06-28T21:56:29Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    postgres-operator.crunchydata.com/cluster: vdmfpgo
    postgres-operator.crunchydata.com/data: pgbackrest
    postgres-operator.crunchydata.com/pgbackrest: ""
    postgres-operator.crunchydata.com/pgbackrest-repo: repo1
    postgres-operator.crunchydata.com/pgbackrest-volume: ""
  name: vdmfpgo-repo1
  namespace: vdmf-system
  ownerReferences:
  - apiVersion: postgres-operator.crunchydata.com/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: PostgresCluster
    name: vdmfpgo
    uid: ab3d43fe-bb43-4157-8039-1a1b811861ba
  resourceVersion: "22200322"
  uid: a956c736-3e24-4de0-b118-d601bfa4bfbc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Filesystem
  volumeName: pvc-a956c736-3e24-4de0-b118-d601bfa4bfbc
status:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 50Gi
  phase: Bound

Summary of what I saw from the describe of repo-host-0

After shutting down the worker node, repo-host-0 popped up a warning event "NodeNotReady" and "Conditions.Ready" became "False" This status remained for 5 min until openshift tried to terminate the pod, then the "Status" in the repo-host-0 describe turned to "Terminating" then no more change

Test procedure and test result

Before I shutdown any worker node, master DB pod and repo-host pod are running on w01

# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-instance-4dqs-0   4/4     Running   0          20h     10.128.2.230   w01.xxxxxxxx   <none>           <none>            master
vdmfpgo-instance-64t9-0   4/4     Running   0          20h     10.130.3.178   w02.xxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-c5k8-0   4/4     Running   0          3m32s   10.131.0.105   w03.xxxxxxxx   <none>           <none>            replica
vdmfpgo-repo-host-0       2/2     Running   0          20h     10.128.2.237   w01.xxxxxxxx   <none>           <none>

At this moment, the describe output of repo-host is: PGOv5_3673_describe_repo-host-0_b4_turnoff_w01.txt

Then I shut down w01, wait for about 30 sec, the describe output of repo-host is: PGOv5_3673_describe_repo-host-0_after_turnoff_w01_30sec.txt At this moment, "Conditions.Ready" turned to "false", and a warning event "NodeNotReady" popped up

Then I checked the describe again about 4 min later (before OpenShift terminating the repo-host-0): Same as the previous describe PGOv5_3673_describe_repo-host-0_after_turnoff_w01_4min.txt

About 5-6 min later, OpenShift changed the repo-host-0 status to "Terminating",

# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME                      READY   STATUS        RESTARTS   AGE   IP             NODE           NOMINATED NODE   READINESS GATES   ROLE
vdmfpgo-instance-4dqs-0   4/4     Terminating   0          20h   10.128.2.230   w01.xxxxxxxx   <none>           <none>            master
vdmfpgo-instance-64t9-0   4/4     Running       0          20h   10.130.3.178   w02.xxxxxxxx   <none>           <none>            replica
vdmfpgo-instance-c5k8-0   4/4     Running       0          11m   10.131.1.73    w03.xxxxxxxx   <none>           <none>            master
vdmfpgo-repo-host-0       2/2     Terminating   0          20h   10.128.2.237   w01.xxxxxxxx   <none>           <none>

At this moment, the describe of repo-host-0 was: PGOv5_3673_describe_repo-host-0_after_turnoff_w01_5min.txt The "Status" became "Terminating (lasts 8s)", and one new field "Termination Grace Period" popped up and value was "30s"

About 10 min later, checked the describe of repo-host-0 again: PGOv5_3673_describe_repo-host-0_after_turnoff_w01_10min.txt Almost same as the one after 5 min (only thing changed is the "age" and time of "lasts" of course)

About 20 min later, when DB new primary became read-write, checked the describe of repo-host-0 again: PGOv5_3673_describe_repo-host-0_after_turnoff_w01_20min.txt Almost same as the previous one (only "age" and time of "lasts" increased)

Hi @stanleyyzhu We are currently reassessing older issues to determine their relevance. To better prioritize the features and fixes required by our users for upcoming CPK v5 releases, we are identifying which CPK v5 issues, use cases, and enhancement requests remain valid, especially in the context of the latest CPK v5 release.

As we haven't received any updates on this issue for some time, we are closing it now. If you require further assistance or if this issue is still relevant to the latest CPK v5 release, please feel free to reopen this issue or ask a question in our Discord server..

For additional information about Crunchy Postgres for Kubernetes v5, including guidance for upgrading to the latest version of CPK v5, please refer to the latest documentation:

https://access.crunchydata.com/documentation/postgres-operator/latest/

CrunchyData / postgres-operator