Closed: stanleyyzhu closed this issue 6 months ago.
Hello @stanleyyzhu,
Thank you for the detailed description of your scenario. One other thing that might be helpful would be a copy of your PostgresCluster manifest, particularly if you are using any customized Postgres settings, as shown in our documentation.
Thanks a lot @tjmoore4 for the guidance. Here is the manifest I'm using:
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: vdmfpgo
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-15.2-0
  postgresVersion: 15
  instances:
    - name: instance
      replicas: 3
      resources:
        requests:
          cpu: 4.0
          memory: 10Gi
        limits:
          cpu: 8.0
          memory: 16Gi
      dataVolumeClaimSpec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 50Gi
      walVolumeClaimSpec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 5Gi
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  postgres-operator.crunchydata.com/cluster: vdmfpgo
                  postgres-operator.crunchydata.com/instance-set: instance
  users:
    - name: hippo
      databases:
        - vdmfpgo
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.41-4
      repos:
        - name: repo1
          schedules:
            full: "7 1 * * *"
          volume:
            volumeClaimSpec:
              accessModes:
                - "ReadWriteOnce"
              resources:
                requests:
                  storage: 50Gi
      global:
        repo1-retention-full: "3"
      manual:
        repoName: repo1
        options:
          - --type=full
      restore:
        enabled: false
        repoName: repo1
        options:
          - --type=time
          - --target="2021-06-09 14:15:11-04"
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          max_connections: 500
          shared_buffers: 3GB
          wal_buffers: 16MB
    switchover:
      enabled: false
      # targetInstance: vdmfpgo-instance-example
@stanleyyzhu Thank you for providing your manifest. Two separate options might be worth exploring in more detail.
Certain Postgres configuration adjustments would likely minimize the cluster's wait time. For example, setting checkpoint_timeout might minimize the first wait you mentioned.
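A minimal sketch of where that parameter would go, alongside the settings already in your patroni section (the value shown is purely illustrative, not a recommendation):

  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          checkpoint_timeout: 1min   # illustrative value; tune for your workload
          # existing max_connections, shared_buffers, wal_buffers stay as-is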
That said, a more robust option would be to define another pgBackRest repo in addition to your local repo-host. As described in the "Set Up Multiple Backup Repositories" section of the documentation, this would allow you to store backups in cloud-based storage in addition to your local backups, which I would expect to minimize your recovery time as well.
Hope this helps! Feel free to let us know how any further testing goes.
Many thanks @tjmoore4 for your suggestions. I'll test checkpoint_timeout later today. Currently the value of checkpoint_timeout in my test environment is the default (5 min).
Regarding multiple repos, I actually thought about this, but it might not work. The reason is that both repos would be managed by the same repo-host, if my understanding is correct (please correct me if I'm wrong).
We are not allowed to use public clouds, but I did a test with two local repos. IMHO one local PV repo plus one remote repo would behave the same way:
Both repos are managed by the same repo-host (i.e. one pod), and the local repo is a block device with "RWO" access, so its PVC cannot be mounted on multiple instances.
When the worker node goes down, K8S/OpenShift has no chance to unmount the PVC from the old repo-host, so the old repo-host gets stuck in "Terminating" status. Consequently, K8S/OpenShift won't be able to create a new one to take over the repo-host resources (regardless of whether the second repo is local or remote).
Correct me if I'm wrong.
In the meantime, speaking of remote backups, I have a side question: based on pgBackRest, it looks like SFTP is an option. However, this option is not mentioned in the PGO v5 doc (refer to STORAGE TYPE).
I'm wondering if there is any plan to add SFTP as an option, or if there is a reason PGO doesn't want to support it? I think it could be a useful supplement for environments where public cloud is not allowed.
Thanks a lot for your help
This is a quick update: I have changed checkpoint_timeout to 1 min, but the wait time is still 20 min.
The difficulty here is that the log doesn't show me what is running during those 20 min, so it's hard to find the right parameter to adjust :(
vdmfpgo=> show checkpoint_timeout;
checkpoint_timeout
--------------------
1min
(1 row)
vdmfpgo=>
Logs from the new master:
2023-06-22 17:08:18.216 UTC [377] FATAL: could not receive data from WAL stream: FATAL: terminating connection due to administrator command
2023-06-22 17:08:35.978 UTC [107] LOG: restartpoint starting: time
2023-06-22 17:08:45.998 UTC [107] LOG: restartpoint complete: wrote 101 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=9.973 s, sync=0.014 s, total=10.020 s; sync files=38, longest=0.011 s, average=0.001 s; distance=784 kB, estimate=24416 kB
2023-06-22 17:08:45.998 UTC [107] LOG: recovery restart point at 20/AF0C77F8
2023-06-22 17:08:45.998 UTC [107] DETAIL: Last completed transaction was at log time 2023-06-22 17:08:17.984396+00.
2023-06-22 17:08:46.697 UTC [965] ERROR: cannot execute nextval() in a read-only transaction
2023-06-22 17:08:46.697 UTC [965] STATEMENT: select nextval ('hibernate_sequence')
...
...
2023-06-22 17:27:43.548 UTC [109] LOG: archive recovery complete
2023-06-22 17:27:43.561 UTC [803] ERROR: cannot execute INSERT in a read-only transaction
2023-06-22 17:27:43.561 UTC [803] STATEMENT: INSERT INTO syncLoopPrevent (hashcode, accepted_at) VALUES($1, $2)
2023-06-22 17:27:43.597 UTC [107] LOG: checkpoint starting: force
2023-06-22 17:27:43.603 UTC [103] LOG: database system is ready to accept connections
2023-06-22 17:27:44.483 UTC [107] LOG: checkpoint complete: wrote 10 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.827 s, sync=0.017 s, total=0.887 s; sync files=9, longest=0.011 s, average=0.002 s; distance=16222 kB, estimate=23596 kB
2023-06-22 17:28:43.499 UTC [107] LOG: checkpoint starting: time
2023-06-22 17:28:55.921 UTC [107] LOG: checkpoint complete: wrote 124 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=12.323 s, sync=0.043 s, total=12.422 s; sync files=43, longest=0.021 s, average=0.001 s; distance=865 kB, estimate=21323 kB
Hello @stanleyyzhu. Sorry to hear that your adjustments so far haven't resulted in any improvement. A couple of thoughts to consider: you mentioned that public cloud solutions are not an option, but would a self-hosted S3-compatible system (like MinIO) be an option for testing? That should also allow for backups without requiring an additional PVC.
Also, we were wondering, in your original scenario, what your Pods and Nodes looked like when things started working again and your primary was back up.
Was the w03 Node back online? If so, was the repo host running on w03 again, or did it migrate to another Node?
If w03 did not come back online, does the "TERMINATING" status ever resolve? Is the repo-host able to migrate to another Node?
Hi @tjmoore4, sorry for the delay. I was tied up.
I'll deploy a MinIO server and share the test results ASAP (please bear with me, as I have no experience with MinIO and I'm not sure how easy it will be to bring one up).
And thanks a lot for introducing MinIO; I didn't know we could deploy an S3-compatible system locally.
what your Pods and Nodes looked like when things started working again and your primary was back up
When things started working again, the repo-host pod was still unreachable and the primary DB was still unable to connect to the backup repo. The old master pod and the repo-host were still in "Terminating" status.
Was the w03 Node back online
No, not yet. When the new master DB became read-write (service recovered) after 20 min, the w03 worker node was still down, and I could still see unable to find a valid repository in the new master DB log. The repo-host STATUS remained "Terminating" and no new repo-host pod was created (because the PVC used by this repo-host couldn't be released, OpenShift couldn't bring up another pod on another worker node in this case).
Pod allocation before shutting down a worker node:
# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ROLE
vdmfpgo-instance-hx2t-0 4/4 Running 0 5d4h 10.131.1.39 w03.xxxxxxxx <none> <none> master
vdmfpgo-instance-lcsd-0 4/4 Running 0 15m 10.130.2.73 w02.xxxxxxxx <none> <none> replica
vdmfpgo-instance-wwq8-0 4/4 Running 0 14m 10.128.2.170 w01.xxxxxxxx <none> <none> replica
vdmfpgo-repo-host-0 2/2 Running 0 5d3h 10.131.1.66 w03.xxxxxxxx <none> <none>
The master DB instance and the local repo-host were assigned to w03.
Then I shut down the w03 worker node at 21:53:36 UTC (the w03 worker node STATUS became "NotReady"):
# oc get node
NAME STATUS ROLES AGE VERSION
is01.xxxxxxxx Ready infra 25d v1.25.7+eab9cc9
is02.xxxxxxxx Ready infra 25d v1.25.7+eab9cc9
is03.xxxxxxxx Ready infra 25d v1.25.7+eab9cc9
m01.xxxxxxxx Ready control-plane,master 25d v1.25.7+eab9cc9
m02.xxxxxxxx Ready control-plane,master 25d v1.25.7+eab9cc9
m03.xxxxxxxx Ready control-plane,master 25d v1.25.7+eab9cc9
w01.xxxxxxxx Ready worker 25d v1.25.7+eab9cc9
w02.xxxxxxxx Ready worker 25d v1.25.7+eab9cc9
w03.xxxxxxxx NotReady worker 25d v1.25.7+eab9cc9
PGO detected that the master DB was unreachable, and instance "vdmfpgo-instance-wwq8-0" was promoted to new master:
# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ROLE
vdmfpgo-instance-hx2t-0 4/4 Running 0 5d4h 10.131.1.39 w03.xxxxxxxx <none> <none> master
vdmfpgo-instance-lcsd-0 4/4 Running 0 33m 10.130.2.73 w02.xxxxxxxx <none> <none> replica
vdmfpgo-instance-wwq8-0 4/4 Running 0 33m 10.128.2.170 w01.xxxxxxxx <none> <none> master
vdmfpgo-repo-host-0 2/2 Running 0 5d4h 10.131.1.66 w03.xxxxxxxx <none> <none>
At this point, the new master DB was read-only because it was still trying to connect to the repo-host, which was also unreachable:
vdmfpgo=> SELECT pg_is_in_recovery();
pg_is_in_recovery
-------------------
t
(1 row)
About 5-6 min later (~21:59 UTC), OpenShift started trying to terminate the pods running on the w03 worker node:
# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ROLE
vdmfpgo-instance-hx2t-0 4/4 Terminating 0 5d4h 10.131.1.39 w03.xxxxxxxx <none> <none> master
vdmfpgo-instance-lcsd-0 4/4 Running 0 35m 10.130.2.73 w02.xxxxxxxx <none> <none> replica
vdmfpgo-instance-wwq8-0 4/4 Running 0 35m 10.128.2.170 w01.xxxxxxxx <none> <none> master
vdmfpgo-repo-host-0 2/2 Terminating 0 5d4h 10.131.1.66 w03.xxxxxxxx <none> <none>
This status remains indefinitely (below I post the status after waiting for one hour).
About 20 min after shutting down the w03 worker node (~22:14 UTC), the new master DB became read-write (i.e. recovered):
vdmfpgo=> SELECT CURRENT_TIMESTAMP;SELECT pg_is_in_recovery();
current_timestamp
-------------------------------
2023-06-27 22:14:02.735572+00
(1 row)
pg_is_in_recovery
-------------------
f
(1 row)
The pods with "STATUS=Terminating" remained. I double-checked the new master DB logs and confirmed that the "unable to find a valid repository" error still appeared (confirming that the local repo-host was still down):
2023-06-27 22:17:35.413 UTC [108] LOG: checkpoint starting: time
ERROR: [103]: unable to find a valid repository:
repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
2023-06-27 22:17:35.799 UTC [3398] LOG: archive command failed with exit code 103
2023-06-27 22:17:35.799 UTC [3398] DETAIL: The failed archive command was: pgbackrest --stanza=db archive-push "pg_wal/00000008.history"
2023-06-27 22:17:45.314 UTC [108] LOG: checkpoint complete: wrote 96 buffers (0.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=9.824 s, sync=0.023 s, total=9.902 s; sync files=39, longest=0.020 s, average=0.001 s; distance=16446 kB, estimate=16446 kB
I left the system alone for about an hour and checked again at about 23:17 UTC: the pods with "STATUS=Terminating" were still there, and the local repo-host was still down.
# date; oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
Tue 27 Jun 2023 07:18:23 PM EDT
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ROLE
vdmfpgo-instance-hx2t-0 4/4 Terminating 0 5d6h 10.131.1.39 w03.xxxxxxxx <none> <none> master
vdmfpgo-instance-lcsd-0 4/4 Running 0 113m 10.130.2.73 w02.xxxxxxxx <none> <none> replica
vdmfpgo-instance-wwq8-0 4/4 Running 0 113m 10.128.2.170 w01.xxxxxxxx <none> <none> master
vdmfpgo-repo-host-0 2/2 Terminating 0 5d5h 10.131.1.66 w03.xxxxxxxx <none> <none>
2023-06-27 23:20:35.105 UTC [108] LOG: checkpoint starting: time
2023-06-27 23:20:46.205 UTC [108] LOG: checkpoint complete: wrote 110 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=11.037 s, sync=0.018 s, total=11.101 s; sync files=42, longest=0.010 s, average=0.001 s; distance=16114 kB, estimate=16560 kB
2023-06-27 23:21:35.226 UTC [108] LOG: checkpoint starting: time
2023-06-27 23:21:45.655 UTC [108] LOG: checkpoint complete: wrote 102 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=10.336 s, sync=0.009 s, total=10.430 s; sync files=39, longest=0.008 s, average=0.001 s; distance=16354 kB, estimate=16539 kB
ERROR: [103]: unable to find a valid repository:
repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
2023-06-27 23:21:54.620 UTC [3398] LOG: archive command failed with exit code 103
2023-06-27 23:21:54.620 UTC [3398] DETAIL: The failed archive command was: pgbackrest --stanza=db archive-push "pg_wal/00000008.history"
vdmfpgo=> SELECT CURRENT_TIMESTAMP;SELECT pg_is_in_recovery();
current_timestamp
------------------------------
2023-06-27 23:23:14.64303+00
(1 row)
pg_is_in_recovery
-------------------
f
(1 row)
vdmfpgo=> create table test (ok bool);
CREATE TABLE
vdmfpgo=> select * from test;
ok
----
(0 rows)
vdmfpgo=> drop table test;
DROP TABLE
vdmfpgo=>
Then I started the w03 worker node again. Within less than a minute, the "STATUS=Terminating" pods were all recreated: the old master DB was recreated on the same worker node because of the "requiredDuringSchedulingIgnoredDuringExecution" podAntiAffinity policy (I have only 3 worker nodes, one per DB replica), and the local repo-host was redeployed on another worker node (this time w02).
# date; oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
Tue 27 Jun 2023 07:28:22 PM EDT
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ROLE
vdmfpgo-instance-hx2t-0 3/4 Running 0 51s 10.131.1.80 w03.xxxxxxxx <none> <none>
vdmfpgo-instance-lcsd-0 4/4 Running 0 123m 10.130.2.73 w02.xxxxxxxx <none> <none> replica
vdmfpgo-instance-wwq8-0 4/4 Running 0 123m 10.128.2.170 w01.xxxxxxxx <none> <none> master
vdmfpgo-repo-host-0 2/2 Running 0 50s 10.130.2.149 w02.xxxxxxxx <none> <none>
And then all went back to normal.
I successfully got MinIO working with PGO today; it was not as difficult as I expected, at least for testing purposes :)
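For anyone curious, a single-node MinIO is enough for this kind of test. Something along these lines is all it takes (an illustrative sketch with placeholder credentials, not my exact deployment):

# Stand up a throwaway single-node MinIO with the S3 API on port 9000
podman run -d --name minio \
  -p 9000:9000 -p 9001:9001 \
  -e MINIO_ROOT_USER=minioadmin \
  -e MINIO_ROOT_PASSWORD=minioadmin \
  quay.io/minio/minio server /data --console-address ":9001"

After that, create a bucket (vdmf-minio in my case) and point repo2 at the endpoint.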
The test result aligns with my earlier guess:
Because w03 is shut down abruptly, OpenShift has no chance to unmount the local backup-repo PVC gracefully, so repo-host-0 gets stuck in "Terminating" status. Therefore, OpenShift can't/won't schedule another repo-host pod to take over the job, and the repo-host remains unreachable.
Since both the "local repo" and the "MinIO repo" are mounted to and managed by the same repo-host, the MinIO repo is also unusable even though it is a remote backup repo.
The best solution would be to distribute the DB replicas and the backup repo-host across different worker nodes. If that is not possible (which is my case), either leave it as-is (at least service comes back after 20 min), or find the right configuration parameters to reduce the time the new primary spends waiting for DB recovery when the repo-host is unreachable.
A potentially possible solution could be to have each backup repo managed by its own repo-host (e.g. one repo-host pod manages the local repo, another manages the MinIO repo, and so on). But very likely this would break the current backup management logic, or might not be doable at all.
The "backups" portion of the new manifest is attached. MinIO is added in as repo2:
backups:
  pgbackrest:
    image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.41-4
    configuration:
      - secret:
          name: pgo-multi-repo-creds
    repos:
      - name: repo1
        schedules:
          full: "7 1 * * *"
        volume:
          volumeClaimSpec:
            accessModes:
              - "ReadWriteOnce"
            resources:
              requests:
                storage: 50Gi
      - name: repo2
        s3:
          bucket: vdmf-minio
          endpoint: 172.21.103.21:9000
          region: srbhpp_minio
    global:
      repo1-retention-full: "3"
      repo2-s3-uri-style: path
      repo2-path: /pgbackrest/vdmfpgo/hippo-multi-repo/repo2
      repo2-storage-verify-tls: n
    manual:
      repoName: repo2
      options:
        - --type=full
    restore:
      enabled: false
      repoName: repo1
      options:
        - --type=time
        - --target="2021-06-09 14:15:11-04"
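For completeness, the referenced pgo-multi-repo-creds Secret carries the MinIO credentials in pgBackRest's s3.conf format, roughly like this (a sketch with placeholder values, following the PGO multi-repo documentation):

apiVersion: v1
kind: Secret
metadata:
  name: pgo-multi-repo-creds
stringData:
  s3.conf: |
    [global]
    repo2-s3-key=<minio-access-key>
    repo2-s3-key-secret=<minio-secret-key>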
oc describe postgrescluster/vdmfpgo showed "Stanza Created = true" for both repos (local + MinIO):
# oc describe postgrescluster/vdmfpgo
Name: vdmfpgo
...
...
Status:
  Conditions:
    Last Transition Time:  2023-06-28T20:36:01Z
    Message:               pgBackRest dedicated repository host is ready
    Observed Generation:   1
    Reason:                RepoHostReady
    Status:                True
    Type:                  PGBackRestRepoHostReady
    Last Transition Time:  2023-06-28T20:35:57Z
    Message:               pgBackRest replica create repo is ready for backups
    Observed Generation:   1
    Reason:                StanzaCreated
    Status:                True
    Type:                  PGBackRestReplicaRepoReady
    Last Transition Time:  2023-06-28T20:39:07Z
    Message:               pgBackRest replica creation is now possible
    Observed Generation:   1
    Reason:                RepoBackupComplete
    Status:                True
    Type:                  PGBackRestReplicaCreate
    Last Transition Time:  2023-06-28T20:40:39Z
    Message:               Manual backup completed successfully
    Observed Generation:   1
    Reason:                ManualBackupComplete
    Status:                True
    Type:                  PGBackRestManualBackupSuccessful
  Database Revision:  69998879ff
  Instances:
    Name:              instance
    Ready Replicas:    3
    Replicas:          3
    Updated Replicas:  3
  Monitoring:
    Exporter Configuration:  559c4c97d6
  Observed Generation:       1
  Patroni:
    System Identifier:  7249838429033578591
  Pgbackrest:
    Manual Backup:
      Completion Time:  2023-06-28T20:40:37Z
      Finished:         true
      Id:               Wed 28 Jun 2023 04:40:17 PM EDT
      Start Time:       2023-06-28T20:40:18Z
      Succeeded:        1
    Repo Host:
      API Version:  apps/v1
      Kind:         StatefulSet
      Ready:        true
    Repos:
      Bound:                           true
      Name:                            repo1
      Replica Create Backup Complete:  true
      Stanza Created:                  true
      Volume:                          pvc-c08a3c08-779f-4d48-b258-ae050d8470c2
      Name:                            repo2
      Repo Options Hash:               586ff7cb44
      Stanza Created:                  true
  Proxy:
    Pg Bouncer:
      Postgres Revision:  5c9966f6bc
  Users Revision:  7645b7fc86
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal RepoHostCreated 5m24s postgrescluster-controller created pgBackRest repository host StatefulSet/vdmfpgo-repo-host
Normal StanzasCreated 5m4s (x2 over 5m7s) postgrescluster-controller pgBackRest stanza creation completed successfully
I ran a manual backup and then checked MinIO; the backup was there, i.e. the integration was good.
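(For reference, the manual backup defined under spec.backups.pgbackrest.manual is triggered by annotating the PostgresCluster, per the PGO v5 docs; a sketch of the command:)

oc annotate postgrescluster vdmfpgo \
  postgres-operator.crunchydata.com/pgbackrest-backup="$(date)" --overwrite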
OK, now the environment is ready for testing. Both the "local repo" and the "MinIO repo" are managed by the same repo-host-0 pod.
Pod allocation and status before shutting down the worker node:
# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ROLE
vdmfpgo-instance-m55r-0 4/4 Running 0 7m22s 10.128.2.213 w01.xxxxxxxx <none> <none> replica
vdmfpgo-instance-tgk8-0 4/4 Running 0 7m22s 10.130.3.169 w02.xxxxxxxx <none> <none> replica
vdmfpgo-instance-z28k-0 4/4 Running 0 7m22s 10.131.0.83 w03.xxxxxxxx <none> <none> master
vdmfpgo-repo-host-0 2/2 Running 0 7m21s 10.131.0.84 w03.xxxxxxxx <none> <none>
The master DB and repo-host-0 were both running on worker node "w03".
I shut down worker node "w03" at 20:44:39 UTC:
-sh-4.2$ openstack server stop w03 && date
Wed Jun 28 16:44:39 EDT 2023
-sh-4.2$
PGO detected the unreachable status and promoted "vdmfpgo-instance-tgk8-0" as the new master
# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ROLE
vdmfpgo-instance-m55r-0 4/4 Running 0 9m38s 10.128.2.213 w01.xxxxxxxx <none> <none> replica
vdmfpgo-instance-tgk8-0 4/4 Running 0 9m38s 10.130.3.169 w02.xxxxxxxx <none> <none> master
vdmfpgo-instance-z28k-0 4/4 Running 0 9m38s 10.131.0.83 w03.xxxxxxxx <none> <none> master
vdmfpgo-repo-host-0 2/2 Running 0 9m37s 10.131.0.84 w03.xxxxxxxx <none> <none>
Checked the new primary DB status. It was still in "recovery" mode
vdmfpgo=> SELECT CURRENT_TIMESTAMP;SELECT pg_is_in_recovery();
current_timestamp
-------------------------------
2023-06-28 20:47:10.530067+00
(1 row)
pg_is_in_recovery
-------------------
t
(1 row)
Checked the logs on the new primary server; connecting to repo-host-0 timed out (as w03 was down):
2023-06-28 20:42:48.546 UTC [200] LOG: restartpoint complete: wrote 85 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=8.392 s, sync=0.015 s, total=8.455 s; sync files=37, longest=0.011 s, average=0.001 s; distance=98304 kB, estimate=98304 kB
2023-06-28 20:42:48.546 UTC [200] LOG: recovery restart point at 0/A000028
2023-06-28 20:42:48.546 UTC [200] DETAIL: Last completed transaction was at log time 2023-06-28 20:37:30.045233+00.
2023-06-28 20:44:39.208 UTC [215] FATAL: could not receive data from WAL stream: FATAL: terminating connection due to administrator command
WARN: repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
About 5 min later, OpenShift started to terminate those pods on w03, including the old primary DB and repo-host-0
# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide; date
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ROLE
vdmfpgo-instance-m55r-0 4/4 Running 0 14m 10.128.2.213 w01.xxxxxxxx <none> <none> replica
vdmfpgo-instance-tgk8-0 4/4 Running 0 14m 10.130.3.169 w02.xxxxxxxx <none> <none> master
vdmfpgo-instance-z28k-0 4/4 Terminating 0 14m 10.131.0.83 w03.xxxxxxxx <none> <none> master
vdmfpgo-repo-host-0 2/2 Terminating 0 14m 10.131.0.84 w03.xxxxxxxx <none> <none>
Because repo-host-0 could not be terminated cleanly (the local PVC could not be unmounted successfully), no new repo-host pod was created.
So the repos were still unreachable (both the "local repo" and the "MinIO repo" are managed by the same repo-host; when the repo-host is unreachable, neither repo is usable).
I checked pod status, DB status and the primary DB logs at around 21:00 UTC (about 15 min after w03 went down); the DB was still in read-only (recovery) status:
# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide; date
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ROLE
vdmfpgo-instance-m55r-0 4/4 Running 0 24m 10.128.2.213 w01.xxxxxxxx <none> <none> replica
vdmfpgo-instance-tgk8-0 4/4 Running 0 24m 10.130.3.169 w02.xxxxxxxx <none> <none> master
vdmfpgo-instance-z28k-0 4/4 Terminating 0 24m 10.131.0.83 w03.xxxxxxxx <none> <none> master
vdmfpgo-repo-host-0 2/2 Terminating 0 24m 10.131.0.84 w03.xxxxxxxx <none> <none>
vdmfpgo=> SELECT CURRENT_TIMESTAMP;SELECT pg_is_in_recovery();
current_timestamp
-------------------------------
2023-06-28 21:00:41.222543+00
(1 row)
pg_is_in_recovery
-------------------
t
(1 row)
bash-4.4$ tail -f /pgdata/pg15/log/postgresql-Wed.log
2023-06-28 20:37:40.922 UTC [202] LOG: restored log file "000000010000000000000005" from archive
2023-06-28 20:37:41.187 UTC [202] LOG: restored log file "000000010000000000000006" from archive
2023-06-28 20:37:41.400 UTC [202] LOG: consistent recovery state reached at 0/605E828
2023-06-28 20:37:41.400 UTC [197] LOG: database system is ready to accept read-only connections
2023-06-28 20:37:41.442 UTC [215] LOG: started streaming WAL from primary at 0/7000000 on timeline 1
2023-06-28 20:42:40.092 UTC [200] LOG: restartpoint starting: time
2023-06-28 20:42:48.546 UTC [200] LOG: restartpoint complete: wrote 85 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=8.392 s, sync=0.015 s, total=8.455 s; sync files=37, longest=0.011 s, average=0.001 s; distance=98304 kB, estimate=98304 kB
2023-06-28 20:42:48.546 UTC [200] LOG: recovery restart point at 0/A000028
2023-06-28 20:42:48.546 UTC [200] DETAIL: Last completed transaction was at log time 2023-06-28 20:37:30.045233+00.
2023-06-28 20:44:39.208 UTC [215] FATAL: could not receive data from WAL stream: FATAL: terminating connection due to administrator command
WARN: repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
WARN: repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
2023-06-28 20:52:31.334 UTC [202] LOG: invalid record length at 0/B000060: wanted 24, got 0
2023-06-28 20:52:31.334 UTC [202] LOG: received promote request
2023-06-28 20:52:31.334 UTC [202] LOG: redo done at 0/B000028 system usage: CPU: user: 0.00 s, system: 0.02 s, elapsed: 890.52 s
2023-06-28 20:52:31.334 UTC [202] LOG: last completed transaction was at log time 2023-06-28 20:37:30.045233+00
WARN: repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
WARN: repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
2023-06-28 21:00:32.362 UTC [202] LOG: selected new timeline ID: 2
About 20 min later, I checked pod status, DB status and the new primary DB logs again. The primary DB had forced itself back to read-write mode and DB writes worked normally. The pods of the old primary DB and repo-host-0 still remained in "Terminating" status, and the new primary DB log still showed "timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'":
# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide; date
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ROLE
vdmfpgo-instance-m55r-0 4/4 Running 0 33m 10.128.2.213 w01.xxxxxxxx <none> <none> replica
vdmfpgo-instance-tgk8-0 4/4 Running 0 33m 10.130.3.169 w02.xxxxxxxx <none> <none> master
vdmfpgo-instance-z28k-0 4/4 Terminating 0 33m 10.131.0.83 w03.xxxxxxxx <none> <none> master
vdmfpgo-repo-host-0 2/2 Terminating 0 33m 10.131.0.84 w03.xxxxxxxx <none> <none>
vdmfpgo=> SELECT CURRENT_TIMESTAMP;SELECT pg_is_in_recovery();
current_timestamp
-------------------------------
2023-06-28 21:09:25.320753+00
(1 row)
pg_is_in_recovery
-------------------
f
(1 row)
vdmfpgo=> create table test (f1 bool);
CREATE TABLE
2023-06-28 21:00:32.362 UTC [202] LOG: selected new timeline ID: 2
WARN: repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
2023-06-28 21:04:33.434 UTC [202] LOG: archive recovery complete
2023-06-28 21:04:33.499 UTC [200] LOG: checkpoint starting: force
2023-06-28 21:04:33.505 UTC [197] LOG: database system is ready to accept connections
2023-06-28 21:04:33.595 UTC [200] LOG: checkpoint complete: wrote 2 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.023 s, sync=0.008 s, total=0.096 s; sync files=2, longest=0.004 s, average=0.004 s; distance=16384 kB, estimate=90112 kB
ERROR: [104]: archive-push command encountered error(s):
repo1: [HostConnectError] timeout connecting to 'vdmfpgo-repo-host-0.vdmfpgo-pods.vdmf-system.svc.cluster.local.:8432'
2023-06-28 21:08:34.028 UTC [1875] LOG: archive command failed with exit code 104
2023-06-28 21:08:34.028 UTC [1875] DETAIL: The failed archive command was: pgbackrest --stanza=db archive-push "pg_wal/00000002.history"
2023-06-28 21:09:33.895 UTC [200] LOG: checkpoint starting: time
2023-06-28 21:09:34.255 UTC [200] LOG: checkpoint complete: wrote 1 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.107 s, sync=0.029 s, total=0.360 s; sync files=1, longest=0.029 s, average=0.029 s; distance=16383 kB, estimate=82739 kB
2023-06-28 21:09:40.980 UTC [899] STATEMENT: create table test (f1 bool);
Then I started the w03 worker node and services were all recovered. repo-host-0 was rescheduled to the w01 worker node.
@stanleyyzhu Thank you for the detailed information; this has been very helpful in understanding your scenario. We've done a bit of testing, but we haven't been able to replicate your exact scenario. One thing we were curious about was the describe output from your repo-host Pod after you shut down your w03 node, i.e. kubectl describe pod <repo host pod name>.
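It might also be worth checking whether the CSI VolumeAttachment for the repo1 volume is still bound to the downed node after the shutdown, e.g. (a sketch; substitute the repo1 PV name from oc get pvc):

# List VolumeAttachment objects and filter on the repo1 PV name
oc get volumeattachment | grep <repo1-pv-name>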
Thank you very much @tjmoore4 and PGO team,
Below are the describe outputs of repo-host-0 at different stages.
Since you mentioned you couldn't replicate the scenario, I wonder whether your repo-host got stuck in "Terminating" status at all. My OpenShift environment uses "ocs-storagecluster-ceph-rbd" as the storage class and the access mode is "ReadWriteOnce" (the "AccessMode" in the PGO manifest is "ReadWriteOnce" by default). If your team didn't see the "Terminating" status hanging there, I wonder whether your environment used a filesystem or a block device as storage (I recall cephfs behaves quite differently from ceph-rbd).
BTW, it might be unfair to have you and your team spend too much time on this (as this is the community version). I'll be more than happy to continue testing and reporting back if your team is interested in this scenario. But if your team doesn't have much time, that's totally understandable and I'm OK as well (at least service still comes back after about 20 min).
# oc get pvc | grep vdmfpgo
vdmfpgo-instance-4dqs-pgdata Bound pvc-03743f0b-ad81-4663-b51e-e62607158b94 50Gi RWO ocs-storagecluster-ceph-rbd 21h
vdmfpgo-instance-4dqs-pgwal Bound pvc-0247f287-5cbc-4955-af02-d331897e3cbe 5Gi RWO ocs-storagecluster-ceph-rbd 21h
vdmfpgo-instance-64t9-pgdata Bound pvc-d5d6c501-cf46-4b35-a852-8fb589660c88 50Gi RWO ocs-storagecluster-ceph-rbd 21h
vdmfpgo-instance-64t9-pgwal Bound pvc-e5e64d31-ac33-4315-a462-a249fcd5d910 5Gi RWO ocs-storagecluster-ceph-rbd 21h
vdmfpgo-instance-c5k8-pgdata Bound pvc-d08d3f3a-63d8-41de-aa1d-a496ff300e14 50Gi RWO ocs-storagecluster-ceph-rbd 21h
vdmfpgo-instance-c5k8-pgwal Bound pvc-6a715a9b-8461-4e5e-95dc-a905ffa7d838 5Gi RWO ocs-storagecluster-ceph-rbd 21h
vdmfpgo-repo1 Bound pvc-a956c736-3e24-4de0-b118-d601bfa4bfbc 50Gi RWO ocs-storagecluster-ceph-rbd 21h
# oc get pvc/vdmfpgo-repo1 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    pv.kubernetes.io/bind-completed: "yes"
    pv.kubernetes.io/bound-by-controller: "yes"
    volume.beta.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
    volume.kubernetes.io/storage-provisioner: openshift-storage.rbd.csi.ceph.com
  creationTimestamp: "2023-06-28T21:56:29Z"
  finalizers:
    - kubernetes.io/pvc-protection
  labels:
    postgres-operator.crunchydata.com/cluster: vdmfpgo
    postgres-operator.crunchydata.com/data: pgbackrest
    postgres-operator.crunchydata.com/pgbackrest: ""
    postgres-operator.crunchydata.com/pgbackrest-repo: repo1
    postgres-operator.crunchydata.com/pgbackrest-volume: ""
  name: vdmfpgo-repo1
  namespace: vdmf-system
  ownerReferences:
    - apiVersion: postgres-operator.crunchydata.com/v1beta1
      blockOwnerDeletion: true
      controller: true
      kind: PostgresCluster
      name: vdmfpgo
      uid: ab3d43fe-bb43-4157-8039-1a1b811861ba
  resourceVersion: "22200322"
  uid: a956c736-3e24-4de0-b118-d601bfa4bfbc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: ocs-storagecluster-ceph-rbd
  volumeMode: Filesystem
  volumeName: pvc-a956c736-3e24-4de0-b118-d601bfa4bfbc
status:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 50Gi
  phase: Bound
After shutting down the worker node, repo-host-0 showed a warning event "NodeNotReady" and "Conditions.Ready" became "False". This status remained for 5 min, until OpenShift tried to terminate the pod; then the "Status" in the repo-host-0 describe turned to "Terminating" and did not change again.
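(For reference, that roughly 5-minute window matches the default tolerationSeconds of 300 that Kubernetes adds to Pods for the node.kubernetes.io/not-ready and node.kubernetes.io/unreachable taints; the implicit toleration looks roughly like the following, shown for illustration rather than copied from my cluster:)

tolerations:
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300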
Before I shut down any worker node, the master DB pod and the repo-host pod were running on w01:
# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ROLE
vdmfpgo-instance-4dqs-0 4/4 Running 0 20h 10.128.2.230 w01.xxxxxxxx <none> <none> master
vdmfpgo-instance-64t9-0 4/4 Running 0 20h 10.130.3.178 w02.xxxxxxxx <none> <none> replica
vdmfpgo-instance-c5k8-0 4/4 Running 0 3m32s 10.131.0.105 w03.xxxxxxxx <none> <none> replica
vdmfpgo-repo-host-0 2/2 Running 0 20h 10.128.2.237 w01.xxxxxxxx <none> <none>
At this moment, the describe output of the repo-host was: PGOv5_3673_describe_repo-host-0_b4_turnoff_w01.txt
Then I shut down w01 and waited about 30 seconds; the describe output of the repo-host was: PGOv5_3673_describe_repo-host-0_after_turnoff_w01_30sec.txt. At this moment, "Conditions.Ready" turned to "false" and a warning event "NodeNotReady" appeared.
I checked the describe again about 4 minutes later (before OpenShift started terminating repo-host-0); it was the same as the previous one: PGOv5_3673_describe_repo-host-0_after_turnoff_w01_4min.txt
About 5-6 min later, OpenShift changed the repo-host-0 status to "Terminating":
# oc get pod -l postgres-operator.crunchydata.com/cluster=vdmfpgo -L postgres-operator.crunchydata.com/role -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES ROLE
vdmfpgo-instance-4dqs-0 4/4 Terminating 0 20h 10.128.2.230 w01.xxxxxxxx <none> <none> master
vdmfpgo-instance-64t9-0 4/4 Running 0 20h 10.130.3.178 w02.xxxxxxxx <none> <none> replica
vdmfpgo-instance-c5k8-0 4/4 Running 0 11m 10.131.1.73 w03.xxxxxxxx <none> <none> master
vdmfpgo-repo-host-0 2/2 Terminating 0 20h 10.128.2.237 w01.xxxxxxxx <none> <none>
At this moment, the describe of repo-host-0 was: PGOv5_3673_describe_repo-host-0_after_turnoff_w01_5min.txt. The "Status" became "Terminating (lasts 8s)", and a new field "Termination Grace Period" appeared with the value "30s".
About 10 min later, I checked the describe of repo-host-0 again: PGOv5_3673_describe_repo-host-0_after_turnoff_w01_10min.txt. It was almost the same as the one after 5 min (only the "age" and the "lasts" time had changed, of course).
About 20 min later, when the new DB primary became read-write, I checked the describe of repo-host-0 again: PGOv5_3673_describe_repo-host-0_after_turnoff_w01_20min.txt. Almost the same as the previous one (only "age" and the "lasts" time had increased).
Hi @stanleyyzhu, we are currently reassessing older issues to determine their relevance. To better prioritize the features and fixes required by our users for upcoming CPK v5 releases, we are identifying which CPK v5 issues, use cases, and enhancement requests remain valid, especially in the context of the latest CPK v5 release.
As we haven't received any updates on this issue for some time, we are closing it now. If you require further assistance or if this issue is still relevant to the latest CPK v5 release, please feel free to reopen this issue or ask a question in our Discord server.
For additional information about Crunchy Postgres for Kubernetes v5, including guidance for upgrading to the latest version of CPK v5, please refer to the latest documentation:
https://access.crunchydata.com/documentation/postgres-operator/latest/
Problem description
PGO v5.3.1, three worker nodes running 3 Postgres DB pods + 1 local backup repository, so the local backup repository and one of the 3 DB pods must run on the same worker node.
When I shut down the worker node where the master Postgres DB is running (to simulate a power outage or a network issue), and the backup repo pod also runs on that worker node, the backup repo goes out of service as well. In this situation, the newly promoted master stays in recovery mode for 20 min.
My question is whether this 20 min can be reduced.
I understand that the best solution is to add one more worker node so the Postgres DB pods and the backup repo pod can be deployed to independent worker nodes; then shutting down the worker node with the master DB won't impact the backup repo. However, this is not doable in my environment. So the alternative could be: can I control the timer or threshold so that the newly promoted master DB pod stops waiting to reconnect to the backup repo sooner?
Environment
Steps to reproduce the issue
PostgreSQL DB has 3 replicas + 1 local backup repo running on 3 worker nodes
The master DB (jltz) is running on w03, which also hosts the backup repo (vdmfpgo-repo-host-0)
Shut down the w03 worker node at 14:11:31 UTC, wait a bit, and check pod status again:
The pods' STATUS has not changed yet (that takes about 5 min, which is normal), but the old master DB pod (jltz) and the backup repo (vdmfpgo-repo-host-0) are already unavailable at this point. The newly promoted master is vdmfpgo-instance-kpmd-0
Log in to the new master pod (kpmd); "pg_is_in_recovery()" returns t (i.e. the DB is in recovery)
Kept checking the state of the newly promoted DB; "pg_is_in_recovery()" turns to "f" after about 20 min.
Check the logs on the new master DB pod (kpmd): at 14:11:45 UTC the w03 worker node was shut down, so the connection to the primary DB was terminated
Then this pod received a promote request
However, the DB log shows that the DB on this pod was still trying to find a valid backup repository
Not until 14:31:39 did the new master pod complete the archive recovery (it looks like a timeout, after which it stopped waiting for a valid backup repo); the DB on this pod then switched to read-write mode
If the backup repo is online (i.e. the backup repo runs on a different worker node from the old master DB), DB failover is fairly quick.