Closed. Eric-zch closed this issue 1 year ago.
Hi @Eric-zch,
Thanks for reporting this issue. I've created a ticket.
@Eric-zch sorry to hear you are having trouble!
This appears to be similar to the following pgBackRest thread, which reports the same error and describes time sync and/or filesystem issues as potential culprits: https://github.com/pgbackrest/pgbackrest/issues/1505.
Can you provide more insight into your cluster configuration, specifically as it relates to storage? For instance, what type of file system is being used for both PostgreSQL and the pgBackRest repository?
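In the meantime, two quick checks can help confirm or rule out the clock/timestamp hypothesis. This is only a sketch; the pod names, container names, and the /pgdata/pg14 path below are assumptions based on typical PGO v5 naming conventions and should be adjusted to match your cluster:

# Compare wall-clock time (UTC epoch seconds) between the PostgreSQL pod and
# the pgBackRest repo host pod; a significant difference points to clock skew.
oc exec -n dev-pnst pnst-instance1-8xqx-0 -c database -- date -u +%s
oc exec -n dev-pnst pnst-repo-host-0 -c pgbackrest -- date -u +%s

# Confirm pg_control is present on the data volume and that its modification
# time is not in the future relative to the clocks checked above.
oc exec -n dev-pnst pnst-instance1-8xqx-0 -c database -- stat /pgdata/pg14/global/pg_control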
Hi @andrewlecuyer, below is my PostgresCluster configuration.
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: pnst
spec:
  (ignore)
  postgresVersion: 14
  openshift: true
  instances:
    - name: instance1
      replicas: 2
      minAvailable: 1
      dataVolumeClaimSpec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 20Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    postgres-operator.crunchydata.com/cluster: pnst
                    postgres-operator.crunchydata.com/instance-set: instance1
  backups:
    pgbackrest:
      global:
        repo1-retention-full: "30"
        repo1-retention-full-type: time
        delta: "y"
      repos:
        - name: repo1
          schedules:
            full: "0 0 * * 6"
            incremental: "0 1 * * *"
          volume:
            volumeClaimSpec:
              accessModes:
                - "ReadWriteOnce"
              resources:
                requests:
                  storage: 20Gi
  (ignore)
Storage
[devops@ ~]$ oc get sc
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
localblock kubernetes.io/no-provisioner Delete WaitForFirstConsumer false 61d
ocs-storagecluster-ceph-rbd openshift-storage.rbd.csi.ceph.com Delete Immediate true 61d
ocs-storagecluster-ceph-rgw openshift-storage.ceph.rook.io/bucket Delete Immediate false 61d
ocs-storagecluster-cephfs openshift-storage.cephfs.csi.ceph.com Delete Immediate true 61d
ocs-storagecluster-cephfs-retain (default) openshift-storage.cephfs.csi.ceph.com Retain Immediate true 10d
openshift-storage.noobaa.io openshift-storage.noobaa.io/obc Delete Immediate false 61d
[devops@ ~]$ oc get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-6efcb467-2b09-41d0-adaa-5fff3fab28a6 20Gi RWX Retain Bound dev-pnst/pnst-repo1 ocs-storagecluster-cephfs 60d
pvc-bf8620a2-5aa2-4f3a-9934-53e523facc75 20Gi RWO Retain Bound dev-pnst/pnst-instance1-g2g9-pgdata ocs-storagecluster-cephfs 12d
pvc-d6a9a4b0-6e00-4f03-b22d-41f2ca151941 20Gi RWO Retain Bound dev-pnst/pnst-instance1-8xqx-pgdata ocs-storagecluster-cephfs-retain 105m
[devops@ ~]$ oc get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
pnst-instance1-8xqx-pgdata Bound pvc-d6a9a4b0-6e00-4f03-b22d-41f2ca151941 20Gi RWO ocs-storagecluster-cephfs-retain 105m
pnst-instance1-g2g9-pgdata Bound pvc-bf8620a2-5aa2-4f3a-9934-53e523facc75 20Gi RWO ocs-storagecluster-cephfs 12d
pnst-repo1 Bound pvc-6efcb467-2b09-41d0-adaa-5fff3fab28a6 20Gi RWX ocs-storagecluster-cephfs 60d
Hi @Eric-zch,
Thanks for sharing this information. One similarity between your issue and the pgBackRest issue that Andrew linked is Ceph storage. We have seen performance issues with pgBackRest backups on cephfs internally; in some cases, an initial backup of a 35MB database took over 3 hours. We would like to continue digging into the cause of this issue, so we have a few more questions:
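One way to gather more detail in the meantime is to re-run the failing backup by hand with a more verbose console log level. A sketch, assuming the PGO v5 default repo-host pod and container names:

# Re-run the scheduled backup command manually with detailed console logging;
# per-file activity in the output can help spot cephfs slowness or timestamp anomalies.
oc exec -n dev-pnst pnst-repo-host-0 -c pgbackrest -- \
  pgbackrest backup --stanza=db --repo=1 --type=incr --log-level-console=detail

# Alternatively, trigger a one-off backup through the operator (this assumes
# spec.backups.pgbackrest.manual is configured with repoName: repo1):
oc annotate -n dev-pnst postgrescluster pnst \
  postgres-operator.crunchydata.com/pgbackrest-backup="$(date)" --overwrite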
Closing this issue as stale, but please feel free to reopen with the additional troubleshooting information requested and we'll be happy to take another look.
Overview
Sometimes a scheduled pgBackRest backup fails with an error.
Environment
Error messages from the backup Pod:
time="2023-05-09T02:00:04Z" level=info msg="crunchy-pgbackrest starts"
time="2023-05-09T02:00:04Z" level=info msg="debug flag set to false"
time="2023-05-09T02:00:04Z" level=info msg="backrest backup command requested"
time="2023-05-09T02:00:04Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=incr]"
time="2023-05-09T02:00:16Z" level=info msg="output=[]"
time="2023-05-09T02:00:16Z" level=info msg="stderr=[ERROR: [055]: pg_control must be present in all online backups\n HINT: is something wrong with the clock or filesystem timestamps?\n]"
time="2023-05-09T02:00:16Z" level=fatal msg="command terminated with exit code 55"
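Given the HINT in the error, it is also worth confirming that the OpenShift node clocks agree. A sketch, assuming chrony (the RHCOS default) is the time service on the nodes:

# Report time-sync status on every node via a debug pod.
for node in $(oc get nodes -o name); do
  echo "== $node"
  oc debug "$node" -- chroot /host chronyc tracking 2>/dev/null | grep -E 'System time|Leap status'
done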