CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

Error: pg_control must be present in all online backups #3648

Closed. Eric-zch closed this issue 1 year ago.

Eric-zch commented 1 year ago

Overview

Sometimes there is an error with the scheduled pgBackRest backup.

Environment

Error messages from the backup Pod.

time="2023-05-09T02:00:04Z" level=info msg="crunchy-pgbackrest starts" time="2023-05-09T02:00:04Z" level=info msg="debug flag set to false" time="2023-05-09T02:00:04Z" level=info msg="backrest backup command requested" time="2023-05-09T02:00:04Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=incr]" time="2023-05-09T02:00:16Z" level=info msg="output=[]" time="2023-05-09T02:00:16Z" level=info msg="stderr=[ERROR: [055]: pg_control must be present in all online backups\n HINT: is something wrong with the clock or filesystem timestamps?\n]" time="2023-05-09T02:00:16Z" level=fatal msg="command terminated with exit code 55"

ValClarkson commented 1 year ago

Hi @Eric-zch,

Thanks for reporting this issue. I've created a ticket.

andrewlecuyer commented 1 year ago

@Eric-zch sorry to hear you are having trouble!

This appears to be similar to the following pgBackRest thread for the same error, which describes time sync and/or filesystem issues as potential culprits: https://github.com/pgbackrest/pgbackrest/issues/1505.

Can you provide more insight into your cluster configuration, specifically as it relates to storage? For instance, what type of filesystem is being used for both PostgreSQL and the pgBackRest repository?
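One quick way to rule out clock skew across the cluster is to check time synchronization on each node, for example as sketched below (this assumes cluster-admin access and that timedatectl is available on the hosts):

# Check NTP/clock sync status on every node
for n in $(oc get nodes -o name); do
  echo "== $n"
  oc debug "$n" -- chroot /host timedatectl status 2>/dev/null | grep -E 'Local time|System clock synchronized|NTP service'
done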

Eric-zch commented 1 year ago

Hi @andrewlecuyer, below is my PostgresCluster configuration.

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: pnst
spec:
  (ignore)
  postgresVersion: 14
  openshift: true
  instances:
    - name: instance1
      replicas: 2
      minAvailable: 1
      dataVolumeClaimSpec:
        accessModes:
        - "ReadWriteOnce"
        resources:
          requests:
            storage: 20Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    postgres-operator.crunchydata.com/cluster: pnst
                    postgres-operator.crunchydata.com/instance-set: instance1
  backups:
    pgbackrest:
      global:
        repo1-retention-full: "30"
        repo1-retention-full-type: time
        delta: "y"
      repos:
      - name: repo1
        schedules:
          full: "0 0 * * 6"
          incremental: "0 1 * * *"
        volume:
          volumeClaimSpec:
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 20Gi
  (ignore)

Storage

[devops@ ~]$ oc get sc
NAME                                         PROVISIONER                             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
localblock                                   kubernetes.io/no-provisioner            Delete          WaitForFirstConsumer   false                  61d
ocs-storagecluster-ceph-rbd                  openshift-storage.rbd.csi.ceph.com      Delete          Immediate              true                   61d
ocs-storagecluster-ceph-rgw                  openshift-storage.ceph.rook.io/bucket   Delete          Immediate              false                  61d
ocs-storagecluster-cephfs                    openshift-storage.cephfs.csi.ceph.com   Delete          Immediate              true                   61d
ocs-storagecluster-cephfs-retain (default)   openshift-storage.cephfs.csi.ceph.com   Retain          Immediate              true                   10d
openshift-storage.noobaa.io                  openshift-storage.noobaa.io/obc         Delete          Immediate              false                  61d
[devops@ ~]$ oc get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                                        STORAGECLASS                       REASON   AGE
pvc-6efcb467-2b09-41d0-adaa-5fff3fab28a6   20Gi       RWX            Retain           Bound      dev-pnst/pnst-repo1                                          ocs-storagecluster-cephfs                   60d
pvc-bf8620a2-5aa2-4f3a-9934-53e523facc75   20Gi       RWO            Retain           Bound      dev-pnst/pnst-instance1-g2g9-pgdata                          ocs-storagecluster-cephfs                   12d
pvc-d6a9a4b0-6e00-4f03-b22d-41f2ca151941   20Gi       RWO            Retain           Bound      dev-pnst/pnst-instance1-8xqx-pgdata                          ocs-storagecluster-cephfs-retain            105m
[devops@ ~]$ oc get pvc
NAME                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                       AGE
pnst-instance1-8xqx-pgdata   Bound    pvc-d6a9a4b0-6e00-4f03-b22d-41f2ca151941   20Gi       RWO            ocs-storagecluster-cephfs-retain   105m
pnst-instance1-g2g9-pgdata   Bound    pvc-bf8620a2-5aa2-4f3a-9934-53e523facc75   20Gi       RWO            ocs-storagecluster-cephfs          12d
pnst-repo1                   Bound    pvc-6efcb467-2b09-41d0-adaa-5fff3fab28a6   20Gi       RWX            ocs-storagecluster-cephfs          60d
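To confirm which filesystem the Pods actually see on those volumes, something like the following can be used. The mount paths (/pgdata and /pgbackrest/repo1) and the repo host pod and container names are assumptions based on PGO v5 defaults rather than output from this cluster.

# Filesystem type and usage as seen from inside the instance pod
oc -n dev-pnst exec -it pnst-instance1-8xqx-0 -c database -- df -hT /pgdata

# Same check from the dedicated repo host pod
oc -n dev-pnst exec -it pnst-repo-host-0 -c pgbackrest -- df -hT /pgbackrest/repo1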
jmckulk commented 1 year ago

Hi @Eric-zch, thanks for sharing this information. One similarity between your issue and the pgBackRest issue that Andrew linked is Ceph storage. We have seen performance problems internally with pgBackRest backups on CephFS; in one case, an initial backup of a 35 MB database took over 3 hours. We would like to keep digging into what might be causing this, so we have a few more questions:
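If CephFS turns out to be the culprit, one experiment is to place the pgBackRest repository (and, if needed, the data volumes) on the RBD-backed storage class from the listing above. Below is a minimal sketch of the relevant part of the spec, assuming the ocs-storagecluster-ceph-rbd class and leaving everything else unchanged; storageClassName is a standard Kubernetes PVC field, not something recommended in this thread.

  backups:
    pgbackrest:
      repos:
      - name: repo1
        volume:
          volumeClaimSpec:
            storageClassName: ocs-storagecluster-ceph-rbd   # RBD block storage instead of CephFS
            accessModes:
            - "ReadWriteOnce"
            resources:
              requests:
                storage: 20Gi

Note that storageClassName cannot be changed on an existing PVC, so this would only apply to a newly created repository volume (or a new cluster).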

tjmoore4 commented 1 year ago

Closing this issue as stale, but please feel free to reopen with the additional troubleshooting information requested and we'll be happy to take another look.