CrunchyData / postgres-operator

Production PostgreSQL for Kubernetes, from high availability Postgres clusters to full-scale database-as-a-service.
https://access.crunchydata.com/documentation/postgres-operator/v5/
Apache License 2.0

v5.4.0 | Backup Fails | Replicas out of sync | WAL segment 000... was not archived before the 60000ms timeout #3725

Closed: vit-h closed this issue 1 year ago

vit-h commented 1 year ago

Overview

Please help me enable partial and full backups and synchronize the replicas. I'm not sure whether this is a bug or a misconfiguration on my side. This is my demo environment, and I am working on the PROD setup now. This situation makes me nervous about going with PGO in PROD; I have already spent a week with ChatGPT trying to knock this issue down. I trusted a conference video that said it was easy, but new versions bring new challenges, and the lack of docs, Stack Overflow questions, and community makes it hard to solve without asking here.

Ready to jump on the call and screenshare (Discord: vit_h)

Environment

Please provide the following details:

Logs and console

Explore-logs-2023-09-19 22_02_43.txt
Explore-logs-2023-09-19 22_02_49.json.txt

get pods -n pgo
NAME                              READY   STATUS    RESTARTS      AGE
itfdb-backup-7nj8-2f69w           0/1     Error     0             41m
itfdb-backup-7nj8-4w7xw           0/1     Error     0             36m
itfdb-backup-7nj8-57k66           0/1     Error     0             40m
itfdb-backup-7nj8-cpl9d           0/1     Error     0             32m
itfdb-backup-7nj8-d5glq           0/1     Error     0             44m
itfdb-backup-7nj8-fvwsp           0/1     Error     0             43m
itfdb-backup-7nj8-zhs76           0/1     Error     0             38m
itfdb-i1-9hjz-0                   4/4     Running   0             21h
itfdb-i1-hnvh-0                   4/4     Running   2 (22h ago)   23h
itfdb-repo-host-0                 2/2     Running   0             2d1h
itfdb-repo1-diff-28252860-5hf8r   0/1     Error     0             71m
itfdb-repo1-diff-28252860-66xn4   0/1     Error     0             72m
itfdb-repo1-diff-28252860-6mms7   0/1     Error     0             61m
itfdb-repo1-diff-28252860-h7dst   0/1     Error     0             67m
itfdb-repo1-diff-28252860-lzb4x   0/1     Error     0             65m
itfdb-repo1-diff-28252860-pwt2n   0/1     Error     0             69m
itfdb-repo1-diff-28252860-t4bbd   0/1     Error     0             73m
itfdb-repo2-diff-28252860-99tvl   0/1     Error     0             73m
itfdb-repo2-diff-28252860-bgkdj   0/1     Error     0             70m
itfdb-repo2-diff-28252860-bkdz9   0/1     Error     0             72m
itfdb-repo2-diff-28252860-bnwml   0/1     Error     0             67m
itfdb-repo2-diff-28252860-d6h9l   0/1     Error     0             72m
itfdb-repo2-diff-28252860-d8bxd   0/1     Error     0             73m
itfdb-repo2-diff-28252860-xwmgj   0/1     Error     0             73m
pgo-7769bc568-9v7tr               1/1     Running   0             21h

kubectl logs itfdb-repo1-diff-28252860-66xn4 -n pgo
time="2023-09-20T01:01:09Z" level=info msg="crunchy-pgbackrest starts"
time="2023-09-20T01:01:09Z" level=info msg="debug flag set to false"
time="2023-09-20T01:01:09Z" level=info msg="backrest backup command requested"
time="2023-09-20T01:01:09Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=diff]"
time="2023-09-20T01:02:12Z" level=info msg="output=[]"
time="2023-09-20T01:02:12Z" level=info msg="stderr=[ERROR: [082]: WAL segment 00000013000000CA00000026 was not archived before the 60000ms timeout\n HINT: check the archive_command to ensure that all options are correct (especially --stanza).\n HINT: check the PostgreSQL server log for errors.\n HINT: run the 'start' command if the stanza was previously stopped.\n]"
time="2023-09-20T01:02:12Z" level=fatal msg="command terminated with exit code 82"

kubectl logs itfdb-backup-7nj8-2f69w -n pgo
time="2023-09-20T01:31:42Z" level=info msg="crunchy-pgbackrest starts"
time="2023-09-20T01:31:42Z" level=info msg="debug flag set to false"
time="2023-09-20T01:31:42Z" level=info msg="backrest backup command requested"
time="2023-09-20T01:31:42Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=full]"
time="2023-09-20T01:32:45Z" level=info msg="output=[]"
time="2023-09-20T01:32:45Z" level=info msg="stderr=[ERROR: [082]: WAL segment 00000013000000CA00000049 was not archived before the 60000ms timeout\n HINT: check the archive_command to ensure that all options are correct (especially --stanza).\n HINT: check the PostgreSQL server log for errors.\n HINT: run the 'start' command if the stanza was previously stopped.\n]"
time="2023-09-20T01:32:45Z" level=fatal msg="command terminated with exit code 82"
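The errors above all point at WAL archiving rather than at the backup itself. A minimal sketch of how archiving could be checked from inside the current primary (pod names are the ones from this cluster; the database container name is an assumption based on PGO defaults):

# Archiver statistics show whether archive_command is failing and on which segment
kubectl exec -it itfdb-i1-9hjz-0 -c database -n pgo -- \
  psql -c "SELECT archived_count, last_archived_wal, failed_count, last_failed_wal, last_failed_time FROM pg_stat_archiver;"

# Confirm the archive_command the instance is actually running
kubectl exec -it itfdb-i1-9hjz-0 -c database -n pgo -- \
  psql -c "SHOW archive_command;"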

bash-4.4$ patronictl list

itfdb-i1-hnvh-0 accumulated 50 GB of WAL files, so I had to kill it and fail over to itfdb-i1-9hjz-0. Then I scaled down to one instance.
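WAL usually piles up like this because something is pinning the segments, most often a failing archive_command or an inactive replication slot. A hedged sketch of what could be checked before killing the pod (instance name from before the failover; container name assumed as above):

# Inactive slots with an old restart_lsn keep WAL around indefinitely
kubectl exec -it itfdb-i1-hnvh-0 -c database -n pgo -- \
  psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"

# Size of the WAL directory for this cluster's layout
kubectl exec -it itfdb-i1-hnvh-0 -c database -n pgo -- du -sh /pgdata/pg15_wal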

Then I started getting this error when trying to make a backup:

kubectl logs itfdb-backup-x9tg-wt9vx -n pgo
time="2023-09-20T02:23:08Z" level=info msg="crunchy-pgbackrest starts"
time="2023-09-20T02:23:08Z" level=info msg="debug flag set to false"
time="2023-09-20T02:23:08Z" level=info msg="backrest backup command requested"
time="2023-09-20T02:23:08Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=full]"
time="2023-09-20T02:24:11Z" level=info msg="output=[]"
time="2023-09-20T02:24:11Z" level=info msg="stderr=[WARN: unable to check pg2: [DbConnectError] raised from remote-0 tls protocol on 'itfdb-i1-hnvh-0.itfdb-pods.pgo.svc.cluster.local.': unable to connect to 'dbname='postgres' port=5432 host='/tmp/postgres'': connection to server on socket \"/tmp/postgres/.s.PGSQL.5432\" failed: No such file or directory\n \tIs the server running locally and accepting connections on that socket?\nERROR: [082]: WAL segment 00000016000000BF000000F2 was not archived before the 60000ms timeout\n HINT: check the archive_command to ensure that all options are correct (especially --stanza).\n HINT: check the PostgreSQL server log for errors.\n HINT: run the 'start' command if the stanza was previously stopped.\n]"
time="2023-09-20T02:24:11Z" level=fatal msg="command terminated with exit code 82"

After some time, this:

kubectl logs itfdb-backup-x9tg-jfgfx -n pgo
time="2023-09-20T02:28:47Z" level=info msg="crunchy-pgbackrest starts"
time="2023-09-20T02:28:47Z" level=info msg="debug flag set to false"
time="2023-09-20T02:28:47Z" level=info msg="backrest backup command requested"
time="2023-09-20T02:28:47Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=full]"
time="2023-09-20T02:29:49Z" level=info msg="output=[]"
time="2023-09-20T02:29:49Z" level=info msg="stderr=[ERROR: [082]: WAL segment 00000016000000BF000000F9 was not archived before the 60000ms timeout\n HINT: check the archive_command to ensure that all options are correct (especially --stanza).\n HINT: check the PostgreSQL server log for errors.\n HINT: run the 'start' command if the stanza was previously stopped.\n]"
time="2023-09-20T02:29:49Z" level=fatal msg="command terminated with exit code 82"

I tried to find more logs. This note from the docs no longer seems to be valid:

Note that some logs needed to troubleshoot may be found in the /pgdata/<CLUSTERNAME>/pg_log directory on your Postgres instance.

bash-4.4$ ls
bin  dev  home  lib64  lost+found  mnt  pgconf  proc  run  srv  tmp  var  boot  etc  lib  licenses  media  opt  pgdata  root  sbin  sys  usr
bash-4.4$ cd ./pgdata
bash-4.4$ ls
lost+found  pg15  pg15_wal  pgbackrest
bash-4.4$ cd ./pg15
bash-4.4$ ls
backup_label.old  pg_dynshmem  pg_notify  pg_tblspc  postgresql.base.conf.backup  base  pg_hba.conf  pg_replslot  pg_twophase  postgresql.conf  current_logfiles  pg_hba.conf.backup  pg_serial  PG_VERSION  postgresql.conf.backup  global  pg_ident.conf  pg_snapshots  pg_wal  postmaster.opts  log  pg_ident.conf.backup  pg_stat  pg_xact  postmaster.pid  patroni.dynamic.json  pg_logical  pg_stat_tmp  postgresql.auto.conf  pg_commit_ts  pg_multixact  pg_subtrans  postgresql.base.conf
bash-4.4$ cd ./pg_log
bash: cd: ./pg_log: No such file or directory
bash-4.4$ cd ../
bash-4.4$ cd ./pgbackrest
bash-4.4$ ls
log
bash-4.4$ cd ./log
bash-4.4$ ls
db-restore.log
bash-4.4$ cat db-restore.log
-------------------PROCESS START-------------------
2023-09-19 04:30:49.816 P00 INFO: restore command begin 2.45: --delta --exec-id=72-11e7425b --link-map=pg_wal=/pgdata/pg15_wal --log-path=/pgdata/pgbackrest/log --pg1-path=/pgdata/pg15 --repo=1 --repo1-host=itfdb-repo-host-0.itfdb-pods.pgo.svc.cluster.local. --repo1-host-ca-file=/etc/pgbackrest/conf.d/~postgres-operator/tls-ca.crt --repo1-host-cert-file=/etc/pgbackrest/conf.d/~postgres-operator/client-tls.crt --repo1-host-key-file=/etc/pgbackrest/conf.d/~postgres-operator/client-tls.key --repo1-host-type=tls --repo1-host-user=postgres --repo1-path=/pgbackrest/repo1 --repo2-path=/itfdb --repo2-s3-bucket=dvbackups --repo2-s3-endpoint=s3.us-east-1.wasabisys.com --repo2-s3-key= --repo2-s3-key-secret= --repo2-s3-region=us-east-1 --repo2-type=s3 --stanza=db --type=standby
2023-09-19 04:30:49.816 P00 WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg15' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
2023-09-19 04:30:49.869 P00 INFO: repo1: restore backup set 20230814-033327F, recovery will start at 2023-08-14 03:33:27
2023-09-19 04:30:49.869 P00 INFO: map link 'pg_wal' to '/pgdata/pg15_wal'
2023-09-19 04:31:00.194 P00 INFO: write updated /pgdata/pg15/postgresql.auto.conf
2023-09-19 04:31:00.208 P00 INFO: restore global/pg_control (performed last to ensure aborted restores cannot be started)
2023-09-19 04:31:00.211 P00 INFO: restore size = 29.4MB, file total = 1270
2023-09-19 04:31:00.211 P00 INFO: restore command end: completed successfully (10396ms)
bash-4.4$
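Since pg_log does not exist here, the actual server log location can be read from the running instance instead. A sketch, again assuming the database container name and using the current_logfiles file visible in the listing above:

# Ask Postgres where it writes its server logs
kubectl exec -it itfdb-i1-9hjz-0 -c database -n pgo -- psql -c "SHOW log_directory;"

# current_logfiles (seen in the ls output above) names the active log file(s)
kubectl exec -it itfdb-i1-9hjz-0 -c database -n pgo -- cat /pgdata/pg15/current_logfiles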

pgo show backup itfdb --repoName=repo1 -n pgo
stanza: db
    status: ok
    cipher: none

    db (current)
        wal archive min/max (15): 000000010000000000000001/00000012000000C50000004E

        full backup: 20230814-033327F
            timestamp start/stop: 2023-08-14 03:33:27 / 2023-08-14 03:36:21
            wal start/stop: 000000010000000000000004 / 000000010000000000000006
            database size: 29.4MB, database backup size: 29.4MB
            repo1: backup set size: 3.9MB, backup size: 3.9MB
pgo show backup itfdb --repoName=repo2 -n pgo
stanza: db
    status: error (no valid backups)
    cipher: none

    db (prior)
        wal archive min/max (15): 000000010000000000000001/00000005000000BE000000C1

    db (current)
        wal archive min/max (15): 000000010000000000000001/000000010000000000000009

Steps to Reproduce

REPRO

pgo version
Client Version: v0.3.0
Operator Version: v5.4.0

apiVersion: v1
data:
  init.sql: |
    \c itfdb
    GRANT CREATE ON SCHEMA public TO "itfdb";
kind: ConfigMap
metadata:
  name: init-sql
  namespace: pgo
---
apiVersion: v1
data:
  s3.conf: |
    W....reducted....DQo=
kind: Secret
metadata:
  name: pgo-s3-creds
  namespace: pgo
type: Opaque
---
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: itfdb
  namespace: pgo
spec:
  backups:
    pgbackrest:
      configuration:
      - secret:
          name: pgo-s3-creds
      global:
        repo1-retention-full: "5"
        repo2-path: /itfdb
        repo2-retention-full: "5"
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.45-2
      repos:
      - name: repo1
        schedules:
          differential: 0 1 * * 1-6
          full: 0 1 * * 0
        volume:
          volumeClaimSpec:
            accessModes:
            - ReadWriteOnce
            resources:
              requests:
                storage: 4Gi
      - name: repo2
        s3:
          bucket: dvbackups
          endpoint: s3.us-east-1.wasabisys.com
          region: us-east-1
        schedules:
          differential: 0 1 * * 1-6
          full: 0 1 * * 0
  databaseInitSQL:
    key: init.sql
    name: init-sql
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres-gis:ubi8-15.3-3.3-2
  instances:
  - dataVolumeClaimSpec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 40Gi
    name: i1
    replicas: 1
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          max_wal_size: 256MB
  postgresVersion: 15
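For reference, the 60000ms in the backup errors is pgBackRest's default archive-timeout. A sketch of raising it through the same global options block used in the manifest above (this only gives the archiver more time and does not fix a broken archive_command; that PGO passes this option straight through to pgBackRest is an assumption):

kubectl patch postgrescluster itfdb -n pgo --type merge \
  -p '{"spec":{"backups":{"pgbackrest":{"global":{"archive-timeout":"300"}}}}}'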

The PGO install is the default install/singlenamespace setup from the examples, with the namespace and operator name changed to pgo. (Unfortunately I can't export it, because kustomize wants to fix the kustomization, and after the fix it no longer works.)

kubectl kustomize ../envs/common/pgo/install/singlenamespace > ./singlenamespace.yaml    
# Warning: 'bases' is deprecated. Please use 'resources' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
# Warning: 'patchesJson6902' is deprecated. Please use 'patches' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
# Warning: 'patchesStrategicMerge' is deprecated. Please use 'patches' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
# Warning: 'patchesStrategicMerge' is deprecated. Please use 'patches' instead. Run 'kustomize edit fix' to update your Kustomization automatically.
tjmoore4 commented 1 year ago

Picking up this discussion in community support Discord channel so closing here.

vit-h commented 1 year ago

@tjmoore4 The issue is not in the S3 repo. Here it is only repo1, the PVC repo, that fails.

With debug mode enabled: Explore-logs-2023-09-20 22_57_42.txt

Created https://github.com/pgbackrest/pgbackrest/issues/2188 as well

pgo show backup itfdb --repoName=repo2 -n pgo
No stanzas exist in the repository.
pgo show backup itfdb --repoName=repo1 -n pgo
stanza: db
    status: ok
    cipher: none

    db (current)
        wal archive min/max (15): 000000010000000000000001/00000012000000C50000004E

        full backup: 20230814-033327F
            timestamp start/stop: 2023-08-14 03:33:27 / 2023-08-14 03:36:21
            wal start/stop: 000000010000000000000004 / 000000010000000000000006
            database size: 29.4MB, database backup size: 29.4MB
            repo1: backup set size: 3.9MB, backup size: 3.9MB
pgo backup itfdb --repoName="repo1" --options="--type=full" -n pgo
postgresclusters/itfdb backup initiated
get pods -n pgo
NAME                              READY   STATUS    RESTARTS   AGE
itfdb-backup-t9md-fnvg4           0/1     Error     0          115s
itfdb-backup-t9md-hvblg           1/1     Running   0          49s
itfdb-i1-9hjz-0                   4/4     Running   0          44h
itfdb-repo-host-0                 2/2     Running   0          3d
itfdb-repo1-diff-28252860-5hf8r   0/1     Error     0          24h
...
itfdb-repo1-diff-28254300-pdc8z   0/1     Error     0          10m
pgo-7769bc568-9v7tr               1/1     Running   0          44h
kubectl logs itfdb-backup-t9md-fnvg4 -n pgo
time="2023-09-21T01:10:05Z" level=info msg="crunchy-pgbackrest starts"
time="2023-09-21T01:10:05Z" level=info msg="debug flag set to false"
time="2023-09-21T01:10:05Z" level=info msg="backrest backup command requested"
time="2023-09-21T01:10:05Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=full]"
time="2023-09-21T01:11:08Z" level=info msg="output=[]"
time="2023-09-21T01:11:08Z" level=info msg="stderr=[ERROR: [082]: WAL segment 00000016000000C50000004B was not archived before the 60000ms timeout\n       HINT: check the archive_command to ensure that all options are correct (especially --stanza).\n       HINT: check the PostgreSQL server log for errors.\n       HINT: run the 'start' command if the stanza was previously stopped.\n]"
time="2023-09-21T01:11:08Z" level=fatal msg="command terminated with exit code 82"
vit-h commented 1 year ago

@tsykoduk After I deleted repo1 (the PVC), it restarted my DB pod and then backups to the PVC started working. For a DEV environment this is OK, but I'm not sure what to do if the same issue occurs in PROD. What was the reason? How should it be handled properly in prod?

And you are right, S3 repo backups do not work even after removing all backup repos and creating only the S3 one. I don't understand how to make it work.
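One way to narrow down whether the S3 repository itself is reachable is pgBackRest's own check and info commands run from the repo host; a sketch (the pgbackrest container name is an assumption):

# Verifies archiving and repository access for the stanza against repo2 (S3)
kubectl exec -it itfdb-repo-host-0 -c pgbackrest -n pgo -- pgbackrest check --stanza=db --repo=2

# Shows what backups and WAL pgBackRest can see in repo2
kubectl exec -it itfdb-repo-host-0 -c pgbackrest -n pgo -- pgbackrest info --stanza=db --repo=2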

vit-h commented 1 year ago

@tsykoduk I was able to do a backup to S3 from the repo pod:

[postgres@itfdb-repo-host-0 /]$ pgbackrest backup --stanza=db --repo=2 --type=full --log-level-console=debug

itfdb-repo-host-0.txt

but it does not work when run from the pgo client:

kubectl pgo backup itfdb --repoName="repo2" --options="--type=full --log-level-file=debug --log-level-console=debug --log-level-stderr=debug" -n pgo

PS C:\Users\vital> dv get pods -n pgo
NAME                      READY   STATUS    RESTARTS   AGE
itfdb-backup-cdgz-2xz85   0/1     Error     0          4m31s
itfdb-backup-cdgz-77th4   0/1     Error     0          100s
itfdb-backup-cdgz-875th   0/1     Error     0          3m48s
itfdb-backup-cdgz-ngbgs   0/1     Error     0          4m12s
itfdb-backup-cdgz-ppl2p   0/1     Error     0          3m4s
itfdb-backup-cdgz-ptqkt   0/1     Error     0          4m26s
itfdb-i1-lgtr-0           4/4     Running   0          10h
itfdb-i1-qddn-0           4/4     Running   0          11h
itfdb-repo-host-0         2/2     Running   0          56m
pgo-574c7cf8c-xc57b       1/1     Running   0          27h

PS C:\Users\vital> dv kubectl logs itfdb-backup-cdgz-77th4 -n pgo
time="2023-09-22T04:50:46Z" level=info msg="crunchy-pgbackrest starts"
time="2023-09-22T04:50:46Z" level=info msg="debug flag set to false"
time="2023-09-22T04:50:46Z" level=info msg="backrest backup command requested"
time="2023-09-22T04:50:46Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=2 --type=full --log-level-file=debug --log-level-console=debug --log-level-stderr=debug]"
time="2023-09-22T04:50:46Z" level=info msg="output=[]"
time="2023-09-22T04:50:46Z" level=info msg="stderr=[hash b03259ad0b33196ff6f9c5181afcc4dca7472143 - does not match local hash 6f1ada90feb0e85d355f67d5e7484468e6f8d2b9 -]"
time="2023-09-22T04:50:46Z" level=fatal msg="command terminated with exit code 1"
PS C:\Users\vital>
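The hash-mismatch failure suggests the backup Job is running against a pgBackRest configuration that no longer matches what the operator currently generates, which is plausible after the repos were removed and recreated. One hedged workaround, based on the PGO v5 backup documentation, is to re-trigger the manual backup so a fresh Job is created with the current configuration:

# Re-annotate the cluster to request a new one-off backup Job
kubectl annotate -n pgo postgrescluster itfdb --overwrite \
  postgres-operator.crunchydata.com/pgbackrest-backup="$(date)"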