Closed vit-h closed 1 year ago
Picking up this discussion in community support Discord channel so closing here.
@tjmoore4 The issue is not in the S3 repo. Here there is only repo1, which is a PVC, and it fails.
With debug mode enabled: Explore-logs-2023-09-20 22_57_42.txt
Created https://github.com/pgbackrest/pgbackrest/issues/2188 as well
pgo show backup itfdb --repoName=repo2 -n pgo
No stanzas exist in the repository.
pgo show backup itfdb --repoName=repo1 -n pgo
stanza: db
status: ok
cipher: none
db (current)
wal archive min/max (15): 000000010000000000000001/00000012000000C50000004E
full backup: 20230814-033327F
timestamp start/stop: 2023-08-14 03:33:27 / 2023-08-14 03:36:21
wal start/stop: 000000010000000000000004 / 000000010000000000000006
database size: 29.4MB, database backup size: 29.4MB
repo1: backup set size: 3.9MB, backup size: 3.9MB
pgo backup itfdb --repoName="repo1" --options="--type=full" -n pgo
postgresclusters/itfdb backup initiated
get pods -n pgo
NAME READY STATUS RESTARTS AGE
itfdb-backup-t9md-fnvg4 0/1 Error 0 115s
itfdb-backup-t9md-hvblg 1/1 Running 0 49s
itfdb-i1-9hjz-0 4/4 Running 0 44h
itfdb-repo-host-0 2/2 Running 0 3d
itfdb-repo1-diff-28252860-5hf8r 0/1 Error 0 24h
...
itfdb-repo1-diff-28254300-pdc8z 0/1 Error 0 10m
pgo-7769bc568-9v7tr 1/1 Running 0 44h
kubectl logs itfdb-backup-t9md-fnvg4 -n pgo
time="2023-09-21T01:10:05Z" level=info msg="crunchy-pgbackrest starts"
time="2023-09-21T01:10:05Z" level=info msg="debug flag set to false"
time="2023-09-21T01:10:05Z" level=info msg="backrest backup command requested"
time="2023-09-21T01:10:05Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=full]"
time="2023-09-21T01:11:08Z" level=info msg="output=[]"
time="2023-09-21T01:11:08Z" level=info msg="stderr=[ERROR: [082]: WAL segment 00000016000000C50000004B was not archived before the 60000ms timeout\n HINT: check the archive_command to ensure that all options are correct (especially --stanza).\n HINT: check the PostgreSQL server log for errors.\n HINT: run the 'start' command if the stanza was previously stopped.\n]"
time="2023-09-21T01:11:08Z" level=fatal msg="command terminated with exit code 82"
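The 60000ms in the error is pgBackRest's default archive-timeout (60 seconds). If archiving is slow rather than completely broken, the timeout can be raised through the PostgresCluster spec's pgBackRest global options; a minimal sketch, assuming a cluster spec like the one used here (the value 300 is illustrative, not a recommendation):

```yaml
# Illustrative excerpt of the PostgresCluster spec; archive-timeout is a
# pgBackRest option (in seconds) passed through via the global section.
spec:
  backups:
    pgbackrest:
      global:
        archive-timeout: "300"
```

This only buys time; the HINTs in the error still point at verifying archive_command and the PostgreSQL server log for the underlying archiving failure.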
@tsykoduk After I deleted repo1 (the PVC), it restarted my DB pod and then backups to the PVC started working. For a DEV env this is OK, but I am not sure what to do if the same issue occurs in PROD. What was the reason? How do I handle it properly in prod?
And you are right: S3 repo backups do not work, even after removing all backup repos and creating only the S3 one. I do not understand how to make it work.
@tsykoduk I was able to do a backup to S3 from the repo pod:
[postgres@itfdb-repo-host-0 /]$ pgbackrest backup --stanza=db --repo=2 --type=full --log-level-console=debug
@itfdb-repo-host-0.txt
but it does not work when run from the pgo client:
kubectl pgo backup itfdb --repoName="repo2" --options="--type=full --log-level-file=debug --log-level-console=debug --log-level-stderr=debug" -n pgo
PS C:\Users\vital> dv get pods -n pgo
NAME READY STATUS RESTARTS AGE
itfdb-backup-cdgz-2xz85 0/1 Error 0 4m31s
itfdb-backup-cdgz-77th4 0/1 Error 0 100s
itfdb-backup-cdgz-875th 0/1 Error 0 3m48s
itfdb-backup-cdgz-ngbgs 0/1 Error 0 4m12s
itfdb-backup-cdgz-ppl2p 0/1 Error 0 3m4s
itfdb-backup-cdgz-ptqkt 0/1 Error 0 4m26s
itfdb-i1-lgtr-0 4/4 Running 0 10h
itfdb-i1-qddn-0 4/4 Running 0 11h
itfdb-repo-host-0 2/2 Running 0 56m
pgo-574c7cf8c-xc57b 1/1 Running 0 27h
PS C:\Users\vital> dv kubectl logs itfdb-backup-cdgz-77th4 -n pgo
time="2023-09-22T04:50:46Z" level=info msg="crunchy-pgbackrest starts"
time="2023-09-22T04:50:46Z" level=info msg="debug flag set to false"
time="2023-09-22T04:50:46Z" level=info msg="backrest backup command requested"
time="2023-09-22T04:50:46Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=2 --type=full --log-level-file=debug --log-level-console=debug --log-level-stderr=debug]"
time="2023-09-22T04:50:46Z" level=info msg="output=[]"
time="2023-09-22T04:50:46Z" level=info msg="stderr=[hash b03259ad0b33196ff6f9c5181afcc4dca7472143 - does not match local hash 6f1ada90feb0e85d355f67d5e7484468e6f8d2b9 -]"
time="2023-09-22T04:50:46Z" level=fatal msg="command terminated with exit code 1"
PS C:\Users\vital>
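Since the backup succeeds when run by hand on the repo host, it may be worth bypassing the CLI entirely. As I understand the PGO v5 docs, the pgo backup command ultimately sets spec.backups.pgbackrest.manual and triggers it via an annotation; a sketch of doing the same directly (cluster name itfdb from this thread, field and annotation names assumed from the PGO documentation):

```yaml
# Hypothetical excerpt: configure the manual backup the CLI would trigger.
spec:
  backups:
    pgbackrest:
      manual:
        repoName: repo2
        options:
        - --type=full
```

Then trigger it with something like kubectl annotate -n pgo postgrescluster/itfdb postgres-operator.crunchydata.com/pgbackrest-backup="$(date)" and watch the resulting Job's logs.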
Overview
Please help me enable partial and full backups and synchronize replicas. I am not sure whether this is a bug or my misconfiguration. This is my demo env, and I am working on the PROD setup now. This situation scares me about going with PGO in PROD; I have already spent a week with ChatGPT trying to knock this issue down. I trusted a conference video that said it was easy, but new versions bring new challenges, and the lack of docs, Stack Overflow questions, and community makes it hard to solve without asking here.
Ready to jump on the call and screenshare (Discord: vit_h)
Environment
Logs and console
Explore-logs-2023-09-19 22_02_43.txt Explore-logs-2023-09-19 22_02_49.json.txt
get pods -n pgo
NAME READY STATUS RESTARTS AGE
itfdb-backup-7nj8-2f69w 0/1 Error 0 41m
itfdb-backup-7nj8-4w7xw 0/1 Error 0 36m
itfdb-backup-7nj8-57k66 0/1 Error 0 40m
itfdb-backup-7nj8-cpl9d 0/1 Error 0 32m
itfdb-backup-7nj8-d5glq 0/1 Error 0 44m
itfdb-backup-7nj8-fvwsp 0/1 Error 0 43m
itfdb-backup-7nj8-zhs76 0/1 Error 0 38m
itfdb-i1-9hjz-0 4/4 Running 0 21h
itfdb-i1-hnvh-0 4/4 Running 2 (22h ago) 23h
itfdb-repo-host-0 2/2 Running 0 2d1h
itfdb-repo1-diff-28252860-5hf8r 0/1 Error 0 71m
itfdb-repo1-diff-28252860-66xn4 0/1 Error 0 72m
itfdb-repo1-diff-28252860-6mms7 0/1 Error 0 61m
itfdb-repo1-diff-28252860-h7dst 0/1 Error 0 67m
itfdb-repo1-diff-28252860-lzb4x 0/1 Error 0 65m
itfdb-repo1-diff-28252860-pwt2n 0/1 Error 0 69m
itfdb-repo1-diff-28252860-t4bbd 0/1 Error 0 73m
itfdb-repo2-diff-28252860-99tvl 0/1 Error 0 73m
itfdb-repo2-diff-28252860-bgkdj 0/1 Error 0 70m
itfdb-repo2-diff-28252860-bkdz9 0/1 Error 0 72m
itfdb-repo2-diff-28252860-bnwml 0/1 Error 0 67m
itfdb-repo2-diff-28252860-d6h9l 0/1 Error 0 72m
itfdb-repo2-diff-28252860-d8bxd 0/1 Error 0 73m
itfdb-repo2-diff-28252860-xwmgj 0/1 Error 0 73m
pgo-7769bc568-9v7tr 1/1 Running 0 21h
kubectl logs itfdb-repo1-diff-28252860-66xn4 -n pgo
time="2023-09-20T01:01:09Z" level=info msg="crunchy-pgbackrest starts"
time="2023-09-20T01:01:09Z" level=info msg="debug flag set to false"
time="2023-09-20T01:01:09Z" level=info msg="backrest backup command requested"
time="2023-09-20T01:01:09Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=diff]"
time="2023-09-20T01:02:12Z" level=info msg="output=[]"
time="2023-09-20T01:02:12Z" level=info msg="stderr=[ERROR: [082]: WAL segment 00000013000000CA00000026 was not archived before the 60000ms timeout\n HINT: check the archive_command to ensure that all options are correct (especially --stanza).\n HINT: check the PostgreSQL server log for errors.\n HINT: run the 'start' command if the stanza was previously stopped.\n]"
time="2023-09-20T01:02:12Z" level=fatal msg="command terminated with exit code 82"
kubectl logs itfdb-backup-7nj8-2f69w -n pgo
time="2023-09-20T01:31:42Z" level=info msg="crunchy-pgbackrest starts"
time="2023-09-20T01:31:42Z" level=info msg="debug flag set to false"
time="2023-09-20T01:31:42Z" level=info msg="backrest backup command requested"
time="2023-09-20T01:31:42Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=full]"
time="2023-09-20T01:32:45Z" level=info msg="output=[]"
time="2023-09-20T01:32:45Z" level=info msg="stderr=[ERROR: [082]: WAL segment 00000013000000CA00000049 was not archived before the 60000ms timeout\n HINT: check the archive_command to ensure that all options are correct (especially --stanza).\n HINT: check the PostgreSQL server log for errors.\n HINT: run the 'start' command if the stanza was previously stopped.\n]"
time="2023-09-20T01:32:45Z" level=fatal msg="command terminated with exit code 82"
bash-4.4$ patronictl list
itfdb-i1-hnvh-0 accumulated 50 GB of WAL files, so I had to kill it and fail over to itfdb-i1-9hjz-0. Then I scaled down to one instance.
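WAL piling up like this is commonly a symptom of a replication slot that is no longer being consumed, or of archive_command failing so that WAL segments cannot be recycled. A quick check, run via psql inside the current primary pod (standard PostgreSQL query, offered as a diagnostic sketch):

```sql
-- Shows each replication slot and how much WAL it is holding back;
-- an inactive slot with a large "retained" value is the usual culprit.
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained
  FROM pg_replication_slots;
```

If a slot belongs to the instance that was killed and will not come back, dropping it with pg_drop_replication_slot() lets the WAL be reclaimed.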
Then I started getting this error when trying to make a backup:
kubectl logs itfdb-backup-x9tg-wt9vx -n pgo
time="2023-09-20T02:23:08Z" level=info msg="crunchy-pgbackrest starts"
time="2023-09-20T02:23:08Z" level=info msg="debug flag set to false"
time="2023-09-20T02:23:08Z" level=info msg="backrest backup command requested"
time="2023-09-20T02:23:08Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=full]"
time="2023-09-20T02:24:11Z" level=info msg="output=[]"
time="2023-09-20T02:24:11Z" level=info msg="stderr=[WARN: unable to check pg2: [DbConnectError] raised from remote-0 tls protocol on 'itfdb-i1-hnvh-0.itfdb-pods.pgo.svc.cluster.local.': unable to connect to 'dbname='postgres' port=5432 host='/tmp/postgres'': connection to server on socket \"/tmp/postgres/.s.PGSQL.5432\" failed: No such file or directory\n \tIs the server running locally and accepting connections on that socket?\nERROR: [082]: WAL segment 00000016000000BF000000F2 was not archived before the 60000ms timeout\n HINT: check the archive_command to ensure that all options are correct (especially --stanza).\n HINT: check the PostgreSQL server log for errors.\n HINT: run the 'start' command if the stanza was previously stopped.\n]"
time="2023-09-20T02:24:11Z" level=fatal msg="command terminated with exit code 82"
After some time, this:
kubectl logs itfdb-backup-x9tg-jfgfx -n pgo
time="2023-09-20T02:28:47Z" level=info msg="crunchy-pgbackrest starts"
time="2023-09-20T02:28:47Z" level=info msg="debug flag set to false"
time="2023-09-20T02:28:47Z" level=info msg="backrest backup command requested"
time="2023-09-20T02:28:47Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1 --type=full]"
time="2023-09-20T02:29:49Z" level=info msg="output=[]"
time="2023-09-20T02:29:49Z" level=info msg="stderr=[ERROR: [082]: WAL segment 00000016000000BF000000F9 was not archived before the 60000ms timeout\n HINT: check the archive_command to ensure that all options are correct (especially --stanza).\n HINT: check the PostgreSQL server log for errors.\n HINT: run the 'start' command if the stanza was previously stopped.\n]"
time="2023-09-20T02:29:49Z" level=fatal msg="command terminated with exit code 82"
Tried to find more logs; it seems this location is no longer valid:
bash-4.4$ ls
bin dev home lib64 lost+found mnt pgconf proc run srv tmp var boot etc lib licenses media opt pgdata root sbin sys usr
bash-4.4$ cd ./pgdata
bash-4.4$ ls
lost+found pg15 pg15_wal pgbackrest
bash-4.4$ cd ./pg15
bash-4.4$ ls
backup_label.old pg_dynshmem pg_notify pg_tblspc postgresql.base.conf.backup base pg_hba.conf pg_replslot pg_twophase postgresql.conf current_logfiles pg_hba.conf.backup pg_serial PG_VERSION postgresql.conf.backup global pg_ident.conf pg_snapshots pg_wal postmaster.opts log pg_ident.conf.backup pg_stat pg_xact postmaster.pid patroni.dynamic.json pg_logical pg_stat_tmp postgresql.auto.conf pg_commit_ts pg_multixact pg_subtrans postgresql.base.conf
bash-4.4$ cd ./pg_log
bash: cd: ./pg_log: No such file or directory
bash-4.4$ cd ../
bash-4.4$ cd ./pgbackrest
bash-4.4$ ls
log
bash-4.4$ cd ./log
bash-4.4$ ls
db-restore.log
bash-4.4$ cat db-restore.log
-------------------PROCESS START-------------------
2023-09-19 04:30:49.816 P00 INFO: restore command begin 2.45: --delta --exec-id=72-11e7425b --link-map=pg_wal=/pgdata/pg15_wal --log-path=/pgdata/pgbackrest/log --pg1-path=/pgdata/pg15 --repo=1 --repo1-host=itfdb-repo-host-0.itfdb-pods.pgo.svc.cluster.local. --repo1-host-ca-file=/etc/pgbackrest/conf.d/~postgres-operator/tls-ca.crt --repo1-host-cert-file=/etc/pgbackrest/conf.d/~postgres-operator/client-tls.crt --repo1-host-key-file=/etc/pgbackrest/conf.d/~postgres-operator/client-tls.key --repo1-host-type=tls --repo1-host-user=postgres --repo1-path=/pgbackrest/repo1 --repo2-path=/itfdb --repo2-s3-bucket=dvbackups --repo2-s3-endpoint=s3.us-east-1.wasabisys.com --repo2-s3-key= --repo2-s3-key-secret= --repo2-s3-region=us-east-1 --repo2-type=s3 --stanza=db --type=standby
2023-09-19 04:30:49.816 P00 WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg15' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
2023-09-19 04:30:49.869 P00 INFO: repo1: restore backup set 20230814-033327F, recovery will start at 2023-08-14 03:33:27
2023-09-19 04:30:49.869 P00 INFO: map link 'pg_wal' to '/pgdata/pg15_wal'
2023-09-19 04:31:00.194 P00 INFO: write updated /pgdata/pg15/postgresql.auto.conf
2023-09-19 04:31:00.208 P00 INFO: restore global/pg_control (performed last to ensure aborted restores cannot be started)
2023-09-19 04:31:00.211 P00 INFO: restore size = 29.4MB, file total = 1270
2023-09-19 04:31:00.211 P00 INFO: restore command end: completed successfully (10396ms)
bash-4.4$
Steps to Reproduce
pgo version
Client Version: v0.3.0
Operator Version: v5.4.0
The PGO install is a default singlenamespace install from the examples, with the namespace and operator name renamed to pgo. (Unfortunately I can't export it, because kustomize wants to fix it, and after the fix it does not work.)