this is likely fixed via https://github.com/deis/postgres/pull/112, but I have no evidence to back that up as I'm not sure what causes this. If you have a tarball of your minio buckets, I'd be happy to try to reproduce.
I've been able to reproduce this same outcome on two test clusters now: one in micro-kube and one in AWS. Both were using minio.
An interesting thread on this topic: the server there was in the middle of a backup when it was cleanly shut down, and on restart it hit the checkpoint error because of the leftover backup_label file: https://www.postgresql.org/message-id/D960CB61B694CF459DCFB4B0128514C293CEB7@exadv11.host.magwien.gv.at
I'll see if I can tar up the minio buckets.
thanks! I'd like to see if removing the backup label will allow you to carry on without issues.
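If you want to try that by hand in the meantime, the workaround would look roughly like this. This is a sketch, not a tested fix, and the data directory path is the one shown further down in this thread:

```
# with the server stopped, move the stale backup_label out of the way so the
# next startup doesn't go looking for the (missing) backup-start WAL segment
mv /var/lib/postgresql/data/backup_label /var/lib/postgresql/data/backup_label.old
```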
I just reproduced the same issue on GKE with minio, using the same repro steps.
this can be reliably reproduced with the following tarball and patch. Just place dbwal.tar.gz at contrib/ci/tmp/minio/dbwal.tar.gz, apply the patch, and follow the logs from make docker-build test. I am currently working on a fix for this issue, but here is a reliable starting point from which we can craft some fancy integration tests so this never happens again.
https://gist.github.com/bacongobbler/7520e558bbd8e69394b2832239f0fe73
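For anyone else trying this, the repro boils down to something like the following. The tarball and patch come from the gist above; repro.patch is just a placeholder name for whatever you save the patch as:

```
# rough repro sketch; dbwal.tar.gz and the patch are from the gist above
mkdir -p contrib/ci/tmp/minio
cp dbwal.tar.gz contrib/ci/tmp/minio/dbwal.tar.gz
git apply repro.patch
make docker-build test   # then watch the logs for the checkpoint error
```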
branch to test my hunch: https://github.com/bacongobbler/postgres/tree/fix-checkpoint-error
specifically https://github.com/bacongobbler/postgres/commit/8ee5e4c5c14b309c1a14e80c3592aadb8c91c3aa should hopefully fix the issue.
EDIT: nope. see below.
So I noticed something while looking into the backup_label file...
```
root@f7c8b6f65428:/var/lib/postgresql/data# cat backup_label
START WAL LOCATION: 0/10000028 (file 000000020000000000000010)
CHECKPOINT LOCATION: 0/10000028
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2016-06-07 22:14:09 UTC
LABEL: freeze_start_2016-06-07T22:14:08.397343+00:00
```
Notice that the start WAL location is in file 000000020000000000000010? Check out what's in minio:
```
minio@aa513d788d04:~/dbwal/wal_005$ ls
000000010000000000000002.lzo  000000010000000000000005.lzo  000000010000000000000008.lzo  00000001000000000000000B.lzo  00000001000000000000000E.lzo
000000010000000000000003.lzo  000000010000000000000006.lzo  000000010000000000000009.lzo  00000001000000000000000C.lzo  000000020000000000000011.lzo
000000010000000000000004.lzo  000000010000000000000007.lzo  00000001000000000000000A.lzo  00000001000000000000000D.lzo
```
Notice something missing? ;)
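To spell out the connection: with the default 16MB segments, a WAL file name is the timeline, the high 32 bits of the LSN, and the 16MB segment number within that range, so the START WAL LOCATION in backup_label maps directly to the file that isn't in the bucket. A quick, purely illustrative check:

```
# derive the expected WAL segment name from backup_label (assumes the default
# 16MB wal_segment_size); values are the ones from the backup_label above
timeline=2
lsn=0/10000028
hi=$(( 16#${lsn%%/*} ))
lo=$(( 16#${lsn##*/} ))
printf '%08X%08X%08X\n' "$timeline" "$hi" $(( lo / (16 * 1024 * 1024) ))
# prints 000000020000000000000010, the one segment missing from wal_005
```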
It appears that the first WAL segment is (occasionally) not being shipped to minio, and that's because we did not enable archive_mode = on for the first reboot in 003_restore_from_backup.sh, which would start shipping WAL logs immediately after booting. We only enable archive mode in 004. This is a simple fix that just requires enabling archive_mode in 003 before we boot up the server.
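The change in 003 would look something like the following. The archive_command shown is my guess at what this image uses (wal-e is clearly producing the .lzo files in the bucket), so treat it as a sketch rather than the actual patch:

```
# sketch of the 003 fix, not the actual patch: enable archiving before the
# first post-restore boot so the backup-start segment gets shipped too.
# PGDATA and the wal-e envdir path are assumptions about this image's layout.
cat >> "$PGDATA/postgresql.conf" <<EOF
archive_mode = on
archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'
EOF
```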
Eventually this will be fixed in #112 but I'll try to hack up a quicker fix for now.
credit goes to https://github.com/wal-e/wal-e/issues/251 for pointing me in the right direction.
A previous restore from backup succeeded, but on a second restore, the pod goes into a crash loop with the following logs: