deis / postgres

A PostgreSQL database used by Deis Workflow.
https://deis.com
MIT License

Second db restore fails #114

Closed: krancour closed this issue 8 years ago

krancour commented 8 years ago

A previous restore from backup succeeded, but on a second restore, the pod goes into a crash loop with the following logs:

k logs -f deis-database-pwceo --namespace=deis
The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
creating template1 database in /var/lib/postgresql/data/base/1 ... ok
initializing pg_authid ... ok
initializing dependencies ... ok
creating system views ... ok
loading system objects' descriptions ... ok
creating collations ... ok
creating conversions ... ok
creating dictionaries ... ok
setting privileges on built-in objects ... ok
creating information schema ... ok
loading PL/pgSQL server-side language ... ok
vacuuming database template1 ... ok
copying template1 to template0 ... ok
copying template1 to postgres ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

    postgres -D /var/lib/postgresql/data
or
    pg_ctl -D /var/lib/postgresql/data -l logfile start

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.
waiting for server to start....LOG:  database system was shut down at 2016-06-07 22:18:47 UTC
LOG:  MultiXact member wraparound protections are now enabled
LOG:  autovacuum launcher started
LOG:  database system is ready to accept connections
 done
server started
CREATE DATABASE

CREATE ROLE

/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/001_setup_envdir.sh

/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/002_create_bucket.sh

/docker-entrypoint.sh: running /docker-entrypoint-initdb.d/003_restore_from_backup.sh
Rebooting postgres to enable archive mode
LOG:  received smart shutdown request
waiting for server to shut down....LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
 done
server stopped
waiting for server to start....LOG:  database system was shut down at 2016-06-07 22:18:50 UTC
LOG:  MultiXact member wraparound protections are now enabled
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
 done
server started
Found backups. Restoring from backup...
waiting for server to shut down...LOG:  received smart shutdown request
.LOG:  autovacuum launcher shutting down
LOG:  shutting down
LOG:  database system is shut down
 done
server stopped
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-fetch".
        STRUCTURED: time=2016-06-07T22:18:54.358368-00 pid=104
wal_e.blobstore.s3.calling_format INFO     MSG: connecting to S3 with WALE_S3_ENDPOINT (http://10.3.0.7:9000)
        STRUCTURED: time=2016-06-07T22:18:54.426621-00 pid=104
botocore.credentials INFO     Found credentials in environment variables.
boto3.resources.collection INFO     Calling paginated s3:list_objects with {'Prefix': 'basebackups_005/', u'Bucket': 'dbwal'}
botocore.vendored.requests.packages.urllib3.connectionpool INFO     Starting new HTTP connection (1): 10.3.0.7
wal_e.blobstore.s3.calling_format INFO     MSG: connecting to S3 with WALE_S3_ENDPOINT (http://10.3.0.7:9000)
        STRUCTURED: time=2016-06-07T22:18:54.495469-00 pid=104
wal_e.blobstore.s3.calling_format INFO     MSG: connecting to S3 with WALE_S3_ENDPOINT (http://10.3.0.7:9000)
        STRUCTURED: time=2016-06-07T22:18:54.499328-00 pid=104
boto3.resources.action INFO     Calling s3:get_object with {u'Bucket': 'dbwal', u'Key': 'basebackups_005/base_000000020000000000000010_00000040_backup_stop_sentinel.json'}
botocore.vendored.requests.packages.urllib3.connectionpool INFO     Starting new HTTP connection (1): 10.3.0.7
wal_e.blobstore.s3.calling_format INFO     MSG: connecting to S3 with WALE_S3_ENDPOINT (http://10.3.0.7:9000)
        STRUCTURED: time=2016-06-07T22:18:54.512405-00 pid=104
wal_e.blobstore.s3.calling_format INFO     MSG: connecting to S3 with WALE_S3_ENDPOINT (http://10.3.0.7:9000)
        STRUCTURED: time=2016-06-07T22:18:54.516984-00 pid=104
wal_e.blobstore.s3.calling_format INFO     MSG: connecting to S3 with WALE_S3_ENDPOINT (http://10.3.0.7:9000)
        STRUCTURED: time=2016-06-07T22:18:54.520362-00 pid=104
wal_e.blobstore.s3.calling_format INFO     MSG: connecting to S3 with WALE_S3_ENDPOINT (http://10.3.0.7:9000)
        STRUCTURED: time=2016-06-07T22:18:54.523849-00 pid=104
boto3.resources.collection INFO     Calling paginated s3:list_objects with {'Prefix': 'basebackups_005/base_000000020000000000000010_00000040/tar_partitions/', u'Bucket': 'dbwal'}
botocore.vendored.requests.packages.urllib3.connectionpool INFO     Starting new HTTP connection (1): 10.3.0.7
wal_e.worker.s3.s3_worker INFO     MSG: beginning partition download
        DETAIL: The partition being downloaded is part_00000000.tar.lzo.
        HINT: The absolute S3 key is basebackups_005/base_000000020000000000000010_00000040/tar_partitions/part_00000000.tar.lzo.
        STRUCTURED: time=2016-06-07T22:18:54.640887-00 pid=104
boto3.resources.action INFO     Calling s3:get_object with {u'Bucket': 'dbwal', u'Key': 'basebackups_005/base_000000020000000000000010_00000040/tar_partitions/part_00000000.tar.lzo'}
waiting for server to start....LOG:  database system was interrupted; last known up at 2016-06-07 22:14:09 UTC
LOG:  creating missing WAL directory "pg_xlog/archive_status"
LOG:  starting archive recovery
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-06-07T22:19:01.554035-00 pid=124
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2016-06-07T22:19:01.624515-00 pid=124 action=wal-fetch key=s3://dbwal/wal_005/00000002.history.lzo prefix= seg=00000002.history state=begin
wal_e.blobstore.s3.calling_format INFO     MSG: connecting to S3 with WALE_S3_ENDPOINT (http://10.3.0.7:9000)
        STRUCTURED: time=2016-06-07T22:19:01.725490-00 pid=124
botocore.credentials INFO     Found credentials in environment variables.
boto3.resources.action INFO     Calling s3:get_object with {u'Bucket': 'dbwal', u'Key': 'wal_005/00000002.history.lzo'}
botocore.vendored.requests.packages.urllib3.connectionpool INFO     Starting new HTTP connection (1): 10.3.0.7
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object while performing wal restore
        DETAIL: The absolute URI that could not be located is s3://dbwal/wal_005/00000002.history.lzo.
        HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
        STRUCTURED: time=2016-06-07T22:19:01.919552-00 pid=124
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2016-06-07T22:19:02.020513-00 pid=124 action=wal-fetch key=s3://dbwal/wal_005/00000002.history.lzo prefix= seg=00000002.history state=complete
FATAL:  the database system is starting up
.wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-06-07T22:19:02.285439-00 pid=133
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2016-06-07T22:19:02.360919-00 pid=133 action=wal-fetch key=s3://dbwal/wal_005/000000020000000000000010.lzo prefix= seg=000000020000000000000010 state=begin
wal_e.blobstore.s3.calling_format INFO     MSG: connecting to S3 with WALE_S3_ENDPOINT (http://10.3.0.7:9000)
        STRUCTURED: time=2016-06-07T22:19:02.462919-00 pid=133
botocore.credentials INFO     Found credentials in environment variables.
FATAL:  the database system is starting up
.boto3.resources.action INFO     Calling s3:get_object with {u'Bucket': 'dbwal', u'Key': 'wal_005/000000020000000000000010.lzo'}
botocore.vendored.requests.packages.urllib3.connectionpool INFO     Starting new HTTP connection (1): 10.3.0.7
lzop: <stdin>: not a lzop file
wal_e.blobstore.s3.s3_util WARNING  MSG: could no longer locate object while performing wal restore
        DETAIL: The absolute URI that could not be located is s3://dbwal/wal_005/000000020000000000000010.lzo.
        HINT: This can be normal when Postgres is trying to detect what timelines are available during restoration.
        STRUCTURED: time=2016-06-07T22:19:03.675052-00 pid=133
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2016-06-07T22:19:03.776365-00 pid=133 action=wal-fetch key=s3://dbwal/wal_005/000000020000000000000010.lzo prefix= seg=000000020000000000000010 state=complete
LOG:  invalid checkpoint record
FATAL:  could not locate required checkpoint record
HINT:  If you are not restoring from a backup, try removing the file "/var/lib/postgresql/data/backup_label".
LOG:  startup process (PID 122) exited with exit code 1
LOG:  aborting startup due to startup process failure
 stopped waiting
pg_ctl: could not start server
Examine the log output.
bacongobbler commented 8 years ago

This is likely fixed via https://github.com/deis/postgres/pull/112, but I have no evidence to back that up, as I'm not sure what causes this. If you have a tarball of your minio buckets, I'd be happy to try to reproduce.
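
One way to produce such a tarball, sketched here with a placeholder pod name; the data path is inferred from the minio prompt shown later in this thread (~/dbwal) and may differ in your deployment:

    # Stream the dbwal bucket out of the minio pod as a tarball.
    # deis-minio-xxxxx is a placeholder pod name; /home/minio is an assumed
    # home directory for the minio user.
    kubectl --namespace=deis exec deis-minio-xxxxx -- \
      tar czf - -C /home/minio dbwal > dbwal.tar.gz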

krancour commented 8 years ago

I've been able to reproduce this same outcome on two test clusters now: one in micro-kube and one in AWS. Both were using minio.

bacongobbler commented 8 years ago

An interesting thread on this topic suggests the server was in the middle of a backup when it was (cleanly) shut down, and the leftover backup_label file then caused the checkpoint error: https://www.postgresql.org/message-id/D960CB61B694CF459DCFB4B0128514C293CEB7@exadv11.host.magwien.gv.at

krancour commented 8 years ago

I'll see if I can tar up the minio buckets.

bacongobbler commented 8 years ago

Thanks! I'd like to see whether removing the backup_label file lets you carry on without issues.
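
A minimal sketch of that test, assuming you can still get a shell into the database container (bacongobbler does exactly that further down in this thread):

    # Remove the stale backup_label so recovery falls back to the checkpoint
    # recorded in pg_control, then let the pod restart and retry. This is the
    # workaround from the mailing-list thread linked above; only worth trying when
    # recovery keeps failing with "could not locate required checkpoint record".
    rm /var/lib/postgresql/data/backup_label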

arschles commented 8 years ago

I just reproduced the same issue on GKE with minio, using the same repro steps.

bacongobbler commented 8 years ago

This can be reliably reproduced with the following tarball and patch: place dbwal.tar.gz at contrib/ci/tmp/minio/dbwal.tar.gz, apply the patch, and follow the logs from make docker-build test (the steps are spelled out as commands after the attachments below). I am currently working on a fix, but this is a reliable starting point from which we can craft some fancy integration tests so this never happens again.

https://gist.github.com/bacongobbler/7520e558bbd8e69394b2832239f0fe73

dbwal.tar.gz
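
Roughly, as commands; the patch filename below is a placeholder for whatever you save the gist as:

    # Reproduce the failing second restore in CI, per the steps above.
    mkdir -p contrib/ci/tmp/minio
    cp dbwal.tar.gz contrib/ci/tmp/minio/dbwal.tar.gz
    git apply restore-repro.patch   # hypothetical filename for the gist's patch
    make docker-build test          # then follow the test logs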

bacongobbler commented 8 years ago

Branch to test my hunch: https://github.com/bacongobbler/postgres/tree/fix-checkpoint-error

Specifically, https://github.com/bacongobbler/postgres/commit/8ee5e4c5c14b309c1a14e80c3592aadb8c91c3aa should hopefully fix the issue.

EDIT: nope; see below.

bacongobbler commented 8 years ago

So I noticed something while looking into the backup_label file...

root@f7c8b6f65428:/var/lib/postgresql/data# cat backup_label
START WAL LOCATION: 0/10000028 (file 000000020000000000000010)
CHECKPOINT LOCATION: 0/10000028
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2016-06-07 22:14:09 UTC
LABEL: freeze_start_2016-06-07T22:14:08.397343+00:00

Notice that the start WAL location is in file 000000020000000000000010? Check out what's in minio:

minio@aa513d788d04:~/dbwal/wal_005$ ls
000000010000000000000002.lzo  000000010000000000000005.lzo  000000010000000000000008.lzo  00000001000000000000000B.lzo  00000001000000000000000E.lzo
000000010000000000000003.lzo  000000010000000000000006.lzo  000000010000000000000009.lzo  00000001000000000000000C.lzo  000000020000000000000011.lzo
000000010000000000000004.lzo  000000010000000000000007.lzo  00000001000000000000000A.lzo  00000001000000000000000D.lzo

Notice something missing? ;)

It appears that the first WAL segment is (occasionally) not being shipped to minio. That's because we don't enable archive_mode = on for the first reboot in 003_restore_from_backup.sh; with archive mode on, the server starts shipping WAL segments immediately after booting, but we only enable it in 004. This is a simple fix: we just need to enable archive_mode in 003 before we boot up the server.

Eventually this will be fixed in #112 but I'll try to hack up a quicker fix for now.
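
A rough sketch of what that quicker fix might look like in 003_restore_from_backup.sh; the exact archive_command is an assumption based on the usual WAL-E setup, not copied from this repo:

    # Enable archiving before the reboot at the top of this script, so WAL written
    # right after boot (including the segment holding the base backup's start
    # checkpoint, 000000020000000000000010 in the backup_label above) is shipped
    # to the bucket instead of being skipped.
    echo "archive_mode = on" >> "$PGDATA/postgresql.conf"
    echo "archive_command = 'envdir /etc/wal-e.d/env wal-e wal-push %p'" >> "$PGDATA/postgresql.conf"
    # The restart that 003_restore_from_backup.sh already performs picks these up.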

bacongobbler commented 8 years ago

Credit goes to https://github.com/wal-e/wal-e/issues/251 for pointing me in the right direction.