deis / postgres

A PostgreSQL database used by Deis Workflow.
https://deis.com
MIT License

deis/postgres dying halfway through restore #56

Closed · bacongobbler closed this issue 8 years ago

bacongobbler commented 8 years ago

This seems like an upstream kubernetes issue: https://github.com/kubernetes/kubernetes/issues/7891#issuecomment-191886569

I guess the stopgap solution is to remove the readiness probe and let the database handle its own health checks, or to find a way for the readiness probe to work with postgres without ending up killing the pod.
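
Something along these lines might work as the probe's exec command — just a sketch, assuming `pg_isready` ships in the image; the host, port, and user below are placeholders rather than what the chart actually uses:

```bash
#!/usr/bin/env bash
# Sketch of a health check the pod could exec for its probe. pg_isready exits 0
# once the server accepts connections and non-zero while it is still starting up
# or replaying WAL, so during recovery the check reports "not ready" rather than
# erroring out. Host, port, and user are illustrative placeholders.
exec pg_isready -h 127.0.0.1 -p 5432 -U postgres
```

The probe's timing (initialDelaySeconds / timeoutSeconds) would still need to be generous enough to cover a long WAL replay, otherwise we are back to the pod being killed mid-recovery.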

Just looks like the container is being killed off:

LOG:  database system was shut down in recovery at 2016-03-03 17:26:33 UTC
LOG:  starting archive recovery
FATAL:  the database system is starting up
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-03-03T17:26:49.210204-00 pid=363
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2016-03-03T17:26:56.443877-00 pid=363 action=wal-fetch key=s3://kents-database/wal_005/000000010000000000000003.lzo prefix= seg=000000010000000000000003 state=begin
wal_e.blobstore.s3.calling_format INFO     MSG: connecting to S3 with creds (<botocore.credentials.Credentials object at 0x7f735b0faad0>)
        STRUCTURED: time=2016-03-03T17:26:56.573421-00 pid=363
botocore.credentials INFO     Found credentials in environment variables.
FATAL:  the database system is starting up
boto3.resources.action INFO     Calling s3:get_object with {u'Bucket': 'kents-database', u'Key': 'wal_005/000000010000000000000003.lzo'}
botocore.vendored.requests.packages.urllib3.connectionpool INFO     Starting new HTTPS connection (1): kents-database.s3.amazonaws.com
wal_e.blobstore.s3.s3_util INFO     MSG: completed download and decompression
        DETAIL: Downloaded and decompressed "s3://kents-database/wal_005/000000010000000000000003.lzo" to "pg_xlog/RECOVERYXLOG"
        STRUCTURED: time=2016-03-03T17:27:12.595790-00 pid=363
wal_e.operator.backup INFO     MSG: complete wal restore
        STRUCTURED: time=2016-03-03T17:27:12.878355-00 pid=363 action=wal-fetch key=s3://kents-database/wal_005/000000010000000000000003.lzo prefix= seg=000000010000000000000003 state=complete
LOG:  restored log file "000000010000000000000003" from archive
LOG:  redo starts at 0/3000090
LOG:  consistent recovery state reached at 0/4000000
FATAL:  the database system is starting up
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-03-03T17:27:29.382059-00 pid=427
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2016-03-03T17:27:35.576663-00 pid=427 action=wal-fetch key=s3://kents-database/wal_005/000000010000000000000004.lzo prefix= seg=000000010000000000000004 state=begin
wal_e.blobstore.s3.calling_format INFO     MSG: connecting to S3 with creds (<botocore.credentials.Credentials object at 0x7f151eb1aad0>)
        STRUCTURED: time=2016-03-03T17:27:35.697055-00 pid=427
botocore.credentials INFO     Found credentials in environment variables.
LOG:  server process (PID 221) exited with exit code 2
LOG:  terminating any other active server processes
FATAL:  could not restore file "000000010000000000000004" from archive: child process was terminated by signal 3: Quit
LOG:  all server processes terminated; reinitializing
LOG:  database system was interrupted while in recovery at log time 2016-03-03 16:42:36 UTC
HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
LOG:  starting archive recovery
LOG:  startup process (PID 489) was terminated by signal 9: Killed
LOG:  aborting startup due to startup process failure
jchauncey commented 8 years ago

This looks really close to the problem I was seeing with fluentd.

krancour commented 8 years ago

@jchauncey I think this is unrelated to the fluentd problem you're referencing. I troubleshot this with @bacongobbler, and one thing I kept seeing was that if I put a long enough timeout on the boot of the postgres server that initiates the recovery process, and waited for that recovery to complete, everything was fine. Things only fall apart when we move on from that step prematurely, which seems to let something in our own initialization process fatally interrupt recovery. So in contrast to your case, here I think we know it was something in our own code causing this.
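
Roughly what I mean, as a sketch — the timeout value and the `pg_isready` call are stand-ins for whatever our init script actually does, not the real implementation:

```bash
#!/usr/bin/env bash
# Hypothetical wait-for-recovery step: block until postgres stops answering
# "the database system is starting up" before any further initialization runs.
# The 600s ceiling is an illustrative value.
TIMEOUT=600
waited=0
until pg_isready -h 127.0.0.1 -p 5432 >/dev/null 2>&1; do
  if [ "$waited" -ge "$TIMEOUT" ]; then
    echo "postgres did not finish recovery within ${TIMEOUT}s" >&2
    exit 1
  fi
  sleep 2
  waited=$((waited + 2))
done
echo "recovery complete; continuing initialization"
```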

bacongobbler commented 8 years ago

Keeping this open since we know it's still an issue.

bacongobbler commented 8 years ago

This is intermittent and only occasionally shows up. After destroying and re-creating three separate clusters (two minio, one S3) I no longer see this issue, but let's keep posting reports here so we can decide whether we need to come up with another solution.

bacongobbler commented 8 years ago

Proof that backups are flaky at the moment: https://travis-ci.org/deis/postgres/builds/122642210#L4938

bacongobbler commented 8 years ago

Closing, as this behaviour has suddenly disappeared. Will re-open if it occurs again.

monaka commented 8 years ago

I got the same failures after I applied PR #112. I'm going to inspect this.

monaka commented 8 years ago

A similar issue: https://github.com/wal-e/wal-e/issues/247

monaka commented 8 years ago

This behavior has disappeared in my environment as well, same as @bacongobbler's comment above...

monaka commented 8 years ago

I found the root cause in my case: the .lzo files were uploaded to Azure Blob Storage, but some properties on the uploaded files were broken. I'm not sure my case is the same as this issue, since it may be Azure Blob specific.
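
In case it helps anyone else hitting this, one way to spot-check a suspect segment is to round-trip it through WAL-E by hand. This is only a sketch assuming the documented envdir layout under /etc/wal-e.d/env; adjust the env handling to however your image configures WAL-E:

```bash
#!/usr/bin/env bash
# Spot-check one archived WAL segment: wal-fetch downloads and decompresses the
# .lzo object, so a clean run means the stored blob is intact. A failure here
# points at the object store (broken blob properties, in my case) rather than
# at the restore logic. The segment name is taken from the log excerpt above.
set -eo pipefail

SEGMENT=000000010000000000000003
envdir /etc/wal-e.d/env wal-e wal-fetch "$SEGMENT" "/tmp/$SEGMENT"
ls -l "/tmp/$SEGMENT"
```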