This looks really close to the problem I was seeing with fluentd
@jchauncey I think this is unrelated to the fluentd problem you're referencing. I troubleshot this with @bacongobbler, and one thing I kept seeing was that if I put a long enough timeout on the boot of the postgres server that initiates the recovery process and wait until it completes, everything is cool. Things only fall apart when we move on from that step prematurely, which seems to result in something within our own initialization process fatally interrupting recovery. So in contrast to your case, I think we know it was something in our own code causing this one.
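For reference, this is roughly the shape of the wait we tested (a minimal sketch only; it assumes psql is available in the container, and the user/host/interval are illustrative, not what the image actually uses):

```bash
# Block until WAL replay has finished before continuing with our own
# initialization. pg_is_in_recovery() returns 't' while the server is
# still replaying archived WAL, and the query fails outright if the
# server is not yet accepting connections -- either way we keep waiting.
until [[ "$(psql -U postgres -h 127.0.0.1 -tAc 'SELECT pg_is_in_recovery()')" == "f" ]]; do
  echo "recovery still in progress; waiting..."
  sleep 2
done
echo "recovery complete; continuing initialization"
```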
Keeping this open since we know it's still an issue.
This is intermittent and only shows up occasionally; after destroying and re-creating three separate clusters (two Minio, one S3) I don't see this issue. However, let's keep posting on this issue to see if we need to come up with another solution.
proof that backups are flaky at the moment: https://travis-ci.org/deis/postgres/builds/122642210#L4938
Closing, as this behaviour has suddenly disappeared. Will re-open if it occurs again.
I got the same failures after applying PR #112. I'm going to investigate.
A similar issue: https://github.com/wal-e/wal-e/issues/247
This behavior has disappeared in my environment too, same as @bacongobbler's comment above...
I found the root cause in my case: .lzo files were uploaded to Azure Blob, but some properties on certain files were broken. I'm not sure whether my case is the same as this issue, since it may be Azure Blob specific.
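For anyone hitting the same thing, this is roughly how the broken segments can be spotted (a sketch using the az CLI; the account and container names are placeholders, and auth flags are omitted):

```bash
# List the wal-e backup blobs and their sizes so truncated or
# corrupted .lzo segments stand out (e.g. unexpectedly zero length).
az storage blob list \
  --account-name <storage-account> \
  --container-name <wal-e-container> \
  --query "[?ends_with(name, '.lzo')].{name:name, size:properties.contentLength}" \
  --output table
```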
This seems like an upstream kubernetes issue: https://github.com/kubernetes/kubernetes/issues/7891#issuecomment-191886569
I guess the crutch solution is to remove the readiness probe and let the database handle its own health checks, or find a way for readinessProbes to work with postgres that won't end up killing the pod. It just looks like the container is being killed off:
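One possible shape for the second option (a sketch only, not what the chart actually ships; pg_isready and the recovery.conf location are assumptions about the image layout) would be an exec probe script that doesn't fail while WAL replay is still running:

```bash
#!/usr/bin/env bash
# Hypothetical exec probe for the postgres container: succeed if the
# server accepts connections, and also while archive recovery is in
# progress, so the probe doesn't start failing before WAL replay
# finishes.
if pg_isready -h 127.0.0.1 -p 5432 >/dev/null 2>&1; then
  exit 0
fi
# recovery.conf is present while wal-e archive recovery is running
# (postgres < 12); treat that state as healthy rather than failing.
if [[ -f "${PGDATA:-/var/lib/postgresql/data}/recovery.conf" ]]; then
  exit 0
fi
exit 1
```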