fix(docker-entrypoint-initdb): bump wait timeout

deis / postgres

A PostgreSQL database used by Deis Workflow.

https://deis.com

MIT License

fix(docker-entrypoint-initdb): bump wait timeout #59

Closed: bacongobbler closed this 8 years ago

bacongobbler commented 8 years ago

The default wait time is 1 minute, which usually isn't enough time for a recovery to finish. Bumping to 20 minutes seems to alleviate the problem. We're not entirely sure why.

closes #56
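
For readers following along, a bump like the one described above might look roughly like this in an init script. This is only a sketch, not the actual diff: it assumes the script starts the server with `pg_ctl -w` and that `PGDATA` points at the data directory. `WAIT_TIMEOUT_SECONDS` and `start_postgres` are hypothetical names used for illustration.

```shell
# Hypothetical variable: how long pg_ctl should wait for startup, in seconds.
# 1200 seconds = 20 minutes; pg_ctl's own default is 60 seconds.
WAIT_TIMEOUT_SECONDS="${WAIT_TIMEOUT_SECONDS:-1200}"

# Hypothetical helper: start the server and block until it accepts
# connections or the timeout expires.
#   -w  wait for startup to complete
#   -t  maximum number of seconds to wait
start_postgres() {
  pg_ctl -D "$PGDATA" -w -t "$WAIT_TIMEOUT_SECONDS" start
}
```

Keeping the timeout in a variable would make it possible to raise it further for very large WAL archives without editing the script.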

bacongobbler commented 8 years ago

I tested this manually with minio to verify:

helm install deis-dev
deis register http://deis.$DEIS_TEST_DOMAIN --username bacongobbler --password pass --email mfisher@deis.com
deis keys:add ~/.ssh/id_rsa.pub
for i in $(seq 1 100); do deis create foo-$i --no-remote &>/dev/null; done
kubectl --namespace=deis delete pod deis-database-asdf12
# wait a bit for the database and workflow to come back up
deis apps | grep -v "=== Apps" | wc -l
100

jchauncey commented 8 years ago

Seeing the following:

        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-03-08T03:21:32.782932-00 pid=5109
FATAL:  the database system is starting up
.wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
        STRUCTURED: time=2016-03-08T03:21:33.024814-00 pid=5109 action=wal-fetch key=s3://dbwal/wal_005/0000000100000000000000F6.lzo prefix= seg=0000000100000000000000F6
LOG:  restored log file "0000000100000000000000F6" from archive
FATAL:  the database system is starting up
.wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-03-08T03:21:34.110124-00 pid=5126
wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
        STRUCTURED: time=2016-03-08T03:21:34.393374-00 pid=5126 action=wal-fetch key=s3://dbwal/wal_005/0000000100000000000000F7.lzo prefix= seg=0000000100000000000000F7
LOG:  restored log file "0000000100000000000000F7" from archive
FATAL:  the database system is starting up

bacongobbler commented 8 years ago

@jchauncey that is actually an expected (but not really) error. -w makes us wait for the database to start up. While the database is in recovery it is not considered ready to accept connections, so each connection attempt is logged as FATAL, even though in our case nothing has actually failed. I'll have to fix that up in the future.
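
One way to make the "still recovering" case explicit is to poll for readiness under an overall deadline instead of relying on the FATAL lines. The following is a minimal sketch assuming a POSIX shell. `wait_until` is a hypothetical helper, not something from this repository, and `pg_isready` is shown only as one possible probe.

```shell
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Re-runs COMMAND every INTERVAL seconds until it succeeds.
# Gives up once roughly TIMEOUT seconds of sleeping have accumulated.
# Returns 0 if COMMAND succeeded, 1 on timeout.
wait_until() {
  wu_timeout="$1"
  wu_interval="$2"
  shift 2
  wu_elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
      return 1
    fi
    sleep "$wu_interval"
    wu_elapsed=$((wu_elapsed + wu_interval))
  done
  return 0
}

# Example (not run here): wait up to 20 minutes, probing every 5 seconds.
# pg_isready exits non-zero while the server is still starting up.
# wait_until 1200 5 pg_isready -q
```

The elapsed time only counts the sleeps, so the real deadline is slightly longer than `TIMEOUT` when the probe itself is slow.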

bacongobbler commented 8 years ago

Just give it a little longer to recover :)

jchauncey commented 8 years ago

This has been going for over an hour and I'm still getting that error.

bacongobbler commented 8 years ago

Okay so then this does not fix #56.

krancour commented 8 years ago

I know @jchauncey said he's still having trouble with this, but I've manually tested this with good results. That being said, long-running recoveries could still be problematic. See #53, which, when fixed, would help alleviate that a bit. I think even if this is not perfect, it's such a significant improvement that it LGTM.

jchauncey commented 8 years ago

OK, so this eventually recovered during the night. It definitely took a while.