bacongobbler closed this 8 years ago
I tested this manually with minio to verify:
helm install deis-dev
deis register http://deis.$DEIS_TEST_DOMAIN --username bacongobbler --password pass --email mfisher@deis.com
deis keys:add ~/.ssh/id_rsa.pub
for i in $(seq 1 100); do deis create foo-$i --no-remote &>/dev/null; done
kubectl --namespace=deis delete pod deis-database-asdf12
# wait a bit for the database and workflow to come back up
deis apps | grep -v "=== Apps" | wc -l
100
Seeing the following:
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2016-03-08T03:21:32.782932-00 pid=5109
FATAL: the database system is starting up
.wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2016-03-08T03:21:33.024814-00 pid=5109 action=wal-fetch key=s3://dbwal/wal_005/0000000100000000000000F6.lzo prefix= seg=0000000100000000000000F6
LOG: restored log file "0000000100000000000000F6" from archive
FATAL: the database system is starting up
.wal_e.main INFO MSG: starting WAL-E
DETAIL: The subcommand is "wal-fetch".
STRUCTURED: time=2016-03-08T03:21:34.110124-00 pid=5126
wal_e.operator.backup INFO MSG: promoted prefetched wal segment
STRUCTURED: time=2016-03-08T03:21:34.393374-00 pid=5126 action=wal-fetch key=s3://dbwal/wal_005/0000000100000000000000F7.lzo prefix= seg=0000000100000000000000F7
LOG: restored log file "0000000100000000000000F7" from archive
FATAL: the database system is starting up
@jchauncey that is actually an expected (but not really) error. -w makes us wait for the database to start up. While it is in recovery, the database is not considered ready to accept connections, so it outputs those errors as FATAL, even though in our case they aren't fatal. I'll have to fix that up in the future.
Just give it a little longer to recover :)
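For anyone who would rather poll for the end of recovery than watch the FATAL lines scroll by, here is a minimal sketch of a wait loop. It is not what this PR does: `wait_until` is a hypothetical helper, and using `pg_isready` as the probe assumes that tool is available in the database image.

```shell
#!/bin/sh
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until it succeeds or roughly
# TIMEOUT_SECONDS have elapsed. Returns 0 on success, 1 on timeout.
wait_until() {
  wu_timeout=$1
  wu_interval=$2
  shift 2
  wu_elapsed=0
  while ! "$@" >/dev/null 2>&1; do
    # Give up once the time budget is spent.
    if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
      return 1
    fi
    sleep "$wu_interval"
    wu_elapsed=$((wu_elapsed + wu_interval))
  done
  return 0
}

# Example (assumes pg_isready is installed; it exits non-zero while the
# server is still starting up and rejecting connections):
#   wait_until 1200 5 pg_isready -q && echo "database is accepting connections"
```

The probe is passed in as a command, so the same loop can wrap whatever readiness check the image actually has.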
This has been going for over an hour and I'm still getting that error.
Okay so then this does not fix #56.
I know @jchauncey said he's still having trouble with this, but I've manually tested this with good results. That being said, long-running recoveries could still be problematic. See #53, which, when fixed, would help alleviate that a bit. I think even if this is not perfect, it's such a significant improvement that it LGTM.
OK, so this eventually recovered during the night. It definitely took a while.
The default wait time is 1 minute, which usually isn't enough time for a recovery to finish. Bumping to 20 minutes seems to alleviate the problem. We're not entirely sure why.
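As a rough illustration of the bump, here is a sketch that assumes the container starts PostgreSQL through `pg_ctl`, whose `-t` option takes the wait in seconds. The `PG_WAIT_MINUTES` variable, the helper names, and the fallback data directory are all hypothetical and are not taken from the image.

```shell
#!/bin/sh
# Convert a wait expressed in minutes to the seconds that pg_ctl -t expects.
# Defaults to the 20 minutes mentioned above when no argument is given.
wait_seconds() {
  echo $(( ${1:-20} * 60 ))
}

# Print the start command rather than running it, so it can be inspected.
# The fallback data directory is an assumption made for this sketch.
start_command() {
  echo "pg_ctl -D ${PGDATA:-/var/lib/postgresql/data} -w -t $(wait_seconds "$1") start"
}

# PG_WAIT_MINUTES is a hypothetical knob, not an option of the image.
start_command "${PG_WAIT_MINUTES:-20}"
```

Printing the command keeps the sketch side-effect free; replacing the `echo` in `start_command` with the command itself would actually start the server.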
closes #56