deis / postgres

A PostgreSQL database used by Deis Workflow.
https://deis.com
MIT License

Rebooting loop after being recovered by WAL-E #123

Closed · monaka closed this 8 years ago

monaka commented 8 years ago

The deis-database on my cluster was in a reboot loop. I'm not sure of the cause yet, because it seems to have started while I was asleep.

wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "wal-fetch".
        STRUCTURED: time=2016-06-27T21:09:38.279511-00 pid=4212
wal_e.operator.backup INFO     MSG: promoted prefetched wal segment
        STRUCTURED: time=2016-06-27T21:09:40.429363-00 pid=4212 action=wal-fetch key=wabs://database/wal_005/0000000500000004000000EB.lzo prefix= seg=0000000500000004000000EB
LOG:  restored log file "0000000500000004000000EB" from archive
wal_e.main   INFO     MSG: starting WAL-E
        DETAIL: The subcommand is "backup-push".
        STRUCTURED: time=2016-06-27T21:09:45.837498-00 pid=4223
psql: FATAL:  the database system is starting up
FATAL:  the database system is starting up
wal_e.main   ERROR    MSG: could not run one or more external programs WAL-E depends upon
        DETAIL: Could not get a connection to the database: note that superuser access
        is required
        Could not run the following programs, are they installed? psql
        STRUCTURED: time=2016-06-27T21:09:47.631257-00 pid=4223
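
For reference, something like the following can be used to confirm the loop and pull logs from the crashed container (standard kubectl; the pod name below is a placeholder and Workflow is assumed to be installed in the deis namespace):

kubectl get pods --namespace=deis                              # restart count on the database pod
kubectl describe pod deis-database-xxxxx --namespace=deis      # recent events and last state
kubectl logs --previous deis-database-xxxxx --namespace=deis   # logs from the crashed container
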
bacongobbler commented 8 years ago

Could not run the following programs, are they installed? psql

Are you running v2.0.0? That error indicates psql is not installed in your container.
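
A quick way to check is to exec into the running container and look for the binary (the pod name here is a placeholder, and Workflow is assumed to be in the deis namespace):

kubectl exec --namespace=deis deis-database-xxxxx -- which psql      # if which is missing, try: command -v psql
kubectl exec --namespace=deis deis-database-xxxxx -- psql --version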

monaka commented 8 years ago

@bacongobbler Yes, v2.0.0.

It was fixed after I deleted the pod with kubectl delete po deis-database .... So the container image does have psql. It's not impossible that someone deleted psql inside the instance... but that's hard to imagine.

I thought it was caused by memory exhaustion, but the output of free on the node seems healthy...

             total       used       free     shared    buffers     cached
Mem:       7139592    7067840      71752       8384     148344    4530472
-/+ buffers/cache:    2389024    4750568
Swap:      8388604     367704    8020900
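
free only shows the state at the moment it runs; if the kernel OOM killer had fired earlier, it should still be visible in the kernel log on the node. A rough check (generic Linux, nothing Deis-specific):

dmesg | grep -i -E 'out of memory|killed process'
journalctl -k | grep -i oom    # on systemd-based nodes
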
bacongobbler commented 8 years ago

Okay, so if you're on v2.0.0, then the other reason this error would pop up is if WAL-E could not get a connection to the database, as it says in the logs. Since the previous logs say

psql: FATAL: the database system is starting up

I would assume that is your issue, and that the database took an abnormally long time to boot. Once it was restarted it restored faster (likely connection issues to Azure?).

I've got a work-in-progress that removes the wait timeout, which is the likely cause for this issue. https://github.com/deis/postgres/pull/112
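
The idea, roughly (just a sketch of the approach, not the actual diff in #112): instead of the startup script giving up after a fixed number of attempts while recovery is still replaying WAL, keep polling until PostgreSQL actually accepts connections.

# poll until the server is ready instead of failing after a fixed timeout
until pg_isready --host=localhost --port=5432; do
    echo "postgres is still starting up / recovering; waiting..."
    sleep 1
done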

monaka commented 8 years ago

I see. I'll try the canary build after #112 is merged, and I'll also collect more information if this issue reproduces.

monaka commented 8 years ago

BTW, some people using Kubernetes on Azure may have DNS-related issues.

It would make sense if the issues I posted recently were specific to DNS on Azure. Tests (lightweight usage) may pass, while production usage may fail.
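
If DNS is the suspect, one way to check resolution from inside the cluster is the usual busybox test (flags may vary by kubectl version; the storage hostname below is only an example):

kubectl run -i --tty dnstest --image=busybox --restart=Never -- sh
# inside the pod:
nslookup kubernetes.default
nslookup <your-storage-account>.blob.core.windows.net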

monaka commented 8 years ago

I tried builds based on #112, and they seem to resolve this issue. Even though the CI tests fail randomly, it works well in my Workflow.

monaka commented 8 years ago

(I know PR #112 is WIP and it is going to fix this issue in the near future.)

I had some confusion about my canary images, so let me retract the comment above.
But I'm still having trouble in this area and am investigating...

monaka commented 8 years ago

Even though I don't have definite evidence, my guess is that it happens when WAL-E and psql are executed at the same time (perhaps the psql invoked inside WAL-E and a psql invoked outside of WAL-E).

The deis/database container runs psql periodically. I think there is no need to run psql while the database is in recovery mode. Does my guess sound reasonable?
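
I don't know exactly how the periodic check is implemented in this image, but the kind of guard I mean would look roughly like this (a sketch only; on PostgreSQL 9.x, recovery.conf exists until recovery finishes and is then renamed to recovery.done):

# skip the periodic psql check while the server is still in recovery
if [ -f "$PGDATA/recovery.conf" ]; then
    echo "database is recovering; skipping periodic psql check"
    exit 0
fi
psql -U postgres -c 'SELECT 1;'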

monaka commented 8 years ago

Additional information:

Recovery failures ending in SIGQUIT seem to decrease after upgrading the spec of the node running SkyDNS (not the node running deis/database). In my case, I went from an Azure D2_v2 (2 cores / 7 GB RAM) to an F4 (4 cores / 8 GB RAM).

That said, upgrading the node spec is not a silver bullet: I still get random terminations by signal 3 (SIGQUIT) from WAL-E. But it's a step forward.
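
In case it helps others correlate the SIGQUITs with DNS load, the node hosting the DNS pod can be located with plain kubectl and then watched while a backup-push or wal-fetch runs (nothing Deis-specific here):

kubectl get pods --namespace=kube-system -o wide | grep dns    # shows which node runs kube-dns/SkyDNS
# then on that node, during a backup:
free -m
uptime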

(BTW, I'm curious: according to the official documentation, two 2-core nodes should be enough to run Deis Workflow, but my cluster requires more than that. My nodes are not in production; only a few sample apps are running. Why does the cluster need more power...)

monaka commented 8 years ago

At least on my Deis Workflow, this issue was resolved by #112. I guess this can be closed after #112 is merged.

bacongobbler commented 8 years ago

This should be resolved via #137. If it isn't, please re-open the issue at wal-e/wal-e. Thanks!