deis / postgres

A PostgreSQL database used by Deis Workflow.
https://deis.com
MIT License

Recovery doesn't go so smoothly #54

Closed krancour closed 8 years ago

krancour commented 8 years ago

@bacongobbler and I have already discussed this at length. I just want to formally capture the issue and what is known about the root cause.

I've been testing out various components in conjunction with S3 as the object store backend today. One test I've been through a number of times now includes killing the database pod to ensure its replacement is properly restored from backups in S3. For the majority of my attempts, this has not gone so smoothly.

Major kudos to @bacongobbler for patiently walking me through a backup and restore process that I didn't really understand going in.

Here are a couple problems we have identified.

  1. During recovery the postgres server shuts down, then after a recovery.conf file has been written, it is restarted with the -w (wait) option. By default, this waits 60 seconds for the database to start, then moves on. Trouble is, if the recovery doesn't complete within 60 seconds, the next script in the initialization process will proceed while recovery is still in-flight. This is problematic because this next step actually makes an initial backup of the db, which is not in a stable state at that moment.

    To work around this, it's probably best that a timeout be explicitly specified (in seconds) using the -t option here:

    https://github.com/deis/postgres/blob/master/rootfs/docker-entrypoint-initdb.d/003_restore_from_backup.sh#L47-L49

    What isn't clear is exactly how long we should wait. In my testing, I have seen restoration of my modest database sometimes exceed five minutes. Additionally, we should find a way to fail sooner if recovery doesn't complete within the allotted time, rather than allow the following initialization step to proceed and possibly corrupt the backup data (a rough sketch of both ideas follows this list).

  2. The livenessProbe sometimes also judges the database pod to be unhealthy before restoration has had a chance to complete. Under such circumstances, the RC kills the pod and recovery starts anew on the replacement pod. The livenessProbe should probably be replaced with a readinessProbe.
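
For concreteness, here is a minimal sketch of what an explicit timeout plus fail-fast behavior around that restart could look like. The RECOVERY_TIMEOUT variable and the exact pg_ctl/gosu invocation are illustrative assumptions, not what 003_restore_from_backup.sh currently contains:

```shell
# Illustrative only: give the post-recovery restart an explicit, generous timeout
# and fail the init script outright if recovery does not finish in time.
# RECOVERY_TIMEOUT is an assumed tunable, not an existing environment variable.
RECOVERY_TIMEOUT="${RECOVERY_TIMEOUT:-1800}"

# -w waits for startup; -t overrides pg_ctl's default 60-second wait.
if ! gosu postgres pg_ctl -D "$PGDATA" -w -t "$RECOVERY_TIMEOUT" restart; then
  echo "recovery did not complete within ${RECOVERY_TIMEOUT}s; aborting initialization" >&2
  exit 1
fi
```
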
krancour commented 8 years ago

Looks like https://github.com/deis/charts/pull/129 addresses the livenessProbe issue. The rest of my OP remains a concern.

bacongobbler commented 8 years ago

After a bit of testing, I've found that the first bullet point is moot. Once the database is stopped and rebooted, it notices it was halfway through a recovery and will continue replaying WAL logs from the last base backup. Only the livenessProbe is a real issue/blocker for beta.

> This is problematic because this next step actually makes an initial backup of the db, which is not in a stable state at that moment.

This actually is untrue, and I apologize for confusing you :)

Have a look at the following branch logic:

https://github.com/deis/postgres/blob/16d13629d72a512b9f99521de2064d5cf0f254b5/rootfs/docker-entrypoint-initdb.d/003_restore_from_backup.sh#L50-L53

This piece of code only runs if there were no backups in object storage. We perform a base backup only when we are first initializing the database (soon to change; see #53).
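
For anyone skimming, the gate is roughly of this shape; the variable name and the WAL-E call are illustrative assumptions, not the script's actual contents:

```shell
# Sketch only: push the initial base backup when object storage holds no prior
# backups, i.e. on first initialization of the database.
if [[ "$BACKUP_COUNT" -eq 0 ]]; then
  envdir /etc/wal-e.d/env wal-e backup-push "$PGDATA"
fi
```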

For your second bullet point, see https://github.com/deis/charts/pull/129 :)

krancour commented 8 years ago

@bacongobbler does this script not run next and start backing up the DB immediately upon next reboot? (Even if it's only partially restored at that time?)

https://github.com/deis/postgres/blob/16d13629d72a512b9f99521de2064d5cf0f254b5/rootfs/docker-entrypoint-initdb.d/004_setup_backup_restore.sh

This was my actual concern... I probably did not articulate that well enough.

bacongobbler commented 8 years ago

Nope! Even with archive_command and restore_command set, postgres is smart enough to know it should boot in recovery mode first; once it has replayed all the WAL logs, it boots normally and then resumes shipping backups.

From http://www.postgresql.org/docs/9.1/static/continuous-archiving.html:

> Start the server. The server will go into recovery mode and proceed to read through the archived WAL files it needs. Should the recovery be terminated because of an external error, the server can simply be restarted and it will continue recovery. Upon completion of the recovery process, the server will rename recovery.conf to recovery.done (to prevent accidentally re-entering recovery mode later) and then commence normal database operations.
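
Put differently, the presence of recovery.conf is what gates this: postgres replays the archived WAL first and only resumes normal operation (and WAL shipping) afterwards. Here is a sketch of the kind of recovery.conf the restore step writes, where the WAL-E restore_command shown is an assumption about the tooling rather than the script's exact command:

```shell
# Illustrative only: write a minimal recovery.conf before restarting postgres.
cat > "$PGDATA/recovery.conf" <<'EOF'
restore_command = 'envdir /etc/wal-e.d/env wal-e wal-fetch "%f" "%p"'
EOF

# On startup postgres sees recovery.conf, enters recovery mode, replays the
# archived WAL, then renames the file to recovery.done and begins normal
# operation; only then does archive_command resume shipping WAL.
```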

krancour commented 8 years ago

Ah. Well that's good then! I'm going to play with this a bit more first thing tomorrow, chiefly for my own edification. Thanks for all the help!

bacongobbler commented 8 years ago

I'm going to close this since both points have now been addressed, but please feel free to re-open if your testing shows otherwise. :)

krancour commented 8 years ago

Thanks again for helping me through this.