Allow one-off PSQL errors during checks, check for split-brain

Proplex commented 6 years ago

We found that the parameters for checking which node was master were too strict. A single PSQL error from a transient condition (such as a connection reset) would promote the replica to master, leaving a dual-master (split-brain) configuration.

We have changed this so that three consecutive errors are required before the replica becomes master in scenarios where the master is running but not accepting PSQL commands.
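A minimal sketch of the "three errors in a row" rule, written in Python for illustration only. The release's actual monitor script is not shown in this thread, so the class name `FailureTracker` and its `record` method are hypothetical; only the threshold of three comes from the description above.

```python
class FailureTracker:
    """Counts consecutive failed master checks (hypothetical helper)."""

    def __init__(self, threshold: int = 3) -> None:
        # Threshold of three is taken from the PR description.
        self.threshold = threshold
        self.failures = 0

    def record(self, check_succeeded: bool) -> bool:
        """Record one check result; return True when promotion is warranted."""
        if check_succeeded:
            # Any successful check resets the streak, so a one-off
            # error (e.g. a connection reset) never triggers promotion.
            self.failures = 0
            return False
        self.failures += 1
        return self.failures >= self.threshold
```

With this shape, a failure followed by a success followed by two more failures does not promote the replica; only an unbroken run of three failures does.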

We've also added a check for split-brain configurations, piggy-backed on the existing status checks: each check now also detects the case where both nodes are master. If they are, both nodes immediately shut down their postgres, haproxy, and monitor processes. This sets the VM to "failure" status in BOSH, which should be easy to spot for anyone with a monitoring solution (e.g. Prometheus). To recover from this failure mode, follow the step-by-step instructions in README.md.
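The split-brain decision can be sketched as follows, again in Python and only as an illustration. It assumes each node's role is derived from PostgreSQL's `pg_is_in_recovery()` (which returns false on a node that is not a standby); the thread does not say how the release actually determines the role, and the function names here are hypothetical. The process list is the one named above.

```python
# Processes named in the PR description as being shut down on split-brain.
PROCESSES_TO_STOP = ("postgres", "haproxy", "monitor")


def is_master(in_recovery: bool) -> bool:
    # Assumption: a node that is not in recovery is acting as master.
    return not in_recovery


def processes_to_stop(local_in_recovery: bool, peer_in_recovery: bool) -> tuple:
    """Return the processes to shut down, or an empty tuple if healthy."""
    if is_master(local_in_recovery) and is_master(peer_in_recovery):
        # Both nodes believe they are master: split-brain.
        return PROCESSES_TO_STOP
    return ()
```

Because both nodes run the same check, each one reaches the same conclusion independently and stops its own processes.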

Proplex commented 6 years ago

Adjusted per feedback: I've added a bosh run-errand recover task that makes it easier for operators to start PSQL again after this failure mode.

Proplex commented 6 years ago

Updated again per feedback: recovery is now a script on the VM rather than a BOSH errand.

cloudfoundry-community / postgres-boshrelease #26