Proplex closed 6 years ago
Adjusted for feedback: I've added a bosh run-errand recover
task that makes it easier for operators to start PSQL again after a failure mode.
Updated again after further feedback: recovery is now just a script on the VM instead of a BOSH errand.
We found that the criteria for deciding which node is master were too strict. A single PSQL error caused by a transient fault (such as a connection reset) would promote the replica to master, leaving the cluster in a dual-master (split-brain) configuration.
We have now changed this so that three consistent errors are required before the replica becomes master in scenarios where the master is running but not accepting PSQL commands.
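The threshold described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the class name `PromotionGate`, its fields, and the assumption that "three consistent errors" means three failed checks in a row (with any successful check resetting the count) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PromotionGate:
    """Hypothetical helper: decides when a replica may promote itself."""

    threshold: int = 3  # failed checks in a row required before promotion
    failures: int = 0   # current run of failed checks

    def record(self, check_ok: bool) -> bool:
        """Record one health check of the master.

        Returns True only once `threshold` checks have failed in a row.
        A successful check resets the run, so a single transient error
        (for example a connection reset) never triggers promotion.
        """
        if check_ok:
            self.failures = 0
            return False
        self.failures += 1
        return self.failures >= self.threshold
```

With this shape, the monitor would call `record()` once per status check and promote the replica only when it returns `True`.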
We've also added a check for split-brain configurations. The existing status checks now also detect the case where both nodes are master. If they are, both nodes immediately shut down their postgres, haproxy, and monitor processes. This puts the VM into "failure" status in BOSH, which should be very easy to spot for anyone with a monitoring solution (e.g. Prometheus). To recover from this failure mode, see README.md, which explains the procedure step by step (it's easy).
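As a rough illustration of the split-brain check described above, the decision logic might look like the sketch below. The function names, the role strings, and the dictionary shape are assumptions made for this example; the actual check in this PR is part of the status checks and may be structured differently.

```python
from typing import Dict, List

# Processes the text above says are stopped on a split-brain.
MANAGED_PROCESSES: List[str] = ["postgres", "haproxy", "monitor"]


def is_split_brain(roles: Dict[str, str]) -> bool:
    """Return True if more than one node reports itself as master.

    `roles` maps a (hypothetical) node name to "master" or "replica".
    """
    masters = [node for node, role in roles.items() if role == "master"]
    return len(masters) > 1


def processes_to_stop(roles: Dict[str, str]) -> Dict[str, List[str]]:
    """Return the processes each master node should stop.

    The result is empty when the cluster is healthy; on a split-brain,
    every node that claims to be master stops all managed processes.
    """
    if not is_split_brain(roles):
        return {}
    return {
        node: list(MANAGED_PROCESSES)
        for node, role in roles.items()
        if role == "master"
    }
```

Keeping the decision as a pure function, separate from the code that actually stops processes, makes the behaviour easy to test without a running cluster.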