compose / governor

Runners to orchestrate a high-availability PostgreSQL
MIT License

Local Docker cluster with Governor on board #26

Closed blelump closed 8 years ago

blelump commented 8 years ago

Hi,

I'm working on a local PostgreSQL cluster for horizontal scalability. I'm experimenting with Docker (docker-compose in particular) and using @miketonks' fork, and almost everything works smoothly.

The cluster is built from a docker-compose.yml config file. When it starts for the first time, the election phase picks the master and the slaves correctly. However, when I shut down the whole cluster (etcd included) and start it again, a new master is likely to be elected from a former standby, which is fine. The problem is the old master (now a standby), which crashes with these logs:

LOG:  entering standby mode
FATAL:  requested timeline 3 is not a child of this server's history
DETAIL:  Latest checkpoint is at 0/6000028 on timeline 2, but in the history of the requested timeline, the server forked off from that timeline at 0/5014B50.
LOG:  startup process (PID 21) exited with exit code 1
LOG:  aborting startup due to startup process failure

I've read this article, which explains what happens and how to recover from it. I'm reviewing the governor.py code, looking at the if/else block, and wondering how to safely recover the old master. Specifically, what is the purpose of having the old master call follow_no_leader when its data directory already exists? Could you elaborate on that?
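
For readers following along, the DETAIL line in the log above can be checked by hand. A PostgreSQL LSN such as 0/6000028 is two hexadecimal halves (high 32 bits / low 32 bits), so the two positions can be compared as integers. The sketch below is only an illustration: parse_lsn and diverged_bytes are hypothetical helper names, not part of Governor, and the regular expression assumes the exact message wording shown in the log.

```python
import re


def parse_lsn(text):
    """Convert an LSN such as '0/6000028' into an integer byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


# Assumes the message wording from the log quoted in this thread.
DETAIL_RE = re.compile(
    r"Latest checkpoint is at (?P<checkpoint>[0-9A-F]+/[0-9A-F]+) "
    r"on timeline (?P<tli>\d+), "
    r".*forked off from that timeline at (?P<fork>[0-9A-F]+/[0-9A-F]+)"
)


def diverged_bytes(detail_line):
    """Return how far the local checkpoint lies past the fork point, in bytes.

    Returns None when the line does not match the expected message.
    """
    match = DETAIL_RE.search(detail_line)
    if match is None:
        return None
    return parse_lsn(match.group("checkpoint")) - parse_lsn(match.group("fork"))


detail = (
    "DETAIL:  Latest checkpoint is at 0/6000028 on timeline 2, but in the "
    "history of the requested timeline, the server forked off from that "
    "timeline at 0/5014B50."
)
print(diverged_bytes(detail))
```

A positive result means the old master's latest checkpoint lies past the point where the new timeline branched off, so its local history contains WAL the new master never saw. That is why it cannot simply start following timeline 3, and such a node usually has to be resynchronized (for example with pg_rewind or a fresh base backup).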

Winslett commented 8 years ago

@blelump I was head deep in some different code and just saw your request for information.

Calling follow_no_leader on a returning member lets Postgres run health checks on any member coming online before it participates in the HA process. The returning member starts as a follower, so it can request and apply any WAL and validate that it is in a consistent state before participating.
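
The startup decision described above could be sketched roughly as follows. This is a simplified illustration and not Governor's actual code: startup_action and the returned labels other than follow_no_leader are hypothetical names chosen for this example, and the empty-directory branch is an assumption about how a fresh node would be bootstrapped.

```python
def startup_action(data_dir_empty, won_init_race):
    """Pick what a node does on boot (hypothetical sketch, not Governor's API).

    data_dir_empty: True when the node has no existing PostgreSQL data.
    won_init_race:  True when this node is the one chosen to initialize
                    the cluster (assumed to be decided elsewhere, e.g. in etcd).
    """
    if data_dir_empty:
        if won_init_race:
            # Fresh cluster: this node creates the data directory and leads.
            return "initialize_and_take_leader"
        # Fresh node joining an existing cluster: copy data from the leader.
        return "sync_from_leader"
    # Returning member with existing data: start as a follower with no
    # leader configured, so it can be checked before joining the HA process.
    return "follow_no_leader"


print(startup_action(data_dir_empty=False, won_init_race=False))
```

The point of the last branch is that a node with existing data never assumes it is still the master. It comes up as a follower first and only takes part in leader election once it is known to be healthy.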