compose / governor

Runners to orchestrate a high-availability PostgreSQL
MIT License

rewind ex-leader before joining again #38

Open debackerl opened 8 years ago

debackerl commented 8 years ago

Hello,

I'm no pg expert, but I'm a bit surprised you don't try to rewind an ex-leader that is trying to rejoin the cluster. That is the use case pg 9.5 introduced the pg_rewind command for. Before that you basically had to take a fresh copy with pg_basebackup.

The problem, as I understand it, is this: the leader gets disconnected from the cluster before it has had time to replicate its most recent pages, and a secondary takes over the role of leader. The previous leader then can't rejoin unless it "rewinds" its most recent pages, because the ex-leader and the new leader have diverged since their last common point in time.
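
To make the divergence concrete, here is a toy Python model. It is only a sketch: the lists of strings stand in for WAL records, and the names `last_common_point` and `rewind` are made up for illustration, not anything in governor or PostgreSQL.

```python
from typing import List, Tuple


def last_common_point(ex_leader: List[str], new_leader: List[str]) -> int:
    """Return how many leading records the two histories share."""
    shared = 0
    for old, new in zip(ex_leader, new_leader):
        if old != new:
            break
        shared += 1
    return shared


def rewind(ex_leader: List[str], new_leader: List[str]) -> Tuple[List[str], List[str]]:
    """Toy 'rewind': adopt the new leader's history.

    Returns the history the ex-leader ends up with, plus the records it
    had to throw away because they were never replicated.
    """
    point = last_common_point(ex_leader, new_leader)
    discarded = ex_leader[point:]
    return list(new_leader), discarded


# Records w1..w3 were replicated; the fourth never left the old leader.
ex_leader = ["w1", "w2", "w3", "w4-unreplicated"]
# The promoted secondary accepted a different write after w3.
new_leader = ["w1", "w2", "w3", "w4-after-promotion"]

history, lost = rewind(ex_leader, new_leader)
print("diverged after record", last_common_point(ex_leader, new_leader))
print("discarded on the ex-leader:", lost)
```

In this model the ex-leader can only follow the new leader again once its own fourth record is gone.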

What's your take on this?

Thank you, Laurent Debacker

Winslett commented 8 years ago

This is a scenario this template chose not to address. pg_rewind is effectively deleting data. Every HA deployment has a different best-outcome for the times when pg_rewind would be used. Some want to save the data from the mis-fork, some are okay with overwriting the errors, some need STONITH more than pg_rewind.

In certain scenarios, I would argue that it's better for a database to be down than to have mis-forked data.

debackerl commented 8 years ago

Thank you for your fast response!

Even if you STONITH the ex-leader, with async replication you may still have several megabytes in the replication queue when the ex-leader is finally shot. You will have no issue using the new leader, but when you want to "recycle" the ex-leader and bring it back online, I have read that you must either wipe its data directory clean or use pg_rewind; otherwise there would be inconsistencies.
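
As a rough sketch of those two options, here is how the command lines might be put together in Python. The helper names, the data directory and the connection string are hypothetical placeholders; governor does not contain these functions. The flags themselves are the documented ones for pg_rewind and pg_basebackup.

```python
from typing import List


def rewind_command(data_dir: str, source_conninfo: str, dry_run: bool = True) -> List[str]:
    """Build a pg_rewind invocation for a cleanly shut down ex-leader.

    pg_rewind also needs wal_log_hints=on or data checksums on the target.
    """
    cmd = [
        "pg_rewind",
        "--target-pgdata", data_dir,
        "--source-server", source_conninfo,
        "--progress",
    ]
    if dry_run:
        # Report what would be done without modifying the data directory.
        cmd.append("--dry-run")
    return cmd


def basebackup_command(data_dir: str, source_conninfo: str) -> List[str]:
    """Build a pg_basebackup invocation; data_dir must be empty beforehand."""
    return [
        "pg_basebackup",
        "-D", data_dir,         # target data directory
        "-d", source_conninfo,  # connection string of the new leader
        "-X", "stream",         # stream WAL while the backup runs
    ]


# Hypothetical values, for illustration only.
DATA_DIR = "/var/lib/postgresql/data"
NEW_LEADER = "host=new-leader port=5432 user=replicator dbname=postgres"

print(" ".join(rewind_command(DATA_DIR, NEW_LEADER)))
print(" ".join(basebackup_command(DATA_DIR, NEW_LEADER)))
```

Either list could be handed to `subprocess.run(cmd, check=True)` once the ex-leader's PostgreSQL has been stopped. The dry run is the default here because, as noted above, a real rewind discards whatever the ex-leader never replicated.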

I guess you don't need pg_rewind if you use sync replication, because you know the secondary server will never lag behind.