EnterpriseDB / repmgr

A lightweight replication manager for PostgreSQL (Postgres)
https://repmgr.org/

Who is the master? #259

Closed: kitallis closed this issue 1 month ago

kitallis commented 7 years ago

I've been working on a project, https://github.com/staples-sparx/repmgr-to-zk, which posts the current master status to ZooKeeper/Kafka so that HAProxy or the application can read the current master and refresh its connections. It's similar to https://github.com/compose/governor, except that it relies on repmgr for cluster management and failovers.

I'm parsing the output of the repmgr cluster show command (from all nodes) to determine the correct master, but I'm not sure that is the most up-to-date source of truth. I'm also not entirely sure why the CSV output doesn't include the master information.

What is the best way to find out the latest master from repmgr?

saule1508 commented 7 years ago

I have the same issue. Ideally I want to rely on the repl_nodes table, checking for the record with type 'master' and active = 't' and assuming there is only one such record. If that is not possible, then I think I will loop over the nodes and check the result of pg_is_in_recovery() on each.
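
A minimal sketch of that approach in shell, assuming psql access from one place to every node; the node list, connection settings and repmgr schema name (repmgr 3.x stores repl_nodes in a repmgr_<cluster_name> schema) are placeholders:

```bash
#!/usr/bin/env bash
# Sketch only: find the current master, first from repmgr's metadata,
# then by falling back to pg_is_in_recovery() on every node.
# Node list, connection settings and the repmgr schema name are
# placeholders -- adjust them for your cluster.

PGUSER=repmgr
PGDATABASE=repmgr
NODES="node1 node2 node3"
REPMGR_SCHEMA=repmgr_mycluster    # repmgr 3.x: repmgr_<cluster_name>

# 1) Ask one reachable node's metadata for the single active master record.
master=$(psql -h node1 -U "$PGUSER" -d "$PGDATABASE" -tA \
  -c "SELECT name FROM ${REPMGR_SCHEMA}.repl_nodes WHERE type = 'master' AND active = 't'")

# 2) Fall back to probing each node directly.
if [ -z "$master" ]; then
  for host in $NODES; do
    in_recovery=$(psql -h "$host" -U "$PGUSER" -d "$PGDATABASE" -tA \
      -c "SELECT pg_is_in_recovery()" 2>/dev/null)
    if [ "$in_recovery" = "f" ]; then
      master="$host"
      break
    fi
  done
fi

echo "current master: ${master:-unknown}"
```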

I want to add some kind of protection against split-brain (i.e. prevent the situation where two nodes are open read-write). My idea so far is to add an ExecStartPre= to the postgres unit file that would (1) start the database with pg_ctl, (2) check pg_is_in_recovery(): if it returns true it is OK to start, if false then look in the repl_nodes table for a node that is already master and active, and if one exists it is not OK, then (3) stop the database with pg_ctl and let the unit continue or abort depending on the outcome of (2).
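
A rough sketch of such an ExecStartPre= guard, under the same assumptions about paths, credentials, node names and schema name; returning non-zero makes systemd abort the unit so PostgreSQL stays down:

```bash
#!/usr/bin/env bash
# Sketch of a split-brain guard meant for ExecStartPre= in the postgresql
# unit: start the instance temporarily, refuse the real start if it would
# come up read-write while another active master is already registered.
# Paths, node names, credentials and the schema name are placeholders.

PGDATA=/var/lib/pgsql/data
NODE_NAME=node1                      # this node's name in repl_nodes
PEER=node2                           # any other cluster member to ask
REPMGR_SCHEMA=repmgr_mycluster       # repmgr 3.x: repmgr_<cluster_name>

pg_ctl -D "$PGDATA" -w start

in_recovery=$(psql -U repmgr -d repmgr -tA -c "SELECT pg_is_in_recovery()")

ok=0
if [ "$in_recovery" = "f" ]; then
  # We would come up as a master: refuse if another node is already
  # registered as an active master.
  others=$(psql -h "$PEER" -U repmgr -d repmgr -tA -c \
    "SELECT count(*) FROM ${REPMGR_SCHEMA}.repl_nodes
     WHERE type = 'master' AND active = 't' AND name <> '${NODE_NAME}'")
  [ "${others:-0}" -gt 0 ] && ok=1
fi

pg_ctl -D "$PGDATA" -w stop

exit $ok    # non-zero aborts the unit, so PostgreSQL stays down
```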

Maybe that is something repmgrd could do? When it sees a possible split-brain, it could kill one of the masters?

abronan commented 6 years ago

You can't really rely on the output of repmgr cluster show as far as I'm aware, so your best bet is to use ZooKeeper's distributed locking to implement leader election and trigger repmgr commands according to each node's status, so that the nodes match the current cluster state. This can be done, for example, with an external daemon that connects to the distributed store, monitors the local PostgreSQL instance, and executes the appropriate repmgr commands. That is pretty much what the Governor/Patroni projects do, using an etcd/Consul/ZooKeeper lock and then configuring the replication.
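
To illustrate the repmgr side of that design (the lock handling itself would live in whatever ZooKeeper/etcd client library the daemon uses), the daemon essentially boils down to choosing between two repmgr commands; holds_leader_lock below is a hypothetical check against the store:

```bash
# Sketch only: what the external daemon would run once leader election has
# been decided. holds_leader_lock is a hypothetical function standing in
# for the real check against ZooKeeper/etcd; /etc/repmgr.conf is a placeholder.

if holds_leader_lock; then
  # This node won the election: promote it. Error handling and
  # "am I already the primary?" checks are omitted here.
  repmgr -f /etc/repmgr.conf standby promote
else
  # Another node holds the lock: make this node follow the new primary.
  repmgr -f /etc/repmgr.conf standby follow
fi
```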

This also means you are "safe" (ignoring special cases and scenarios) in the event of a split brain, because a leader that is partitioned away from the store will bail out and give up leadership, since it can no longer update its key and the TTL expires. Your application should also watch the leader key to pick up the new leader and refresh its connections (or, if a daemon is monitoring the leader key, update the HAProxy configuration).

This diagram from the Patroni project really helps in understanding the whole process.

gplv2 commented 6 years ago

See https://github.com/gplv2/haproxy-postgresql for HAProxy pg_is_in_recovery() checks. When you have two nodes that are not standbys but masters in their own right, this HAProxy configuration will cut connections to all machines, because both hosts are marked as good and you only want one. That situation means your master and standby are running standalone and are no longer replicating.
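
For reference, the usual shape of such a check (a sketch, not necessarily the exact script from that repository) is a small responder that HAProxy polls via option httpchk, served through xinetd or a systemd socket unit; connection settings are placeholders:

```bash
#!/usr/bin/env bash
# Sketch of a pg_is_in_recovery() health check of the kind HAProxy's
# "option httpchk" can poll (served via xinetd or a systemd socket unit).
# Connection settings and credentials are placeholders.

state=$(psql -h 127.0.0.1 -U repmgr -d repmgr -tA \
  -c "SELECT pg_is_in_recovery()" 2>/dev/null)

if [ "$state" = "f" ]; then
  # Primary (read-write): report healthy so HAProxy routes traffic here.
  printf 'HTTP/1.1 200 OK\r\n\r\nprimary\r\n'
else
  # Standby, still recovering, or unreachable: report unavailable.
  printf 'HTTP/1.1 503 Service Unavailable\r\n\r\nstandby\r\n'
fi
```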

I too wish I could poll this information from repmgr instead, to be sure. The problem with using this HAProxy configuration is that you cannot auto-start your databases. Not auto-starting gives you one automatic failover if the primary goes down: repmgr will (or should) promote the standby, but it is imperative that your old master PostgreSQL instance does not start automatically (think hardware outage), as that would make HAProxy mark both nodes as acceptable.

You can partly mitigate this with the backup directive, but there is always a bias towards the primary, so you need to disable auto-starting your databases (on all nodes, in fact, since their roles can change). After that you have two options: 1) investigate the old master to see what happened, or 2) don't bother, and use repmgr to clone from the new master node and turn the failed node into a standby.
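
For option 2, the steps look roughly like this (a sketch; hostnames, users and paths are placeholders, and the exact flags depend on your repmgr version):

```bash
# Sketch of option 2: wipe the failed old master and re-attach it as a
# standby of the newly promoted primary. Hostnames, users and paths are
# placeholders; check the flags against your repmgr version's docs.

# On the failed node, with PostgreSQL stopped:
repmgr -h new-primary.example.com -U repmgr -d repmgr \
       -f /etc/repmgr.conf standby clone --force

# Start PostgreSQL again, then re-register the node as a standby:
pg_ctl -D /var/lib/pgsql/data -w start
repmgr -f /etc/repmgr.conf standby register --force
```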