fly-apps / postgres-flex

Postgres HA setup using repmgr

Quorum calculation issues #191

Closed. davissp14 closed this issue 1 year ago.

davissp14 commented 1 year ago

Context

As of right now, quorum is calculated by evaluating each registered replica within the cluster and asking it who it thinks the current primary is.

We achieve two things by doing this:

  1. Confirm that the target members are reachable.
  2. Build quorum by confirming that a majority of the cluster agrees on who the primary is.
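
To make the mechanics concrete, here is a rough sketch of that check. The names (Member, currentPrimary) are hypothetical stand-ins, not the actual postgres-flex code:

```go
import (
	"context"
	"time"
)

// Member identifies a registered replica in the cluster.
type Member struct{ Hostname string }

// currentPrimary is a hypothetical helper: connect to the member and ask
// which node it currently believes is the primary.
func currentPrimary(ctx context.Context, m Member) (string, error) {
	// ... query the member over its Postgres connection ...
	return "", nil
}

// quorumMet returns true when a majority of the cluster, including the
// primary's own vote, agrees on who the primary is.
func quorumMet(ctx context.Context, members []Member, expectedPrimary string) bool {
	agree := 1 // the primary agrees with itself
	total := len(members) + 1
	for _, m := range members {
		// Each member gets a bounded connection window (5 seconds today) so a
		// single slow or unreachable replica can't hang the whole evaluation.
		mctx, cancel := context.WithTimeout(ctx, 5*time.Second)
		primary, err := currentPrimary(mctx, m)
		cancel()
		if err != nil {
			continue // unreachable members don't count toward quorum
		}
		if primary == expectedPrimary {
			agree++
		}
	}
	return agree > total/2
}
```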

We evaluate every member within the cluster because the primary member has no way to know how the PRIMARY_REGION value is configured on each Machine. The PRIMARY_REGION environment variable identifies the region holding the primary and dictates which members are eligible for promotion. It should only change when a user needs to perform a regional failover, which requires the end user to update every Machine in their cluster with the new PRIMARY_REGION value. There are a number of ways this process can fail, which opens the door to inconsistency.

Example case:

This issue can be illustrated with a 6-node cluster whose members are split evenly across two regions.

Nodes A, B, C reside in ord with PRIMARY_REGION set to ord.
Nodes D, E, F reside in iad with PRIMARY_REGION set to iad.

If quorum only considered "in-region" nodes, a misconfigured PRIMARY_REGION environment variable could produce a split-brain across regions: each 3-node region could reach a majority among its own members and promote its own primary, leaving two writable primaries.

The problem with considering every registered member

The problem with this approach is that we start to see an increase in connection timeouts when replicas are placed in regions considerably far from the primary region. Connection timeouts are currently set to 5 seconds per registered member, and the problem is amplified when a replica is under load. When quorum cannot be reached, the cluster goes read-only until quorum is restored. That can be a bad experience for the many users who want to push their read replicas into distant regions.
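
To illustrate the failure mode, here is a rough sketch of the monitoring side, reusing quorumMet from the sketch above. setReadWrite, setReadOnly, and the 30-second interval are assumptions for illustration, not the repo's actual behavior:

```go
// monitorQuorum periodically re-evaluates quorum and flips the primary
// between read-write and read-only. In the worst case each unreachable
// member costs the full 5-second timeout, so a handful of distant or
// overloaded replicas can tip the cluster into read-only mode even though
// the primary itself is healthy.
func monitorQuorum(ctx context.Context, members []Member, self string) {
	ticker := time.NewTicker(30 * time.Second) // assumed check interval
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if quorumMet(ctx, members, self) {
				setReadWrite(ctx) // hypothetical: re-enable writes
			} else {
				setReadOnly(ctx) // hypothetical: block writes until quorum returns
			}
		}
	}
}
```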

benbjohnson commented 1 year ago

That seems reasonable. Is there a separation between voting members & read-only replicas? It seems like as long as you put your possible primary regions near each other then you should be fine. Spanning across countries or continents is going to be rough for a voting quorum no matter what.

davissp14 commented 1 year ago

Is there a separation between voting members & read-only replicas?

Not currently, but there can be! The main reason we don't separate them today is to guard against the possible PRIMARY_REGION misconfiguration described above. Counting every member also lets users run 2 in-region members + 1 out-of-region replica and still meet quorum, since any 2 of the 3 members form a majority.

It seems like as long as you put your possible primary regions near each other then you should be fine. Spanning across countries or continents is going to be rough for a voting quorum no matter what.

So right now there should only be a single primary region, which is reflected by the PRIMARY_REGION environment variable. The main issue is that PRIMARY_REGION can be "misconfigured": a failed deploy, for example, may leave only a subset of the Machines updated, which under the right conditions could lead to a split-brain.
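
One way to surface that drift would be a consistency check before promotion. This is only an illustrative sketch, not something the repo does today, and memberPrimaryRegion is a hypothetical helper:

```go
// primaryRegionConsistent reports whether every reachable member agrees on
// the PRIMARY_REGION value it was deployed with.
func primaryRegionConsistent(ctx context.Context, members []Member) (bool, error) {
	seen := map[string]bool{}
	for _, m := range members {
		region, err := memberPrimaryRegion(ctx, m) // hypothetical helper
		if err != nil {
			return false, err
		}
		seen[region] = true
	}
	// More than one distinct value means a deploy only partially rolled out a
	// new PRIMARY_REGION; promotion could be refused until the values converge.
	return len(seen) <= 1, nil
}
```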