Closed davissp14 closed 1 year ago
That seems reasonable. Is there a separation between voting members & read-only replicas? It seems like as long as you put your possible primary regions near each other then you should be fine. Spanning across countries or continents is going to be rough for a voting quorum no matter what.
Is there a separation between voting members & read-only replicas?
Not currently, but there can be! The main reason there is not currently is to address the possible misconfiguration of the PRIMARY_REGION
environment variable expressed above. It also allows users to run 2 in-region members + 1 out-of-region replica and still be able to meet quorum.
It seems like as long as you put your possible primary regions near each other then you should be fine. Spanning across countries or continents is going to be rough for a voting quorum no matter what.
So right now there should be only be a single primary region, which is reflected by the PRIMARY_REGION
environment variable. The main issue is that the PRIMARY_REGION
environment variable can be "misconfigured". A failed deploy, for example, may result in only a subset of the machines getting updated which under the right conditions could lead to a split-brain.
Context As of right now, quorum is calculated by evaluating each registered replica within the cluster and asking it who it thinks the current primary is.
We achieve two things by doing this:
The reason we evaluate every member within the cluster is because the primary member is unable to know how the
PRIMARY_REGION
value is configured per Machine. ThePRIMARY_REGION
environment variable represents the region holding the primary and dictates which members are eligible for promotion. This value should only be changed when a user needs to issue a regional failover, which requires the end-user to perform an update against every Machine in their cluster with the newPRIMARY_REGION
value. There are a number of ways this process could fail, which presents opportunities for inconsistency.Example case:
This issue can be expressed in a 6 node cluster with members split evenly across two different regions.
Nodes A,B,C residing in ord with
PRIMARY_REGION
set to ord Nodes D,E,F residing in iad withPRIMARY_REGION
set to iadIf quorum only considered "in-region" nodes, then it would be possible for a split-brain to occur across regions with the presence of a misconfigured
PRIMARY_REGION
environment variable.The problem with considering every registered member The problem with this approach is that we start to see an increase in connection timeouts when replicas are placed in regions considerably far away from the primary region. Connection timeouts are currently set to 5 seconds for each registered member. This issue is amplified if the replica is under load. When quorum cannot be reached, then the cluster will go read-only until quorum is restored. This can lead to a bad experience for many users who want to push many of their read-replicas into distant regions.