citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0

Round-robin shard placement policy and shard rebalancer don't work well together #361

Open · ozgune opened this issue 8 years ago

ozgune commented 8 years ago

I'm adding an email thread from a customer interaction below. @metdos -- if you think this issue belongs to the shard rebalancer repo, could you move it there?

For the rebalance, here are my notes:

The rebalance algorithm could use some work. Case:

nodepair                    | count 
----------------------------+-------
(172.20.12.12,172.20.12.13) | 10
(172.20.12.12,172.20.12.14) | 10
(172.20.12.12,172.20.12.15) | 11
(172.20.12.12,172.20.12.16) | 12
(172.20.12.12,172.20.12.17) | 11
(172.20.12.12,172.20.12.18) | 11
(172.20.12.13,172.20.12.14) | 32
(172.20.12.13,172.20.12.18) | 32
(172.20.12.14,172.20.12.15) | 32
(172.20.12.15,172.20.12.16) | 31
(172.20.12.16,172.20.12.17) | 32
(172.20.12.17,172.20.12.18) | 32
(12 rows)

nodepair                    | count 
----------------------------+-------
(172.20.12.12,172.20.12.16) | 1
(172.20.12.12,172.20.12.18) | 1
(172.20.12.12,172.20.12.19) | 62
(172.20.12.13,172.20.12.14) | 32
(172.20.12.13,172.20.12.18) | 32
(172.20.12.14,172.20.12.15) | 32
(172.20.12.15,172.20.12.16) | 31
(172.20.12.15,172.20.12.19) | 1
(172.20.12.16,172.20.12.17) | 32
(172.20.12.17,172.20.12.18) | 32
(10 rows)

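For reference, pair counts like the ones above can be produced with a query along the following lines. This is only a sketch: it assumes the stock pg_dist_shard_placement metadata table and replication factor 2, so that every shard has exactly two placements.

```sql
-- Count how many shards each pair of worker nodes has in common.
-- Assumes every shard has exactly two placements (replication factor 2).
SELECT nodepair, count(*)
FROM (
    SELECT shardid,
           array_agg(nodename ORDER BY nodename) AS nodepair
    FROM pg_dist_shard_placement
    GROUP BY shardid
) shard_pairs
GROUP BY nodepair
ORDER BY nodepair;
```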
I would think the best distribution would be to have C(nodes, replication) pairs of nodes, each holding a roughly equal number of exclusive shards. In my 8-node, 256-shard, 2x-replication case, there would be 28 unique pairs of nodes possible, each holding only 9 or 10 exclusive shards, which minimizes the data lost when two nodes fail and helps ensure that load and storage don't cluster around a couple of nodes. I understand that this is not a simple problem, but I would think a rebalance should move toward that goal rather than rapidly away from it.
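As a quick sanity check of those numbers (plain arithmetic, nothing Citus-specific):

```sql
-- 8 nodes, replication factor 2, 256 shards:
-- C(8,2) = 28 distinct node pairs, so each pair should hold 9 or 10 shards.
SELECT 8 * 7 / 2         AS distinct_node_pairs,  -- 28
       256 / 28          AS min_shards_per_pair,  -- 9
       ceil(256 / 28.0)  AS max_shards_per_pair;  -- 10
```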

metdos commented 8 years ago

Hey @ozgune,

We have multiple shard placement policies, as stated in #358, but we don't keep this information per relation. The shard rebalancer needs to know the shard placement policy in order to respect it.

As an alternative solution to #358, we can define a shard placement policy per relation; whoever creates a new shard or moves an existing one then becomes responsible for following that policy.

It would be nice to have one unified shard placement policy, but looking at the email above, users have different expectations:

i. Decrease the probability of losing any data when two nodes are lost (round-robin placement policy).
ii. Minimize the amount of data lost when two nodes are lost (random shard placement policy).

Let's think about the initial cluster above. There are 6 nodes and 15 different pairs of nodes.

i. Round-robin policy: 6 pairs of nodes would each lose 42 or 43 shards, but the other 9 pairs would not lose any shards if two nodes fail.

ii. Random policy: each of the 15 node pairs would lose 17 or 18 shards if those two nodes fail.
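Taking these figures at face value (256 shards, replication factor 2, 6 worker nodes), the per-pair shard counts work out as follows; this is just the arithmetic behind the two scenarios:

```sql
-- Round-robin spreads 256 shards over the 6 "adjacent" node pairs;
-- random placement spreads them over all 15 possible pairs.
SELECT round(256 / 6.0, 1)  AS shards_per_pair_round_robin,  -- ~42.7
       round(256 / 15.0, 1) AS shards_per_pair_random;       -- ~17.1
```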

If you look carefully, you can see that you can't change the expected number of shards lost when two nodes fail; you can only change how you distribute the risk.

If losing 17 shards is just as bad for you as losing 42, you can go with the round-robin policy (i) and have a 60% lower chance of losing any data in this example. If you go with the random policy (ii), you increase the chance of losing some data, but you cap the loss at 17 or 18 shards in this example.
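To make the trade-off concrete, here is the same arithmetic under the (illustrative) assumption that a uniformly random pair of the 6 nodes fails: the expected loss is identical under both policies, only the probability of losing anything at all changes.

```sql
-- Expected shards lost when a uniformly random pair of nodes fails,
-- and the reduction in the probability of losing anything at all.
SELECT round((6 / 15.0)  * (256 / 6.0),  1) AS expected_loss_round_robin,  -- ~17.1
       round((15 / 15.0) * (256 / 15.0), 1) AS expected_loss_random,       -- ~17.1
       round(1 - 6 / 15.0, 2)               AS round_robin_risk_reduction; -- 0.60
```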

The customer above wants (ii), but gets (i) with the default round-robin policy. The shard rebalancer just needs to know which policy is in use so that it can respect it.