citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0

Round-robin shard placement policy and shard rebalancer don't work well together #361

Open · ozgune opened this issue 8 years ago

ozgune commented 8 years ago

I'm adding an email thread from a customer interaction below. @metdos -- if you think this issue belongs to the shard rebalancer repo, could you move it there?

For the rebalance, here are my notes:

The rebalance algorithm could use some work. Case:

nodepair                    | count 
----------------------------+-------
(172.20.12.12,172.20.12.13) | 10
(172.20.12.12,172.20.12.14) | 10
(172.20.12.12,172.20.12.15) | 11
(172.20.12.12,172.20.12.16) | 12
(172.20.12.12,172.20.12.17) | 11
(172.20.12.12,172.20.12.18) | 11
(172.20.12.13,172.20.12.14) | 32
(172.20.12.13,172.20.12.18) | 32
(172.20.12.14,172.20.12.15) | 32
(172.20.12.15,172.20.12.16) | 31
(172.20.12.16,172.20.12.17) | 32
(172.20.12.17,172.20.12.18) | 32
(12 rows)

nodepair                    | count 
----------------------------+-------
(172.20.12.12,172.20.12.16) | 1
(172.20.12.12,172.20.12.18) | 1
(172.20.12.12,172.20.12.19) | 62
(172.20.12.13,172.20.12.14) | 32
(172.20.12.13,172.20.12.18) | 32
(172.20.12.14,172.20.12.15) | 32
(172.20.12.15,172.20.12.16) | 31
(172.20.12.15,172.20.12.19) | 1
(172.20.12.16,172.20.12.17) | 32
(172.20.12.17,172.20.12.18) | 32
(10 rows)

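For reference, pair counts like the ones above can be produced with a query along the following lines. This is only a sketch: it assumes the stock pg_dist_shard_placement metadata table and replication factor 2, so that every shard has exactly two placements.

```sql
-- Count how many shards each pair of worker nodes has in common.
-- Assumes every shard has exactly two placements (replication factor 2).
SELECT nodepair, count(*)
FROM (
    SELECT shardid,
           array_agg(nodename ORDER BY nodename) AS nodepair
    FROM pg_dist_shard_placement
    GROUP BY shardid
) shard_pairs
GROUP BY nodepair
ORDER BY nodepair;
```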
I would think the best distribution would be to have C(nodes, replication) pairs of nodes, each holding a roughly equal number of exclusive shards. In my 8-node, 256-shard, 2x-replication case, there would be 28 unique pairs of nodes possible, each holding only 9 or 10 exclusive shards, which minimizes the data lost when two nodes fail and helps ensure that load and storage don't cluster around a couple of nodes. I understand that this is not a simple problem, but I would think a rebalance should move toward that goal rather than rapidly away from it.
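As a quick sanity check of those numbers (plain arithmetic, nothing Citus-specific):

```sql
-- 8 nodes, replication factor 2, 256 shards:
-- C(8,2) = 28 distinct node pairs, so each pair should hold 9 or 10 shards.
SELECT 8 * 7 / 2         AS distinct_node_pairs,  -- 28
       256 / 28          AS min_shards_per_pair,  -- 9
       ceil(256 / 28.0)  AS max_shards_per_pair;  -- 10
```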

metdos commented 8 years ago

Hey @ozgune,

We have multiple shard placement policies, as stated in #358, but we don't keep this information per relation. The shard rebalancer needs to know the shard placement policy in order to respect it.

As an alternative solution to #358, we can define a shard placement policy per relation; whoever creates a new shard or moves an existing one then becomes responsible for following that policy.

It would be nice to have one unified shard placement policy, but looking at the email above, users have different expectations:

i. Decrease the probability of losing any data when two nodes are lost (round-robin placement policy).
ii. Minimize the amount of data lost when two nodes are lost (random shard placement policy).

Let's think about the initial cluster above. There are 6 nodes and 15 different pairs of nodes.

i. Round-robin policy: 6 pairs of nodes would each lose 42 or 43 shards, but the other 9 pairs would not lose any shards if two nodes fail.

ii. Random policy: each of the 15 node pairs would lose 17 or 18 shards if those two nodes fail.
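Taking these figures at face value (256 shards, replication factor 2, 6 worker nodes), the per-pair shard counts work out as follows; this is just the arithmetic behind the two scenarios:

```sql
-- Round-robin spreads 256 shards over the 6 "adjacent" node pairs;
-- random placement spreads them over all 15 possible pairs.
SELECT round(256 / 6.0, 1)  AS shards_per_pair_round_robin,  -- ~42.7
       round(256 / 15.0, 1) AS shards_per_pair_random;       -- ~17.1
```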

If you look carefully, you can see that you can't change the expected number of shards lost when two nodes fail; you can only change how you distribute the risk.

If losing 17 shards is just as bad for you as losing 42, you can go with the round-robin policy (i) and have a 60% lower chance of losing any data in this example. If you go with the random policy (ii), you increase the chance of losing some data, but you cap the loss at 17 or 18 shards in this example.
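To make the trade-off concrete, here is the same arithmetic under the (illustrative) assumption that a uniformly random pair of the 6 nodes fails: the expected loss is identical under both policies, only the probability of losing anything at all changes.

```sql
-- Expected shards lost when a uniformly random pair of nodes fails,
-- and the reduction in the probability of losing anything at all.
SELECT round((6 / 15.0)  * (256 / 6.0),  1) AS expected_loss_round_robin,  -- ~17.1
       round((15 / 15.0) * (256 / 15.0), 1) AS expected_loss_random,       -- ~17.1
       round(1 - 6 / 15.0, 2)               AS round_robin_risk_reduction; -- 0.60
```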

The customer above wants (ii), but gets (i) with the default round-robin policy. The shard rebalancer just needs to know which policy is in use so that it can respect it.