citusdata / pg_shard

ATTENTION: pg_shard is superseded by Citus, its more powerful replacement
https://github.com/citusdata/citus
GNU Lesser General Public License v3.0
1.06k stars 63 forks source link

Fix hash token distribution algorithm #146

Closed jasonmp85 closed 8 years ago

jasonmp85 commented 8 years ago

There are 2^32 distinct "hash tokens" in our hash space, but we were using INT32_MAX (2^32 - 1) in the code instead. Because of this, shard counts which one might expect to divide evenly into the space (such as 16, 32, or 256) had fewer tokens than they should have. The remainder of the tokens were stuffed into the last shard, causing uneven load.

This fixes that by defining a 64-bit constant equal to 2^32 and then using that to calculate the distance between shards in the hash space.

Unlike #145, in "weird" shard counts, the first n-1 shards will have one size and the final shard will have all leftover tokens. This means the final shard could contain up to n-1 more hash tokens than the other shards, which might have load implications. In #145, all shards are within one token of each other.

Look at the tests in each review to get an idea of the two approaches.

ozgune commented 8 years ago

Quick question. Is this pull request good to go?

jasonmp85 commented 8 years ago

@ozgune I'm going to clean it up and merge it, yes. I'll get an update out to @onderkalaci for him to sign off on tonight.