Netflix / mantis

A platform that makes it easy for developers to build realtime, cost-effective, operations-focused applications
Apache License 2.0
1.42k stars 202 forks source link

ConsistentHashingRouter: Use XXH3 instead of 4 byte ketama for consis… #715

Closed crioux-stripe closed 1 month ago

crioux-stripe commented 1 month ago

Eliminates a repeatable hash collision with a 40 worker stage 2. We can either go with extending the bits in ketama or using xxh3. I'm recommending xxh3.

Context

We ran into hash collisions when hashing slots on a 40 worker stage 2. This was a bit unexpected but very repeatable in the unit tests. If either of the unit tests added are run with the ketama hash function used in Mantis by default the tests will fail. One test will compute 40,000 hashes per call to route. The other will compute 39999 unique hashes from 40000 ring buffer entries. The values used are real values generated by a 40 worker stage 2 on a GroupToScalar stage.

Note for Netflix: I'd check logs from stage 1 of any of your jobs that have over 28 or more indices on the second stage. Its very likely that the difference in resource cluster types (you're using higher CPU to Network ratio boxes as well as Intel vs our ARM) is hiding the fact that some jobs might be doing ring recomputation on every chunk send. You should see the log line Recomputing ring due to difference in number of connections versus cache being spammed very frequently on impacted jobs.

Our investigation shows:

stage_2_index_28_partition_1-241 collided with: 
stage_2_index_2_partition_2-213
stage_2_index_6_partition_1-126 collided with: 
stage_2_index_34_partition_2-876

Checklist

crioux-stripe commented 1 month ago

I've added a test showing even a 500 worker GroupToScalar stage with 4 partitions (double the default) will not cause a collision under the new scheme.

crioux-stripe commented 1 month ago

Also cc @Andyz26 @hmit @calvin681 @sundargates you all probably have impacted jobs internally so this one should probably be high priority.

github-actions[bot] commented 1 month ago

Test Results

539 tests  +3   533 :white_check_mark: +3   7m 54s :stopwatch: +14s 139 suites +1     6 :zzz: ±0  139 files   +1     0 :x: ±0 

Results for commit b5a81e6f. ± Comparison against base commit a59f5f65.

:recycle: This comment has been updated with latest results.

Andyz26 commented 1 month ago

thanks a lot @crioux-stripe! I searched around, and I don't see impacted jobs on our side yet (my guess is that the multi-stage jobs we have tend to have very small stage 2+, so it's less likely to hit the collision).