Closed: crioux-stripe closed this 1 month ago
I've added a test showing that even a 500 worker `GroupToScalar` stage with 4 partitions (double the default) will not cause a collision under the new scheme.
Also cc @Andyz26 @hmit @calvin681 @sundargates: you all probably have impacted jobs internally, so this one should probably be high priority.
| 539 tests +3 | 533 :white_check_mark: +3 | 7m 54s :stopwatch: +14s |
|:---|:---|:---|
| 139 suites +1 | 6 :zzz: ±0 | |
| 139 files +1 | 0 :x: ±0 | |
Results for commit b5a81e6f. ± Comparison against base commit a59f5f65.
thanks a lot @crioux-stripe! I searched around, and I don't see impacted jobs on our side yet (my guess is that the multi-stage jobs we have tend to have very small stage 2+, so it's less likely to hit the collision).
Eliminates a repeatable hash collision with a 40 worker stage 2. We can either go with extending the bits in `ketama` or using `xxh3`. I'm recommending `xxh3`.
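
For a rough sense of why widening the hash space helps, here is a back-of-the-envelope sketch (not part of this PR) applying the birthday bound: among k uniformly distributed n-bit hashes, the expected number of colliding pairs is roughly k(k-1)/2^(n+1). The 40,000-entry count mirrors the ring size described under Context below; the 32-bit vs. 64-bit widths are illustrative assumptions about a ketama-style hash versus `xxh3`, not measurements of the actual implementations. Even when the expected count is below one, a specific real-world key set can still collide deterministically, which is consistent with the repeatable failure described below.

```java
/**
 * Back-of-the-envelope birthday-bound estimate: the expected number of
 * colliding pairs among k uniformly distributed n-bit hash values is
 * approximately k * (k - 1) / 2^(n + 1).
 *
 * The 40,000-entry figure mirrors the ring size mentioned in this PR;
 * the 32-bit and 64-bit widths are illustrative assumptions only.
 */
public final class CollisionEstimateSketch {

    static double expectedCollidingPairs(long entries, int hashBits) {
        return (double) entries * (entries - 1) / Math.pow(2.0, hashBits + 1);
    }

    public static void main(String[] args) {
        long ringEntries = 40_000L;
        System.out.printf("32-bit hash: ~%.3f expected colliding pairs%n",
                expectedCollidingPairs(ringEntries, 32));
        System.out.printf("64-bit hash: ~%.3e expected colliding pairs%n",
                expectedCollidingPairs(ringEntries, 64));
    }
}
```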
## Context
We ran into hash collisions when hashing slots on a 40 worker stage 2. This was a bit unexpected but very repeatable in the unit tests. If either of the added unit tests is run with the `ketama` hash function used in Mantis by default, the tests will fail. One test will compute 40,000 hashes per call to `route`. The other will compute 39,999 unique hashes from 40,000 ring buffer entries. The values used are real values generated by a 40 worker stage 2 on a `GroupToScalar` stage.
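
For illustration only, that style of uniqueness check can be sketched as below. The slot-key format, replica count, and 64-bit placeholder hash are assumptions made for the sketch; the PR's tests use real slot values from a 40 worker `GroupToScalar` stage and the actual `ketama`/`xxh3` implementations.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Illustrative uniqueness check: build one ring entry per (worker, replica)
 * pair, hash each entry, and compare the distinct-hash count to the entry
 * count. The key format, replica count, and hash function are placeholders.
 */
public final class RingUniquenessSketch {

    public static void main(String[] args) {
        int workers = 40;
        int replicasPerWorker = 1_000; // assumption: 40 * 1,000 = 40,000 ring entries

        Set<Long> hashes = new HashSet<>();
        int entries = 0;
        for (int w = 0; w < workers; w++) {
            for (int r = 0; r < replicasPerWorker; r++) {
                String entry = "worker-" + w + "-replica-" + r; // placeholder slot key
                hashes.add(placeholderHash64(entry));
                entries++;
            }
        }

        // A collision shows up as fewer unique hashes than entries
        // (e.g. 39,999 unique hashes from 40,000 entries).
        System.out.printf("%d entries -> %d unique hashes%n", entries, hashes.size());
    }

    // Placeholder 64-bit hash purely for this sketch; the real code paths
    // use ketama today and xxh3 under the proposed scheme.
    private static long placeholderHash64(String s) {
        long h = 1125899906842597L;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }
}
```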
Note for Netflix: I'd check logs from stage 1 of any of your jobs that have 28 or more indices on the second stage. It's very likely that the difference in resource cluster types (you're using higher CPU to network ratio boxes, as well as Intel vs. our ARM) is hiding the fact that some jobs might be doing ring recomputation on every chunk send. You should see the log line `Recomputing ring due to difference in number of connections versus cache` being spammed very frequently on impacted jobs.

Our investigation shows:
## Checklist

- `./gradlew build` compiles code correctly
- `./gradlew test` passes all tests