perf(fixed-deque): avoid costly modulo op in hot code path

Yomguithereal commented 5 months ago

@jerome-benoit did you benchmark this? You are indeed getting rid of the modulo operation but you are adding and if condition, so branching, which could also cost more than the modulo operation.

jerome-benoit commented 5 months ago

The fastpath in the modulo algorithm is doing more than one basic ops, use a loop on a condition (branch multiple times). I've benchmarked it some times ago in https://github.com/poolifier/poolifier/blob/master/benchmarks/worker-selection/round-robin.mjs.

The results are still on node 18.x:

~/SAPDevelop/poolifier-git/benchmarks/worker-selection (master ✔) ᐅ node round-robin.mjs 
Ternary off by one x 126,453,435 ops/sec ±0.11% (98 runs sampled)
Ternary with negation x 125,223,674 ops/sec ±0.16% (98 runs sampled)
Ternary with pre-choosing x 126,727,332 ops/sec ±0.12% (96 runs sampled)
Increment+Modulo x 126,534,434 ops/sec ±0.11% (100 runs sampled)
Fastest is Ternary with pre-choosing

Yomguithereal commented 5 months ago

@jerome-benoit I don't understand. You benchmark results are showing that there is no significant difference between any of the tested methods at all.

jerome-benoit commented 5 months ago

It's just a micro-optimisation that is consistently behaving because can't have slowpath like the euclidean division algo used in modulo. No more worst case and the usual case is still a little faster.

jerome-benoit commented 5 months ago

Results with an integer value triggering the worst case euclidian division code path:

~/SAPDevelop/poolifier-git/benchmarks/worker-selection (master ✔) ᐅ node round-robin.mjs
Ternary off by one x 126,177,714 ops/sec ±0.21% (95 runs sampled)
Ternary with negation x 125,207,891 ops/sec ±0.19% (98 runs sampled)
Ternary with pre-choosing x 124,669,268 ops/sec ±0.81% (97 runs sampled)
Increment+Modulo x 124,325,521 ops/sec ±0.44% (95 runs sampled)
Fastest is Ternary off by one

Results are consistent across run between values triggering the worst case or not on a non noisy env, unlike the one I'm currently using. That's was the base that made us used a ternary condition in the RR poolifier implementation some times ago. Latest node version seems to have a better euclidian division algo than in 16x and previous. But still, if a code path can avoid to trigger the euclidian division division algo, even done in C and optimized, I guess it's a good idea to replace it ;)

Yomguithereal / mnemonist

perf(fixed-deque): avoid costly modulo op in hot code path #221