What would the take-home message be from this optimization?
Rust inserts bound checks on every slice indexing when it's not able to prove that the indexing will not be out of bounds. For example, this will not result in any bound checks being inserted:

```rust
for i in 0..v.len() {
    v[i] = 3;
}
```
since `i` clearly goes from `0` to `v.len() - 1`. While Rust can figure out on its own that some indexing expressions will never be out of bounds (e.g. `v[i - 1]` in a loop from `1` to `v.len()`), it was not able to figure out that our `self.evaluations[i << 1]` and `self.evaluations[(i << 1) + 1]` would never be out of bounds, so it inserted bound checks.
A good tool to figure that out is `cargo-show-asm`, which lets you inspect the generated assembly of a function you specify. I'd recommend the bounds-check-cookbook for a good introduction to the subject. Otherwise, you can also just run a relevant benchmark and see if it improved.
Bound checks are bad for two reasons: the check itself adds a compare-and-branch in the hot path, and its presence can prevent the compiler from applying other optimizations, such as vectorized loads/stores.
In our case, both versions of the code vectorized the loads of `self.evaluations[i << 1]` and `self.evaluations[(i << 1) + 1]`; the only difference between the two generated assemblies was the removal of two branches (due to bound checks) in the optimized version. Since this is done in a hot loop, I suppose it ended up costing a nontrivial amount. However, I'm hoping to find cases elsewhere where inserted bound checks also prevent the use of vectorized loads/stores, in which case we should see an even bigger performance boost.
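For reference, one common way to let the compiler drop such branches is to iterate over the pairs instead of indexing. Below is a sketch under the same assumptions as above (hypothetical name, `u64` standing in for the field type), not necessarily the exact change made in this PR:

```rust
/// Same computation as the indexing sketch, but driven by `chunks_exact(2)`:
/// each `pair` is a slice of length exactly 2, so `pair[0]` / `pair[1]` should
/// compile without bounds checks and the two loads can be paired (e.g. `ldp`
/// on aarch64). Worth confirming with cargo-show-asm.
fn bind_by_chunks(evaluations: &[u64], r: u64) -> Vec<u64> {
    evaluations
        .chunks_exact(2)
        .map(|pair| {
            let (even, odd) = (pair[0], pair[1]);
            // `wrapping_*` stands in for the actual field arithmetic.
            even.wrapping_add(r.wrapping_mul(odd.wrapping_sub(even)))
        })
        .collect()
}
```

Returning a new vector keeps the sketch short; an in-place variant is also possible (e.g. with `split_at_mut`) at the cost of a bit more code.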
Note that with Goldilocks we're playing with "big" `u64` values, so the vectorization opportunities are typically limited to bundling two loads/stores together (`ldp` and `stp` on aarch64), since e.g. on aarch64 the vector registers are 128 bits wide. But this is basically the basis of the argument for using 32-bit-based fields :).
Makes sense, thank you for the great explanation!
This PR removes the slice bound checks in `MultiLinearPoly::bind_least_significant_variable`.

In the serial case, we see a 7-9% improvement in the `bind-variable` benchmark, and sum-check performance improves by 2.5-3.5%. Parallel performance doesn't change, presumably because the algorithm is bounded by the threading overhead rather than by the time each thread spends looping over its chunk.

The LogUp-GKR benchmark also didn't change, since `bind_least_significant_variable` was not a bottleneck. However, I see this PR as a proof of concept, and will apply similar bound-check removals to any hot loop I find in a subsequent PR.

Below are the serial bind-variable and sum-check benchmark results: