What would the take-home message be from this optimization?
Rust inserts bound checks on every slice indexing when it's not able to prove that the indexing will not be out of bounds. For example, this will not result in any bound checks being inserted:

```rust
for i in 0..v.len() {
    v[i] = 3;
}
```
since `i` clearly goes from `0` to `v.len() - 1`. While Rust can figure out on its own that some indexing expressions will never be out of bounds (e.g. `v[i - 1]` in a loop from `1` to `v.len()`), it was not able to figure out that our `self.evaluations[i << 1]` and `self.evaluations[(i << 1) + 1]` would never be out of bounds, so it inserted bound checks.
A good tool to figure that out is `cargo-show-asm`, which lets you inspect the generated assembly of a function you specify. I'd recommend the bounds-check-cookbook for a good introduction to the subject. Otherwise, you can also just run a relevant benchmark and see if it improved.
Bound checks are bad for two reasons: the check itself adds a compare-and-branch in the hot path, and its presence can prevent the compiler from applying other optimizations, such as vectorized loads/stores.
In our case, both versions of the code vectorized the loads of `self.evaluations[i << 1]` and `self.evaluations[(i << 1) + 1]`; the only difference between the two generated assemblies was the removal of two branches (due to bound checks) in the optimized version. Since this is done in a hot loop, I suppose it ended up costing a nontrivial amount. However, I'm hoping to find cases elsewhere where inserted bound checks also prevent the use of vectorized loads/stores, in which case we should see an even bigger performance boost.
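For reference, one common way to let the compiler drop such branches is to iterate over the pairs instead of indexing. Below is a sketch under the same assumptions as above (hypothetical name, `u64` standing in for the field type), not necessarily the exact change made in this PR:

```rust
/// Same computation as the indexing sketch, but driven by `chunks_exact(2)`:
/// each `pair` is a slice of length exactly 2, so `pair[0]` / `pair[1]` should
/// compile without bounds checks and the two loads can be paired (e.g. `ldp`
/// on aarch64). Worth confirming with cargo-show-asm.
fn bind_by_chunks(evaluations: &[u64], r: u64) -> Vec<u64> {
    evaluations
        .chunks_exact(2)
        .map(|pair| {
            let (even, odd) = (pair[0], pair[1]);
            // `wrapping_*` stands in for the actual field arithmetic.
            even.wrapping_add(r.wrapping_mul(odd.wrapping_sub(even)))
        })
        .collect()
}
```

Returning a new vector keeps the sketch short; an in-place variant is also possible (e.g. with `split_at_mut`) at the cost of a bit more code.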
Note that with Goldilocks we're playing with "big" `u64` values, so the vectorization opportunities are typically limited to bundling two loads/stores together (`ldp` and `stp` on aarch64), since e.g. on aarch64 the vector registers are 128 bits wide. But this is basically the basis of the argument for using 32-bit-based fields :).
Makes sense, thank you for the great explanation!
This PR removes the slice bound checks in `MultiLinearPoly::bind_least_significant_variable`.

In the serial case, we see a 7-9% improvement in the `bind-variable` benchmark, and sum-check performance improves by 2.5-3.5%. Parallel performance doesn't change, presumably because the algorithm is bounded by the threading overhead rather than by the time each thread spends looping over its chunk.

The LogUp-GKR benchmark also didn't change, since `bind_least_significant_variable` was not a bottleneck. However, I see this PR as a proof of concept, and will apply similar bound-check removals to any hot loop I find in a subsequent PR.

Below are the serial bind-variable and sum-check benchmark results: