Closed · aszepieniec closed this 1 month ago
Comparing profiles for Fib 100, one finds almost identical timings except for the out-of-domain rows. On master:
├─out-of-domain rows 2.88s
on this branch:
├─out-of-domain rows 36.50s
The discrepancy comes from having to use barycentric evaluation instead of Horner evaluation (since we do not have the polynomials anymore). If barycentric evaluation cannot be made any faster, perhaps we should consider caching the polynomials on the caching code path after all.
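For context: Horner evaluation needs the polynomial in coefficient form, which is exactly what the cached interpolants provide, whereas barycentric evaluation works directly on the column's values over the evaluation domain. A minimal Horner sketch for contrast (plain f64 for illustration rather than the prover's field types; the function name is mine):

```rust
/// Horner's rule: evaluate a polynomial given by its coefficients
/// (constant term first) at the point `z`.
/// Illustrative sketch; the real code operates on finite-field elements.
fn horner_evaluate(coefficients: &[f64], z: f64) -> f64 {
    // Fold from the highest-degree coefficient down: acc = acc * z + c.
    coefficients.iter().rev().fold(0.0, |acc, &c| acc * z + c)
}
```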
IIRC our current (host machine) barycentric evaluation uses memory allocation. In the PR that introduces (host machine) barycentric evaluation, I proposed an alternative version that doesn't allocate. It's worth trying that one to see if it solves the problem.
Try implementing barycentric evaluation like this instead. It avoids the allocation and might give a meaningful speedup.
The inner for-loop in that implementation can even be parallelized with a rayon-scan higher-order primitive.
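A minimal sketch of what such an allocation-free version could look like (illustrative only: plain f64 instead of the prover's field types, and the barycentric weights are taken as an input, although for the structured domains used in practice they have a closed form):

```rust
/// Second-form barycentric evaluation in a single pass:
///
///     p(z) = ( Σ_i w_i · f_i / (z − d_i) ) / ( Σ_i w_i / (z − d_i) )
///
/// Both sums are accumulated on the fly, so no intermediate vector of
/// per-point terms is allocated.
fn barycentric_evaluate(domain: &[f64], weights: &[f64], values: &[f64], z: f64) -> f64 {
    debug_assert_eq!(domain.len(), weights.len());
    debug_assert_eq!(domain.len(), values.len());

    let mut numerator = 0.0;
    let mut denominator = 0.0;
    for i in 0..domain.len() {
        // At a domain point the interpolant's value is known directly.
        if z == domain[i] {
            return values[i];
        }
        let term = weights[i] / (z - domain[i]);
        numerator += term * values[i];
        denominator += term;
    }
    numerator / denominator
}
```

Both running sums are reductions over an associative addition, so the loop also lends itself to rayon's fold/reduce (or the scan primitive mentioned above) without reintroducing an allocation.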
Commit 96ca4bd5344729942c175bf680701513a5c76c1f uses barycentric evaluation instead of coset extrapolation. Timings taken on my laptop for 10 samples of prove_fib with fibonacci parameter 10000:
master | coset extrapolation | barycentric |
---|---|---|
560.5 s | 697.5 s | 595.6 s |
So barycentric is better than coset extrapolation, as expected. The case for caching the polynomials in the caching code path stands.
The mystery remains: why did the memory-efficient code path not get triggered during my benchmarks yesterday? (Or, if it did get triggered, why were the memory savings so small?)
Commit 7bb0ed1 uses a batching variant of barycentric evaluation. The relevant step (computing out-of-domain rows) seems to be faster than Horner, which requires cached polynomials. Unless I am mistaken, there is no reason to cache the polynomials for either code path.
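I don't know exactly how 7bb0ed1 batches, but one natural way to batch (sketched below with plain f64 and hypothetical names) is to exploit that the whole out-of-domain row is evaluated at the same point z: the factors w_i / (z − d_i) and the common denominator can be computed once and shared across all columns, so each column only costs a dot product.

```rust
/// Batched barycentric evaluation of one out-of-domain row (illustrative sketch).
/// Assumes `z` lies outside `domain`, which holds for out-of-domain points.
fn barycentric_evaluate_row(
    domain: &[f64],
    weights: &[f64],
    columns: &[Vec<f64>], // columns[c][i]: value of column c at domain point i
    z: f64,
) -> Vec<f64> {
    // Shared work: one pass over the domain, independent of the column count.
    let factors: Vec<f64> = domain
        .iter()
        .zip(weights)
        .map(|(&d, &w)| w / (z - d))
        .collect();
    let denominator: f64 = factors.iter().sum();

    // Per-column work: a dot product with the shared factors.
    columns
        .iter()
        .map(|column| {
            let numerator: f64 = column.iter().zip(&factors).map(|(&v, &f)| v * f).sum();
            numerator / denominator
        })
        .collect()
}
```

Compared to the single-pass version this reintroduces one allocation (the shared factors), but it amortizes the divisions over the number of columns, and the per-column dot products parallelize trivially.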
Some benchmarks, obtained on my laptop:
fib parameter | version | caching | memory | time |
---|---|---|---|---|
10 000 | Horner | on | 8.9 GB | 55 s |
40 000 | Horner | off | 18.5 GB | 328 s |
10 000 | naïve barycentric | off | 3.5 GB | 104 s |
40 000 | naïve barycentric | off | 13 GB | 467 s |
10 000 | batch barycentric | off | 3.5 GB | 91 s |
40 000 | batch barycentric | off | 13.1 GB | 437 s |
> (...) Some benchmarks, obtained on my laptop. (...)
On mjolnir, only measuring the "out-of-domain rows" step:
fib parameter | version | time |
---|---|---|
40 000 | Horner | 430.98 ms |
40 000 | batch barycentric | 535 ms |
So even though we see a small slowdown, I think it's worth it to save the RAM and drop the cached interpolants.
Edit: Unless we're using the interpolants elsewhere -- which I think we are!!
Proving with full caching, measured on mjolnir:
This PR:
Prove Fibonacci 40000 time: [29.757 s 29.974 s 30.182 s]
master:
Prove Fibonacci 40000 time: [29.104 s 29.216 s 29.317 s]
change: [-3.3020% -2.5289% -1.7164%] (p = 0.00 < 0.05)
So this PR introduces a 2.5 % slowdown on the fully cached code path. The reason is that some steps become more expensive when the interpolants have to be recalculated.
This PR drops the cached polynomials. They represent the same information as the original randomized trace, except in a different basis. If they are needed, the relevant randomized trace column is interpolated on the fly.
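To make the "same information, different basis" point concrete, here is an illustrative sketch (hypothetical names and plain f64 rather than the crate's actual types; the real code would interpolate a column with an inverse NTT over the trace domain instead of naive Lagrange interpolation):

```rust
/// Sketch of the access pattern after this PR: keep only the randomized trace
/// columns (evaluations over the trace domain) and re-derive a column's
/// interpolant on demand instead of caching all interpolants.
struct TraceTable {
    /// One entry per column: that column's values over the trace domain.
    columns: Vec<Vec<f64>>,
    /// The trace domain over which the columns are evaluated.
    domain: Vec<f64>,
}

impl TraceTable {
    /// Recompute the interpolant of a single column on the fly. The returned
    /// coefficient vector carries the same information as the column itself,
    /// expressed in the monomial basis instead of the evaluation basis.
    fn column_interpolant(&self, index: usize) -> Vec<f64> {
        interpolate(&self.domain, &self.columns[index])
    }
}

/// Naive O(n²) Lagrange interpolation, included only to keep the sketch
/// self-contained; `result[k]` is the coefficient of x^k.
fn interpolate(domain: &[f64], values: &[f64]) -> Vec<f64> {
    let n = domain.len();
    let mut result = vec![0.0; n];
    for i in 0..n {
        // Build the i-th Lagrange basis numerator ∏_{j≠i} (x − d_j) in
        // coefficient form, and its scaling factor ∏_{j≠i} (d_i − d_j).
        let mut basis = vec![0.0; n];
        basis[0] = 1.0;
        let mut degree = 0;
        let mut scale = 1.0;
        for j in 0..n {
            if j == i {
                continue;
            }
            // Multiply the basis polynomial by (x − d_j).
            for k in (1..=degree + 1).rev() {
                basis[k] = basis[k - 1] - domain[j] * basis[k];
            }
            basis[0] *= -domain[j];
            degree += 1;
            scale *= domain[i] - domain[j];
        }
        for k in 0..n {
            result[k] += values[i] * basis[k] / scale;
        }
    }
    result
}
```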
Results

I ran benchmark prove_fib to gauge the performance difference between master and this branch. The results are a little confusing.

Interpretation
The cached polynomials seem to account for only a small fraction of the memory cost, whereas I expected them to account for roughly half. The best explanation I can come up with is that the memory-efficient code path is never entered, in which case the low-degree extended trace is cached and that data is roughly 4x larger than the polynomials. But if the memory-efficient code path is never entered, then the slowdown is difficult to explain. (Profiles coming soon.)
If the memory-efficient code path is never entered in the first two rows, then why does the third crash? I would expect that if the low-degree extended trace is not cached, one saves 66% of the memory. So concretely, for growing padded table height I would expect the memory cost to grow until it selects the memory-efficient code path, at which point it drops before growing again, until it crashes. But that does not seem to be happening.
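For what it's worth, here is the back-of-the-envelope accounting behind the "roughly half" and "66%" figures, under my own assumptions (memory measured in units of one randomized trace table, and a low-degree extension that is 4x the trace, as the "4x larger than the polynomials" figure above suggests):

$$
\underbrace{1}_{\text{randomized trace}} \;+\; \underbrace{1}_{\text{interpolants}} \;+\; \underbrace{4}_{\text{cached LDE}} \;=\; 6 .
$$

Dropping only the interpolants saves $1/6 \approx 17\,\%$ of that total, which is a small fraction, matching the observation above; dropping the cached low-degree extension saves $4/6 \approx 66\,\%$; and on a path that never caches the low-degree extension, the interpolants make up about half of the remaining $1 + 1 = 2$ units, which is where the "roughly half" expectation comes from.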