Closed · aszepieniec closed this 1 month ago
Comparing profiles for Fib 100, one finds almost identical timings except for the out-of-domain rows. On master:
├─out-of-domain rows 2.88s
on this branch:
├─out-of-domain rows 36.50s
The discrepancy comes from having to use barycentric evaluation instead of Horner evaluation (since we do not have the polynomials anymore). If barycentric evaluation cannot be made any faster, perhaps we should consider caching the polynomials on the caching code path after all.
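For context: Horner evaluation needs the polynomial in coefficient form, which is exactly what the cached interpolants provide, whereas barycentric evaluation works directly on the column's values over the evaluation domain. A minimal Horner sketch for contrast (plain f64 for illustration rather than the prover's field types; the function name is mine):

```rust
/// Horner's rule: evaluate a polynomial given by its coefficients
/// (constant term first) at the point `z`.
/// Illustrative sketch; the real code operates on finite-field elements.
fn horner_evaluate(coefficients: &[f64], z: f64) -> f64 {
    // Fold from the highest-degree coefficient down: acc = acc * z + c.
    coefficients.iter().rev().fold(0.0, |acc, &c| acc * z + c)
}
```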
IIRC our current (host machine) barycentric evaluation uses memory allocation. In the PR that introduces (host machine) barycentric evaluation, I proposed an alternative version that doesn't allocate. It's worth trying that one to see if it solves the problem.
Try implementing barycentric evaluation like this instead. It avoids the allocation and might give a meaningful speedup.
The inner for-loop in that implementation can even be parallelized with a rayon-scan higher-order primitive.
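A minimal sketch of what such an allocation-free version could look like (illustrative only: plain f64 instead of the prover's field types, and the barycentric weights are taken as an input, although for the structured domains used in practice they have a closed form):

```rust
/// Second-form barycentric evaluation in a single pass:
///
///     p(z) = ( Σ_i w_i · f_i / (z − d_i) ) / ( Σ_i w_i / (z − d_i) )
///
/// Both sums are accumulated on the fly, so no intermediate vector of
/// per-point terms is allocated.
fn barycentric_evaluate(domain: &[f64], weights: &[f64], values: &[f64], z: f64) -> f64 {
    debug_assert_eq!(domain.len(), weights.len());
    debug_assert_eq!(domain.len(), values.len());

    let mut numerator = 0.0;
    let mut denominator = 0.0;
    for i in 0..domain.len() {
        // At a domain point the interpolant's value is known directly.
        if z == domain[i] {
            return values[i];
        }
        let term = weights[i] / (z - domain[i]);
        numerator += term * values[i];
        denominator += term;
    }
    numerator / denominator
}
```

Both running sums are reductions over an associative addition, so the loop also lends itself to rayon's fold/reduce (or the scan primitive mentioned above) without reintroducing an allocation.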
Commit 96ca4bd5344729942c175bf680701513a5c76c1f uses barycentric evaluation instead of coset extrapolation. Timings taken on my laptop for 10 samples of prove_fib with fibonacci parameter 10000:
master | coset extrapolation | barycentric |
---|---|---|
560.5 s | 697.5 s | 595.6 s |
So barycentric is better than coset extrapolation, as expected. The case for caching the polynomials in the caching code path stands.
The mystery remains: why did the memory-efficient code path not get triggered during my benchmarks yesterday? (Or, if it did get triggered, why were the memory savings so small?)
Commit 7bb0ed1 uses a batching variant of barycentric evaluation. The relevant step (computing out-of-domain rows) seems to be faster than Horner, which requires cached polynomials. Unless I am mistaken, there is no reason to cache the polynomials for either code path.
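I don't know exactly how 7bb0ed1 batches, but one natural way to batch (sketched below with plain f64 and hypothetical names) is to exploit that the whole out-of-domain row is evaluated at the same point z: the factors w_i / (z − d_i) and the common denominator can be computed once and shared across all columns, so each column only costs a dot product.

```rust
/// Batched barycentric evaluation of one out-of-domain row (illustrative sketch).
/// Assumes `z` lies outside `domain`, which holds for out-of-domain points.
fn barycentric_evaluate_row(
    domain: &[f64],
    weights: &[f64],
    columns: &[Vec<f64>], // columns[c][i]: value of column c at domain point i
    z: f64,
) -> Vec<f64> {
    // Shared work: one pass over the domain, independent of the column count.
    let factors: Vec<f64> = domain
        .iter()
        .zip(weights)
        .map(|(&d, &w)| w / (z - d))
        .collect();
    let denominator: f64 = factors.iter().sum();

    // Per-column work: a dot product with the shared factors.
    columns
        .iter()
        .map(|column| {
            let numerator: f64 = column.iter().zip(&factors).map(|(&v, &f)| v * f).sum();
            numerator / denominator
        })
        .collect()
}
```

Compared to the single-pass version this reintroduces one allocation (the shared factors), but it amortizes the divisions over the number of columns, and the per-column dot products parallelize trivially.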
Some benchmarks, obtained on my laptop:
fib parameter | version | caching | memory | time |
---|---|---|---|---|
10 000 | Horner | on | 8.9 GB | 55 s |
40 000 | Horner | off | 18.5 GB | 328 s |
10 000 | naïve barycentric | off | 3.5 GB | 104 s |
40 000 | naïve barycentric | off | 13 GB | 467 s |
10 000 | batch barycentric | off | 3.5 GB | 91 s |
40 000 | batch barycentric | off | 13.1 GB | 437 s |
> (...) Some benchmarks, obtained on my laptop. (...)
On mjolnir, only measuring the "out-of-domain rows" step:
fib parameter | version | time |
---|---|---|
40 000 | Horner | 430.98 ms |
40 000 | batch barycentric | 535 ms |
So even though we see a small slowdown, I think it's worth it to save the RAM and drop the cached interpolants.
Edit: Unless we're using the interpolants elsewhere -- which I think we are!!
Proving with full caching, measured on mjolnir:
This PR:
Prove Fibonacci 40000 time: [29.757 s 29.974 s 30.182 s]
master:
Prove Fibonacci 40000 time: [29.104 s 29.216 s 29.317 s]
change: [-3.3020% -2.5289% -1.7164%] (p = 0.00 < 0.05)
So this PR introduces a 2.5 % slowdown on the fully cached code path. The reason is that some steps become more expensive when the interpolants have to be recalculated.
This PR drops the cached polynomials. They represent the same information as the original randomized trace, except in a different basis. If they are needed, the relevant randomized trace column is interpolated on the fly.
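To make the "same information, different basis" point concrete, here is an illustrative sketch (hypothetical names and plain f64 rather than the crate's actual types; the real code would interpolate a column with an inverse NTT over the trace domain instead of naive Lagrange interpolation):

```rust
/// Sketch of the access pattern after this PR: keep only the randomized trace
/// columns (evaluations over the trace domain) and re-derive a column's
/// interpolant on demand instead of caching all interpolants.
struct TraceTable {
    /// One entry per column: that column's values over the trace domain.
    columns: Vec<Vec<f64>>,
    /// The trace domain over which the columns are evaluated.
    domain: Vec<f64>,
}

impl TraceTable {
    /// Recompute the interpolant of a single column on the fly. The returned
    /// coefficient vector carries the same information as the column itself,
    /// expressed in the monomial basis instead of the evaluation basis.
    fn column_interpolant(&self, index: usize) -> Vec<f64> {
        interpolate(&self.domain, &self.columns[index])
    }
}

/// Naive O(n²) Lagrange interpolation, included only to keep the sketch
/// self-contained; `result[k]` is the coefficient of x^k.
fn interpolate(domain: &[f64], values: &[f64]) -> Vec<f64> {
    let n = domain.len();
    let mut result = vec![0.0; n];
    for i in 0..n {
        // Build the i-th Lagrange basis numerator ∏_{j≠i} (x − d_j) in
        // coefficient form, and its scaling factor ∏_{j≠i} (d_i − d_j).
        let mut basis = vec![0.0; n];
        basis[0] = 1.0;
        let mut degree = 0;
        let mut scale = 1.0;
        for j in 0..n {
            if j == i {
                continue;
            }
            // Multiply the basis polynomial by (x − d_j).
            for k in (1..=degree + 1).rev() {
                basis[k] = basis[k - 1] - domain[j] * basis[k];
            }
            basis[0] *= -domain[j];
            degree += 1;
            scale *= domain[i] - domain[j];
        }
        for k in 0..n {
            result[k] += values[i] * basis[k] / scale;
        }
    }
    result
}
```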
Results

I ran benchmark prove_fib to gauge the performance difference between master and this branch. The results are a little confusing.

Interpretation
The cached polynomials seem to account for only a small fraction of the memory cost, whereas I expected them to account for roughly half. The best explanation I can come up with is that the memory-efficient code path is never entered, in which case the low-degree extended trace is cached and that data is roughly 4x larger than the polynomials. But if the memory-efficient code path is never entered, then the slowdown is difficult to explain. (Profiles coming soon.)
If the memory-efficient code path is never entered in the first two rows, then why does the third crash? I would expect that if the low-degree extended trace is not cached, one saves 66% of the memory. So concretely, for growing padded table height I would expect the memory cost to grow until it selects the memory-efficient code path, at which point it drops before growing again, until it crashes. But that does not seem to be happening.
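For what it's worth, here is the back-of-the-envelope accounting behind the "roughly half" and "66%" figures, under my own assumptions (memory measured in units of one randomized trace table, and a low-degree extension that is 4x the trace, as the "4x larger than the polynomials" figure above suggests):

$$
\underbrace{1}_{\text{randomized trace}} \;+\; \underbrace{1}_{\text{interpolants}} \;+\; \underbrace{4}_{\text{cached LDE}} \;=\; 6 .
$$

Dropping only the interpolants saves $1/6 \approx 17\,\%$ of that total, which is a small fraction, matching the observation above; dropping the cached low-degree extension saves $4/6 \approx 66\,\%$; and on a path that never caches the low-degree extension, the interpolants make up about half of the remaining $1 + 1 = 2$ units, which is where the "roughly half" expectation comes from.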