TritonVM / triton-vm

Triton is a virtual machine that comes with Algebraic Execution Tables (AET) and Arithmetic Intermediate Representations (AIR) for use in combination with a STARK proof system.
https://triton-vm.org
Apache License 2.0

Drop Cached Polynomials #330

Closed · aszepieniec closed this 1 month ago

aszepieniec commented 1 month ago

This PR drops the cached polynomials. They represent the same information as the original randomized trace, except in a different basis. If they are needed, the relevant randomized trace column is interpolated on the fly.
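
For intuition, "interpolated on the fly" here means an inverse NTT over the column's domain. The following is a generic, self-contained sketch of that operation, not Triton VM's actual code path; the generic field bound and the `omega_inverse`/`n_inverse` parameters are assumptions made to keep it runnable:

```rust
use std::ops::{Add, Mul, Sub};

/// Forward number-theoretic transform over the subgroup {1, ω, ω², …} of
/// order values.len() (a power of two). Recursive radix-2 Cooley–Tukey.
fn ntt<F>(values: &[F], omega: F, one: F) -> Vec<F>
where
    F: Copy + Add<Output = F> + Sub<Output = F> + Mul<Output = F>,
{
    let n = values.len();
    if n == 1 {
        return values.to_vec();
    }
    let evens: Vec<F> = values.iter().step_by(2).copied().collect();
    let odds: Vec<F> = values.iter().skip(1).step_by(2).copied().collect();
    let evens = ntt(&evens, omega * omega, one);
    let odds = ntt(&odds, omega * omega, one);

    // Butterfly: X_k = E_k + ω^k·O_k and X_{k+n/2} = E_k − ω^k·O_k.
    let mut out = values.to_vec(); // placeholder contents, fully overwritten
    let mut w = one;
    for k in 0..n / 2 {
        let t = w * odds[k];
        out[k] = evens[k] + t;
        out[k + n / 2] = evens[k] - t;
        w = w * omega;
    }
    out
}

/// Interpolating a trace column on the fly: the column holds the polynomial's
/// evaluations over the subgroup, so running the transform with ω⁻¹ and
/// scaling by n⁻¹ recovers the coefficient form.
fn interpolate_column<F>(column: &[F], omega_inverse: F, n_inverse: F, one: F) -> Vec<F>
where
    F: Copy + Add<Output = F> + Sub<Output = F> + Mul<Output = F>,
{
    ntt(column, omega_inverse, one)
        .into_iter()
        .map(|coefficient| coefficient * n_inverse)
        .collect()
}
```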

Results

I ran the prove_fib benchmark to gauge the performance difference between master and this branch. The results are a little confusing.

| fib param | master | this branch |
|---:|---|---|
| 25000 | 20.5 GB / 478 s | 18 GB / 603 s |
| 50000 | 37.5 GB / 1013 s | 34.4 GB / 1281 s |
| 100000 | out of memory | out of memory |

Interpretation

The cached polynomials seem to account for only a small fraction of the memory cost, whereas I expected them to account for roughly half. The best explanation I can come up with is that the memory-efficient code path is never entered, in which case the low-degree extended trace is cached, and that data is roughly 4x larger than the polynomials. But if the memory-efficient code path is never entered, then the slowdown is difficult to explain. (Profiles coming soon.)

If the memory-efficient code path is never entered in the first two rows, then why does the third crash? I would expect that if the low-degree extended trace is not cached, one saves 66% of the memory. So concretely, for growing padded table height I would expect the memory cost to grow until the prover selects the memory-efficient code path, at which point the cost drops before growing again until it crashes. But that does not seem to be happening.
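
One plausible accounting that fits these numbers, taking the 4x figure above at face value: per column, the randomized trace, the cached polynomial, and the cached low-degree extension weigh roughly 1 + 1 + 4 = 6 units. Dropping only the polynomials then saves 1/6 ≈ 17%, roughly in line with the 8–12% observed above, while dropping the low-degree extension would save 4/6 ≈ 66%.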

aszepieniec commented 1 month ago

Comparing profiles for Fib 100, one finds almost identical timings except for the out-of-domain rows. On master:

```
├─out-of-domain rows                            2.88s
```

on this branch:

```
├─out-of-domain rows                           36.50s
```

The discrepancy comes from having to use barycentric evaluation instead of Horner evaluation, since we no longer have the polynomials. If barycentric evaluation cannot be made any faster, perhaps we should consider caching the polynomials on the caching code path after all.
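
For reference, Horner evaluation is the cheap option that the cached polynomials enable. A minimal generic sketch (the field bound and the `zero` parameter are placeholders for illustration, not the project's API):

```rust
use std::ops::{Add, Mul};

/// Horner evaluation: p(z) = c_0 + z·(c_1 + z·(c_2 + … )).
/// Requires the coefficient vector, i.e. the cached polynomial, but costs
/// only one multiplication and one addition per coefficient, no divisions.
fn horner_evaluate<F>(coefficients: &[F], z: F, zero: F) -> F
where
    F: Copy + Add<Output = F> + Mul<Output = F>,
{
    coefficients.iter().rev().fold(zero, |acc, &c| acc * z + c)
}
```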

Sword-Smith commented 1 month ago

IIRC our current (host machine) barycentric evaluation uses memory allocation. In the PR that introduces (host machine) barycentric evaluation, I proposed an alternative version that doesn't allocate. It's worth trying that one to see if it solves the problem.

Sword-Smith commented 1 month ago

> IIRC our current (host machine) barycentric evaluation uses memory allocation. In the PR that introduces (host machine) barycentric evaluation, I proposed an alternative version that doesn't allocate. It's worth trying that one to see if it solves the problem.

Try implementing barycentric evaluation like this instead. It avoids the allocation and might give a meaningful speedup.

The inner for-loop in that implementation can even be parallelized with a rayon scan higher-order primitive.
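
A minimal sketch of one such allocation-free variant, reconstructed from the roots-of-unity barycentric formula rather than taken from the linked PR; the generic field bound and the toy F17 type are assumptions made to keep the example self-contained and runnable:

```rust
use std::ops::{Add, Div, Mul, Sub};

/// Allocation-free barycentric evaluation over the domain of n-th roots of
/// unity {1, ω, …, ω^{n−1}}, where the barycentric weights simplify to
/// w_i ∝ x_i, giving p(z) = [Σ v_i·x_i/(z − x_i)] / [Σ x_i/(z − x_i)].
/// Both sums are kept as running fractions over the shared denominator
/// Π(z − x_i), so the loop uses no scratch vectors and the whole evaluation
/// needs a single field inversion at the end. Assumes z is outside the domain.
fn barycentric_evaluate<F>(values: &[F], omega: F, z: F, zero: F, one: F) -> F
where
    F: Copy + Add<Output = F> + Sub<Output = F> + Mul<Output = F> + Div<Output = F>,
{
    let mut x = one;    // x_i = ω^i
    let mut num = zero; // running numerator of Σ v_i·x_i/(z − x_i)
    let mut den = zero; // running numerator of Σ x_i/(z − x_i)
    let mut prod = one; // shared denominator Π(z − x_i) accumulated so far
    for &v in values {
        let b = z - x;
        num = num * b + v * x * prod;
        den = den * b + x * prod;
        prod = prod * b;
        x = x * omega;
    }
    num / den // the shared Π(z − x_i) cancels out
}

/// Toy prime field F_17, only to make the sketch executable.
#[derive(Clone, Copy, PartialEq, Debug)]
struct F17(u64);
impl Add for F17 { type Output = Self; fn add(self, r: Self) -> Self { F17((self.0 + r.0) % 17) } }
impl Sub for F17 { type Output = Self; fn sub(self, r: Self) -> Self { F17((17 + self.0 - r.0) % 17) } }
impl Mul for F17 { type Output = Self; fn mul(self, r: Self) -> Self { F17((self.0 * r.0) % 17) } }
impl Div for F17 {
    type Output = Self;
    fn div(self, r: Self) -> Self {
        // Fermat inversion: r^15 = r^{-1} in F_17.
        let mut inv = F17(1);
        for _ in 0..15 { inv = inv * r; }
        self * inv
    }
}

fn main() {
    // ω = 4 generates the order-4 subgroup {1, 4, 16, 13} of F_17.
    // The values are p(ω^i) for p(X) = X + 2; evaluate out of domain at z = 5.
    let values = [F17(3), F17(6), F17(1), F17(15)];
    let result = barycentric_evaluate(&values, F17(4), F17(5), F17(0), F17(1));
    assert_eq!(result, F17(7)); // p(5) = 7
}
```

Merging two partial fractions (N₁, D₁, P₁) and (N₂, D₂, P₂) amounts to (N₁·P₂ + N₂·P₁, D₁·P₂ + D₂·P₁, P₁·P₂), which is associative; that associativity is what a scan-style parallelization of the loop would exploit.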

aszepieniec commented 1 month ago

Commit 96ca4bd5344729942c175bf680701513a5c76c1f uses barycentric evaluation instead of coset extrapolation. Timings taken on my laptop for 10 samples of prove_fib with Fibonacci parameter 10000.

| master | coset extrapolation | barycentric |
|---|---|---|
| 560.5 s | 697.5 s | 595.6 s |

So barycentric is better than coset extrapolation, as expected. The case for caching the polynomials in the caching code path stands.

The mystery remains: why did the memory-efficient code path not get triggered during my benchmarks yesterday? (Or, if it did get triggered, why were the memory savings so modest?)

aszepieniec commented 1 month ago

Commit 7bb0ed1 uses a batching variant of barycentric evaluation. The relevant step (computing out-of-domain rows) now seems to be faster than Horner evaluation, which requires the cached polynomials. Unless I am mistaken, there is no reason to cache the polynomials for either code path.
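
One plausible shape for such a batching variant (a sketch under the same generic-field assumptions as the snippets above, not necessarily what the commit does): the coefficients x_i/(z − x_i) and their sum depend only on the evaluation domain and z, so they can be computed once and reused for every column.

```rust
use std::ops::{Add, Div, Mul, Sub};

/// Batch barycentric evaluation: evaluate many codewords (columns) at the
/// same out-of-domain point z. The coefficients c_i = x_i/(z − x_i) are
/// computed once; each column then costs one inner product and one division.
fn batch_barycentric_evaluate<F>(columns: &[Vec<F>], omega: F, z: F, zero: F, one: F) -> Vec<F>
where
    F: Copy + Add<Output = F> + Sub<Output = F> + Mul<Output = F> + Div<Output = F>,
{
    let n = columns.first().map(|column| column.len()).unwrap_or(0);

    // Shared precomputation: c_i and Σ c_i, independent of the column data.
    let mut coefficients = Vec::with_capacity(n);
    let mut coefficient_sum = zero;
    let mut x = one; // x_i = ω^i
    for _ in 0..n {
        let c = x / (z - x);
        coefficient_sum = coefficient_sum + c;
        coefficients.push(c);
        x = x * omega;
    }

    // Per column: p_j(z) = (Σ v_{j,i}·c_i) / Σ c_i.
    columns
        .iter()
        .map(|column| {
            let numerator = column
                .iter()
                .zip(&coefficients)
                .fold(zero, |acc, (&v, &c)| acc + v * c);
            numerator / coefficient_sum
        })
        .collect()
}
```

A production version would presumably batch-invert the n denominators instead of performing one division per node; the sketch keeps the per-node division for brevity.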

Some benchmarks, obtained on my laptop.

| fib parameter | version | caching | mem | time |
|---:|---|---|---|---|
| 10 000 | Horner | on | 8.9 GB | 55 s |
| 40 000 | Horner | off | 18.5 GB | 328 s |
| 10 000 | naïve barycentric | off | 3.5 GB | 104 s |
| 40 000 | naïve barycentric | off | 13 GB | 467 s |
| 10 000 | batch barycentric | off | 3.5 GB | 91 s |
| 40 000 | batch barycentric | off | 13.1 GB | 437 s |

Sword-Smith commented 1 month ago

> (...) Some benchmarks, obtained on my laptop. (...)

On mjolnir, only measuring the "out-of-domain rows" step:

| fib parameter | version | time |
|---:|---|---|
| 40 000 | Horner | 430.98 ms |
| 40 000 | batch barycentric | 535 ms |

So even though we see a small slowdown, I think it's worth it to save the RAM and drop the cached interpolants.

Edit: Unless we're using the interpolants elsewhere -- which I think we are!!

Sword-Smith commented 1 month ago

Proving with full caching

mjolnir

This PR:

```
Prove Fibonacci 40000   time:   [29.757 s 29.974 s 30.182 s]
```

master:

```
Prove Fibonacci 40000   time:   [29.104 s 29.216 s 29.317 s]
                        change: [-3.3020% -2.5289% -1.7164%] (p = 0.00 < 0.05)
```

So this PR introduces a 2.5% slowdown on the fully cached code path. The reason is that some steps become more expensive when the interpolants have to be recalculated.