TritonVM / triton-vm

Triton is a virtual machine that comes with Algebraic Execution Tables (AET) and Arithmetic Intermediate Representations (AIR) for use in combination with a STARK proof system.
https://triton-vm.org
Apache License 2.0
247 stars 37 forks source link

Use randomness more efficiently #335

Closed jan-ferdinand closed 2 weeks ago

jan-ferdinand commented 1 month ago

Remove the “randomized trace table,” which stored a copy of the entire trace interleaved with randomness.

jan-ferdinand commented 1 month ago

My measurements of proving the program fibonacci with input 10000 indicate that your remarks regarding growth of the memory requirements in the quotienting step are accurate. However, they are offset by savings in other places, making the suggested changes zero-sum with respect to memory consumption. Note that the total numbers are rounded very liberally, while the percentages provide higher precision.

with explicit randomizers (old)

operation abs [GB] relative
total 3.2 100.0%
compute_quotient_segments 1.5 46.7%
MainTable::new 0.8 24.8%
MainTable::extend 0.5 17.2%

without explicit randomizers (new)

operation abs [GB] relative
total 3.2 100.0%
compute_quotient_segments 1.8 57.0%
MainTable::new 0.4 12.4%
MainTable::extend 0.3 8.6%

That said, it is worth some additional effort to bring compute_quotient_segments back down again, netting a win.

jan-ferdinand commented 1 month ago

With the most recent changes, we now have:

basically “Workflow 2” (new new)

operation abs [GB] relative
total 2.5 100.0%
compute_quotient_segments 1.4 53.6%
MainTable::new 0.4 15.7%
MainTable::extend 0.3 10.9%

The total has dropped by 0.7 GB (~20%). The difference in compute_quotient_segments is a measurable 0.4GB (~12%) but does not account for the entire delta, which I find surprising. It could be that the call graph is somewhat obscured due to parallelization, which is now used slightly differently.

jan-ferdinand commented 3 weeks ago

With the most recent changes, we now have:

basically “Workflow 3” (new new new)

operation abs [GB] relative
total 1.8 100.0%
compute_quotient_segments 0.8 44.9%
MainTable::new 0.4 21.6%
MainTable::extend 0.3 15.1%

The total has dropped by another 0.7 GB (~28%), all of which seems to be from savings in compute_quotient_segments.

Additionally, my benchmarks indicate that runtime performance has improved a tad.

jan-ferdinand commented 2 weeks ago

do tell please, what's the magnitude of tad?

The runtime performance gain of the memory efficient path that comes with this PR[^tad] is 15%. :tada:

Profile at branch's base


Prove Fibonacci 10000   time:   [10.097 s 10.112 s 10.128 s]

### Prove Fibonacci 10000 111.46s #Reps Share Category ├─Fiat-Shamir: claim 100.63µs 11 0.00% (hash – 0.00%) ├─derive additional parameters 130.27µs 11 0.00% ├─main tables 33.10s 11 29.70% │ ├─create 3.61s 11 3.24% (gen – 73.94%) │ ├─pad 517.12ms 11 0.46% (gen – 10.58%) │ │ ├─pad original tables 256.78ms 11 0.23% │ │ └─fill degree-lowering table 260.23ms 11 0.23% │ ├─randomize trace 143.19ms 11 0.13% (gen – 2.93%) │ ├─LDE 12.18µs 11 0.00% (LDE – 0.00%) │ ├─Merkle tree 28.33s 11 25.42% (hash – 46.16%) │ │ ├─leafs 27.63s 11 24.78% │ │ └─Merkle tree 702.06ms 11 0.63% │ ├─Fiat-Shamir 340.85µs 11 0.00% (hash – 0.00%) │ └─extend 493.85ms 11 0.44% (gen – 10.11%) │ ├─initialize master table 253.06ms 11 0.23% │ ├─slice master table 95.73µs 11 0.00% │ ├─all tables 120.70ms 11 0.11% │ └─fill degree lowering table 119.82ms 11 0.11% ├─aux tables 32.00s 11 28.71% │ ├─randomize trace 119.03ms 11 0.11% (gen – 2.44%) │ ├─LDE 12.79µs 11 0.00% (LDE – 0.00%) │ ├─Merkle tree 31.88s 11 28.61% (hash – 51.95%) │ │ ├─leafs 31.22s 11 28.01% │ │ └─Merkle tree 660.74ms 11 0.59% │ └─Fiat-Shamir 1.72ms 11 0.00% (hash – 0.00%) ├─quotient calculation (just-in-time) 30.65s 11 27.50% │ ├─zero-initialization 592.32ms 11 0.53% │ ├─calculate quotients 26.37s 11 23.66% │ │ ├─LDE 18.16s 44 16.29% (LDE – 100.00%) │ │ └─AIR evaluation 8.22s 44 7.37% (AIR – 100.00%) │ │ ├─zerofier inverse 1.24s 44 1.11% │ │ └─evaluate AIR, compute quotient codeword 6.96s 44 6.24% │ └─segmentify 2.67s 11 2.40% ├─hash rows of quotient segments 497.55ms 11 0.45% (hash – 0.81%) ├─Merkle tree 657.56ms 11 0.59% (hash – 1.07%) ├─out-of-domain rows 2.75s 11 2.46% ├─Fiat-Shamir 1.00ms 11 0.00% (hash – 0.00%) ├─linear combination 2.81s 11 2.52% │ ├─main 133.07ms 11 0.12% (CC – 8.01%) │ ├─aux 91.31ms 11 0.08% (CC – 5.50%) │ └─quotient 1.23s 11 1.11% (CC – 74.27%) ├─DEEP 1.12s 11 1.01% │ ├─main&aux curr row 358.60ms 11 0.32% │ ├─main&aux next row 381.28ms 11 0.34% │ └─segmented quotient 382.62ms 11 0.34% ├─combined DEEP polynomial 203.16ms 11 0.18% │ └─sum 203.12ms 11 0.18% (CC – 12.23%) ├─FRI 2.30s 11 2.06% └─open trace leafs 3.83s 11 3.43%
### Categories hash 61.37s 55.06% LDE 18.16s 16.29% AIR 8.22s 7.37% gen 4.89s 4.38% CC 1.66s 1.49%
Clock frequency is 9869 Hz (100009 clock cycles / (111459 ms / 11 iterations)) Optimal clock frequency is 12935 Hz (131072 padded height / (111459 ms / 11 iterations)) FRI domain length is 2^20

Profile at branch's tip


Prove Fibonacci 10000   time:   [8.5598 s 8.5781 s 8.5966 s]
                        change: [-15.388% -15.170% -14.928%] (p = 0.00 < 0.05)
                        Performance has improved.

### Prove Fibonacci 10000 94.60s #Reps Share Category ├─Fiat-Shamir: claim 80.16µs 11 0.00% (hash – 0.00%) ├─derive additional parameters 125.57µs 11 0.00% ├─main tables 29.07s 11 30.73% │ ├─create 451.81ms 11 0.48% (gen – 36.97%) │ ├─pad 435.43ms 11 0.46% (gen – 35.63%) │ │ ├─pad original tables 240.52ms 11 0.25% │ │ └─fill degree-lowering table 194.80ms 11 0.21% │ ├─Merkle tree 27.85s 11 29.44% │ │ ├─leafs 27.14s 11 28.69% │ │ │ ├─LDE 14.96s 33 15.81% (LDE – 41.64%) │ │ │ └─hash rows 5.45s 33 5.76% (hash – 43.73%) │ │ └─Merkle tree 700.77ms 11 0.74% (hash – 5.63%) │ ├─Fiat-Shamir 336.35µs 11 0.00% (hash – 0.00%) │ └─extend 335.01ms 11 0.35% (gen – 27.41%) │ ├─initialize master table 150.29ms 11 0.16% │ ├─slice master table 62.92µs 11 0.00% │ ├─all tables 97.81ms 11 0.10% │ └─fill degree lowering table 86.70ms 11 0.09% ├─aux tables 31.09s 11 32.87% │ ├─Merkle tree 31.09s 11 32.86% │ │ ├─leafs 30.42s 11 32.16% │ │ │ ├─LDE 9.61s 11 10.15% (LDE – 26.74%) │ │ │ └─hash rows 4.51s 11 4.77% (hash – 36.21%) │ │ └─Merkle tree 665.64ms 11 0.70% (hash – 5.35%) │ └─Fiat-Shamir 1.83ms 11 0.00% (hash – 0.01%) ├─quotient calculation (just-in-time) 22.62s 11 23.92% │ ├─zero-initialization 332.69ms 11 0.35% │ ├─fetch trace randomizers 3.97ms 11 0.00% │ ├─poly interpolate 696.45ms 11 0.74% (LDE – 1.94%) │ ├─calculate quotients 18.27s 11 19.32% │ │ ├─poly evaluate 4.71s 88 4.98% (LDE – 13.11%) │ │ ├─trace randomizers 5.28s 88 5.58% (LDE – 14.69%) │ │ └─AIR evaluation 8.28s 88 8.76% (AIR – 100.00%) │ │ ├─zerofier inverse 1.30s 88 1.37% │ │ └─evaluate AIR, compute quotient codeword 6.96s 88 7.36% │ ├─segmentify 2.12s 11 2.24% │ └─restore original trace 676.69ms 11 0.72% (LDE – 1.88%) ├─hash rows of quotient segments 469.03ms 11 0.50% (hash – 3.77%) ├─Merkle tree 659.00ms 11 0.70% (hash – 5.29%) ├─out-of-domain rows 1.38s 11 1.46% ├─Fiat-Shamir 1.01ms 11 0.00% (hash – 0.01%) ├─linear combination 2.69s 11 2.84% │ ├─main 220.19ms 11 0.23% (CC – 12.08%) │ ├─aux 193.60ms 11 0.20% (CC – 10.62%) │ └─quotient 1.19s 11 1.26% (CC – 65.39%) ├─DEEP 1.11s 11 1.18% │ ├─main&aux curr row 354.48ms 11 0.37% │ ├─main&aux next row 369.62ms 11 0.39% │ └─segmented quotient 387.42ms 11 0.41% ├─combined DEEP polynomial 217.43ms 11 0.23% │ └─sum 217.39ms 11 0.23% (CC – 11.92%) ├─FRI 2.34s 11 2.47% └─open trace leafs 1.91s 11 2.02% └─recompute rows 1.91s 22 2.02%
### Categories LDE 35.93s 37.98% hash 12.45s 13.16% AIR 8.28s 8.76% CC 1.82s 1.93% gen 1.22s 1.29%
Clock frequency is 11629 Hz (100009 clock cycles / (94596 ms / 11 iterations)) Optimal clock frequency is 15241 Hz (131072 padded height / (94596 ms / 11 iterations)) FRI domain length is 2^20

A substantial amount of the savings are from witness generation. This makes sense, as the tables held in memory are now half as large, and we now only generate exactly as much randomness as required for ZK.

Note that the large shift away from the category “hash” is due to now-correct annotations.

[^tad]: i.e., not exactly the “tad” from above, which was talking about the performance improvements of the latest commit, which are less relevant in the grand scheme of this PR

jan-ferdinand commented 2 weeks ago

The runtime performance gain of the caching path that comes with this PR is 10%. :tada:

Profile at branch's base

Prove Fibonacci 10000   time:   [6.5011 s 6.5121 s 6.5221 s]

### Prove Fibonacci 10000 72.05s #Reps Share Category ├─Fiat-Shamir: claim 111.03µs 11 0.00% (hash – 0.00%) ├─derive additional parameters 121.68µs 11 0.00% ├─main tables 27.90s 11 38.72% │ ├─create 3.53s 11 4.90% (gen – 73.90%) │ ├─pad 489.70ms 11 0.68% (gen – 10.25%) │ │ ├─pad original tables 237.87ms 11 0.33% │ │ └─fill degree-lowering table 251.72ms 11 0.35% │ ├─randomize trace 141.57ms 11 0.20% (gen – 2.96%) │ ├─LDE 15.69s 11 21.78% (LDE – 51.19%) │ │ ├─polynomial zero-initialization 35.95µs 11 0.00% │ │ ├─interpolation 970.66ms 11 1.35% │ │ ├─resize 1.26s 11 1.75% │ │ ├─evaluation 13.29s 11 18.45% │ │ └─memoize 8.03µs 11 0.00% │ ├─Merkle tree 7.54s 11 10.47% (hash – 50.06%) │ │ ├─leafs 6.87s 11 9.53% │ │ └─Merkle tree 669.20ms 11 0.93% │ ├─Fiat-Shamir 334.26µs 11 0.00% (hash – 0.00%) │ └─extend 499.51ms 11 0.69% (gen – 10.45%) │ ├─initialize master table 256.70ms 11 0.36% │ ├─slice master table 65.68µs 11 0.00% │ ├─all tables 120.33ms 11 0.17% │ └─fill degree lowering table 122.27ms 11 0.17% ├─aux tables 17.84s 11 24.76% │ ├─randomize trace 116.29ms 11 0.16% (gen – 2.43%) │ ├─LDE 11.36s 11 15.77% (LDE – 37.07%) │ │ ├─polynomial zero-initialization 21.19µs 11 0.00% │ │ ├─interpolation 1.15s 11 1.59% │ │ ├─resize 880.49ms 11 1.22% │ │ ├─evaluation 9.34s 11 12.96% │ │ └─memoize 9.10µs 11 0.00% │ ├─Merkle tree 6.36s 11 8.83% (hash – 42.23%) │ │ ├─leafs 5.69s 11 7.90% │ │ └─Merkle tree 667.54ms 11 0.93% │ └─Fiat-Shamir 1.64ms 11 0.00% (hash – 0.01%) ├─quotient calculation (cached) 6.88s 11 9.54% (CC – 80.43%) │ ├─zerofier inverse 992.21ms 11 1.38% │ └─evaluate AIR, compute quotient codeword 5.88s 11 8.17% ├─quotient LDE 3.60s 11 4.99% (LDE – 11.74%) ├─hash rows of quotient segments 493.16ms 11 0.68% (hash – 3.27%) ├─Merkle tree 664.39ms 11 0.92% (hash – 4.41%) ├─out-of-domain rows 2.74s 11 3.80% ├─Fiat-Shamir 1.02ms 11 0.00% (hash – 0.01%) ├─linear combination 2.83s 11 3.92% │ ├─main 134.71ms 11 0.19% (CC – 1.58%) │ ├─aux 93.65ms 11 0.13% (CC – 1.10%) │ └─quotient 1.24s 11 1.72% (CC – 14.50%) ├─DEEP 1.14s 11 1.58% │ ├─main&aux curr row 360.76ms 11 0.50% │ ├─main&aux next row 384.33ms 11 0.53% │ └─segmented quotient 390.91ms 11 0.54% ├─combined DEEP polynomial 205.40ms 11 0.29% │ └─sum 205.36ms 11 0.29% (CC – 2.40%) ├─FRI 2.40s 11 3.33% └─open trace leafs 7.62ms 11 0.01%
### Categories LDE 30.65s 42.55% hash 15.06s 20.90% CC 8.55s 11.87% gen 4.78s 6.63%
Clock frequency is 15269 Hz (100009 clock cycles / (72046 ms / 11 iterations)) Optimal clock frequency is 20012 Hz (131072 padded height / (72046 ms / 11 iterations)) FRI domain length is 2^20

Profile at branch's tip

Prove Fibonacci 10000   time:   [5.8572 s 5.8721 s 5.8835 s]
                        change: [-10.083% -9.8281% -9.5672%] (p = 0.00 < 0.05)
                        Performance has improved.

### Prove Fibonacci 10000 64.80s #Reps Share Category ├─Fiat-Shamir: claim 117.95µs 11 0.00% (hash – 0.00%) ├─derive additional parameters 136.39µs 11 0.00% ├─main tables 23.77s 11 36.68% │ ├─create 443.93ms 11 0.69% (gen – 36.74%) │ ├─pad 437.82ms 11 0.68% (gen – 36.24%) │ │ ├─pad original tables 238.96ms 11 0.37% │ │ └─fill degree-lowering table 198.74ms 11 0.31% │ ├─LDE 15.05s 11 23.23% (LDE – 51.17%) │ │ ├─polynomial zero-initialization 25.70µs 11 0.00% │ │ ├─interpolation 637.42ms 11 0.98% │ │ ├─resize 1.25s 11 1.94% │ │ ├─evaluation 13.16s 11 20.31% │ │ └─memoize 8.25µs 11 0.00% │ ├─Merkle tree 7.51s 11 11.58% │ │ ├─leafs 6.84s 11 10.56% │ │ │ └─hash rows 6.84s 11 10.56% (hash – 45.82%) │ │ └─Merkle tree 661.81ms 11 1.02% (hash – 4.43%) │ ├─Fiat-Shamir 351.57µs 11 0.00% (hash – 0.00%) │ └─extend 326.39ms 11 0.50% (gen – 27.02%) │ ├─initialize master table 148.76ms 11 0.23% │ ├─slice master table 61.77µs 11 0.00% │ ├─all tables 93.12ms 11 0.14% │ └─fill degree lowering table 84.30ms 11 0.13% ├─aux tables 16.89s 11 26.07% │ ├─LDE 10.61s 11 16.37% (LDE – 36.06%) │ │ ├─polynomial zero-initialization 18.20µs 11 0.00% │ │ ├─interpolation 513.83ms 11 0.79% │ │ ├─resize 879.87ms 11 1.36% │ │ ├─evaluation 9.21s 11 14.22% │ │ └─memoize 7.91µs 11 0.00% │ ├─Merkle tree 6.28s 11 9.69% │ │ ├─leafs 5.61s 11 8.66% │ │ │ └─hash rows 5.61s 11 8.66% (hash – 37.59%) │ │ └─Merkle tree 665.64ms 11 1.03% (hash – 4.46%) │ └─Fiat-Shamir 1.70ms 11 0.00% (hash – 0.01%) ├─quotient calculation (cached) 6.80s 11 10.50% (CC – 79.12%) │ ├─zerofier inverse 987.79ms 11 1.52% │ └─evaluate AIR, compute quotient codeword 5.81s 11 8.97% ├─quotient LDE 3.76s 11 5.80% (LDE – 12.77%) ├─hash rows of quotient segments 495.65ms 11 0.76% (hash – 3.32%) ├─Merkle tree 650.83ms 11 1.00% (hash – 4.36%) ├─out-of-domain rows 1.31s 11 2.02% ├─Fiat-Shamir 1.00ms 11 0.00% (hash – 0.01%) ├─linear combination 2.73s 11 4.22% │ ├─main 209.11ms 11 0.32% (CC – 2.43%) │ ├─aux 184.97ms 11 0.29% (CC – 2.15%) │ └─quotient 1.20s 11 1.85% (CC – 13.93%) ├─DEEP 1.09s 11 1.69% │ ├─main&aux curr row 350.01ms 11 0.54% │ ├─main&aux next row 370.70ms 11 0.57% │ └─segmented quotient 372.15ms 11 0.57% ├─combined DEEP polynomial 202.92ms 11 0.31% │ └─sum 202.88ms 11 0.31% (CC – 2.36%) ├─FRI 2.25s 11 3.47% └─open trace leafs 7.42ms 11 0.01%
### Categories LDE 29.41s 45.39% hash 14.93s 23.04% CC 8.60s 13.26% gen 1.21s 1.86%
Clock frequency is 16976 Hz (100009 clock cycles / (64800 ms / 11 iterations)) Optimal clock frequency is 22249 Hz (131072 padded height / (64800 ms / 11 iterations)) FRI domain length is 2^20