Use randomness more efficiently

jan-ferdinand commented 1 month ago

Remove the “randomized trace table,” which stored a copy of the entire trace interleaved with randomness.

jan-ferdinand commented 1 month ago

My measurements of proving the program fibonacci with input 10000 indicate that your remarks regarding growth of the memory requirements in the quotienting step are accurate. However, they are offset by savings in other places, making the suggested changes zero-sum with respect to memory consumption. Note that the total numbers are rounded very liberally, while the percentages provide higher precision.

with explicit randomizers (old)

operation	abs [GB]	relative
total	3.2	100.0%
compute_quotient_segments	1.5	46.7%
MainTable::new	0.8	24.8%
MainTable::extend	0.5	17.2%

without explicit randomizers (new)

operation	abs [GB]	relative
total	3.2	100.0%
compute_quotient_segments	1.8	57.0%
MainTable::new	0.4	12.4%
MainTable::extend	0.3	8.6%

That said, it is worth some additional effort to bring compute_quotient_segments back down again, netting a win.

jan-ferdinand commented 1 month ago

With the most recent changes, we now have:

basically “Workflow 2” (new new)

operation	abs [GB]	relative
total	2.5	100.0%
compute_quotient_segments	1.4	53.6%
MainTable::new	0.4	15.7%
MainTable::extend	0.3	10.9%

The total has dropped by 0.7 GB (~20%). The difference in compute_quotient_segments is a measurable 0.4GB (~12%) but does not account for the entire delta, which I find surprising. It could be that the call graph is somewhat obscured due to parallelization, which is now used slightly differently.

jan-ferdinand commented 3 weeks ago

With the most recent changes, we now have:

basically “Workflow 3” (new new new)

operation	abs [GB]	relative
total	1.8	100.0%
compute_quotient_segments	0.8	44.9%
MainTable::new	0.4	21.6%
MainTable::extend	0.3	15.1%

The total has dropped by another 0.7 GB (~28%), all of which seems to be from savings in compute_quotient_segments.

Additionally, my benchmarks indicate that runtime performance has improved a tad.

jan-ferdinand commented 2 weeks ago

do tell please, what's the magnitude of tad?

The runtime performance gain of the memory efficient path that comes with this PR[^tad] is 15%. :tada:

Profile at branch's base


Prove Fibonacci 10000   time:   [10.097 s 10.112 s 10.128 s]


### Prove Fibonacci 10000                         111.46s    #Reps   Share  Category                                                   
├─Fiat-Shamir: claim                              100.63µs      11   0.00%  (hash –  0.00%)                                            
├─derive additional parameters                    130.27µs      11   0.00%                                                             
├─main tables                                      33.10s       11  29.70%                                                             
│ ├─create                                          3.61s       11   3.24%  (gen  – 73.94%)                                            
│ ├─pad                                           517.12ms      11   0.46%  (gen  – 10.58%)                                            
│ │ ├─pad original tables                         256.78ms      11   0.23%                                                             
│ │ └─fill degree-lowering table                  260.23ms      11   0.23%                                                             
│ ├─randomize trace                               143.19ms      11   0.13%  (gen  –  2.93%)                                            
│ ├─LDE                                            12.18µs      11   0.00%  (LDE  –  0.00%)                                            
│ ├─Merkle tree                                    28.33s       11  25.42%  (hash – 46.16%)                                            
│ │ ├─leafs                                        27.63s       11  24.78%                                                             
│ │ └─Merkle tree                                 702.06ms      11   0.63%                                                             
│ ├─Fiat-Shamir                                   340.85µs      11   0.00%  (hash –  0.00%)                                            
│ └─extend                                        493.85ms      11   0.44%  (gen  – 10.11%)                                            
│   ├─initialize master table                     253.06ms      11   0.23%                                                             
│   ├─slice master table                           95.73µs      11   0.00%                                                             
│   ├─all tables                                  120.70ms      11   0.11%                                                             
│   └─fill degree lowering table                  119.82ms      11   0.11%                                                             
├─aux tables                                       32.00s       11  28.71%                                                             
│ ├─randomize trace                               119.03ms      11   0.11%  (gen  –  2.44%)                                            
│ ├─LDE                                            12.79µs      11   0.00%  (LDE  –  0.00%)                                            
│ ├─Merkle tree                                    31.88s       11  28.61%  (hash – 51.95%)                                            
│ │ ├─leafs                                        31.22s       11  28.01%                                                             
│ │ └─Merkle tree                                 660.74ms      11   0.59%                                                             
│ └─Fiat-Shamir                                     1.72ms      11   0.00%  (hash –  0.00%)                                            
├─quotient calculation (just-in-time)              30.65s       11  27.50%                                                             
│ ├─zero-initialization                           592.32ms      11   0.53%                                                             
│ ├─calculate quotients                            26.37s       11  23.66%                                                             
│ │ ├─LDE                                          18.16s       44  16.29%  (LDE  – 100.00%)                                           
│ │ └─AIR evaluation                                8.22s       44   7.37%  (AIR  – 100.00%)                                           
│ │   ├─zerofier inverse                            1.24s       44   1.11%                                                             
│ │   └─evaluate AIR, compute quotient codeword     6.96s       44   6.24%                                                             
│ └─segmentify                                      2.67s       11   2.40%                                                             
├─hash rows of quotient segments                  497.55ms      11   0.45%  (hash –  0.81%)                                            
├─Merkle tree                                     657.56ms      11   0.59%  (hash –  1.07%)                                            
├─out-of-domain rows                                2.75s       11   2.46%  
├─Fiat-Shamir                                       1.00ms      11   0.00%  (hash –  0.00%)
├─linear combination                                2.81s       11   2.52%  
│ ├─main                                          133.07ms      11   0.12%  (CC   –  8.01%)
│ ├─aux                                            91.31ms      11   0.08%  (CC   –  5.50%)
│ └─quotient                                        1.23s       11   1.11%  (CC   – 74.27%)
├─DEEP                                              1.12s       11   1.01%  
│ ├─main&aux curr row                             358.60ms      11   0.32%  
│ ├─main&aux next row                             381.28ms      11   0.34%  
│ └─segmented quotient                            382.62ms      11   0.34%  
├─combined DEEP polynomial                        203.16ms      11   0.18%  
│ └─sum                                           203.12ms      11   0.18%  (CC   – 12.23%)
├─FRI                                               2.30s       11   2.06%  
└─open trace leafs                                  3.83s       11   3.43%  


### Categories
hash    61.37s  55.06%
LDE     18.16s  16.29%
AIR      8.22s   7.37%
gen      4.89s   4.38%
CC       1.66s   1.49%


Clock frequency is 9869 Hz (100009 clock cycles / (111459 ms / 11 iterations))
Optimal clock frequency is 12935 Hz (131072 padded height / (111459 ms / 11 iterations))
FRI domain length is 2^20

Profile at branch's tip


Prove Fibonacci 10000   time:   [8.5598 s 8.5781 s 8.5966 s]
                        change: [-15.388% -15.170% -14.928%] (p = 0.00 < 0.05)
                        Performance has improved.


### Prove Fibonacci 10000                          94.60s    #Reps   Share  Category
├─Fiat-Shamir: claim                               80.16µs      11   0.00%  (hash –  0.00%)
├─derive additional parameters                    125.57µs      11   0.00%  
├─main tables                                      29.07s       11  30.73%  
│ ├─create                                        451.81ms      11   0.48%  (gen  – 36.97%)
│ ├─pad                                           435.43ms      11   0.46%  (gen  – 35.63%)
│ │ ├─pad original tables                         240.52ms      11   0.25%  
│ │ └─fill degree-lowering table                  194.80ms      11   0.21%  
│ ├─Merkle tree                                    27.85s       11  29.44%  
│ │ ├─leafs                                        27.14s       11  28.69%  
│ │ │ ├─LDE                                        14.96s       33  15.81%  (LDE  – 41.64%)
│ │ │ └─hash rows                                   5.45s       33   5.76%  (hash – 43.73%)
│ │ └─Merkle tree                                 700.77ms      11   0.74%  (hash –  5.63%)
│ ├─Fiat-Shamir                                   336.35µs      11   0.00%  (hash –  0.00%)
│ └─extend                                        335.01ms      11   0.35%  (gen  – 27.41%)
│   ├─initialize master table                     150.29ms      11   0.16%  
│   ├─slice master table                           62.92µs      11   0.00%  
│   ├─all tables                                   97.81ms      11   0.10%  
│   └─fill degree lowering table                   86.70ms      11   0.09%  
├─aux tables                                       31.09s       11  32.87%  
│ ├─Merkle tree                                    31.09s       11  32.86%  
│ │ ├─leafs                                        30.42s       11  32.16%  
│ │ │ ├─LDE                                         9.61s       11  10.15%  (LDE  – 26.74%)
│ │ │ └─hash rows                                   4.51s       11   4.77%  (hash – 36.21%)
│ │ └─Merkle tree                                 665.64ms      11   0.70%  (hash –  5.35%)
│ └─Fiat-Shamir                                     1.83ms      11   0.00%  (hash –  0.01%)
├─quotient calculation (just-in-time)              22.62s       11  23.92%  
│ ├─zero-initialization                           332.69ms      11   0.35%  
│ ├─fetch trace randomizers                         3.97ms      11   0.00%  
│ ├─poly interpolate                              696.45ms      11   0.74%  (LDE  –  1.94%)
│ ├─calculate quotients                            18.27s       11  19.32%  
│ │ ├─poly evaluate                                 4.71s       88   4.98%  (LDE  – 13.11%)
│ │ ├─trace randomizers                             5.28s       88   5.58%  (LDE  – 14.69%)
│ │ └─AIR evaluation                                8.28s       88   8.76%  (AIR  – 100.00%)
│ │   ├─zerofier inverse                            1.30s       88   1.37%  
│ │   └─evaluate AIR, compute quotient codeword     6.96s       88   7.36%  
│ ├─segmentify                                      2.12s       11   2.24%  
│ └─restore original trace                        676.69ms      11   0.72%  (LDE  –  1.88%)
├─hash rows of quotient segments                  469.03ms      11   0.50%  (hash –  3.77%)
├─Merkle tree                                     659.00ms      11   0.70%  (hash –  5.29%)
├─out-of-domain rows                                1.38s       11   1.46%  
├─Fiat-Shamir                                       1.01ms      11   0.00%  (hash –  0.01%)
├─linear combination                                2.69s       11   2.84%  
│ ├─main                                          220.19ms      11   0.23%  (CC   – 12.08%)
│ ├─aux                                           193.60ms      11   0.20%  (CC   – 10.62%)
│ └─quotient                                        1.19s       11   1.26%  (CC   – 65.39%)
├─DEEP                                              1.11s       11   1.18%  
│ ├─main&aux curr row                             354.48ms      11   0.37%  
│ ├─main&aux next row                             369.62ms      11   0.39%  
│ └─segmented quotient                            387.42ms      11   0.41%  
├─combined DEEP polynomial                        217.43ms      11   0.23%  
│ └─sum                                           217.39ms      11   0.23%  (CC   – 11.92%)
├─FRI                                               2.34s       11   2.47%  
└─open trace leafs                                  1.91s       11   2.02%  
  └─recompute rows                                  1.91s       22   2.02%  


### Categories
LDE     35.93s  37.98%
hash    12.45s  13.16%
AIR      8.28s   8.76%
CC       1.82s   1.93%
gen      1.22s   1.29%


Clock frequency is 11629 Hz (100009 clock cycles / (94596 ms / 11 iterations))
Optimal clock frequency is 15241 Hz (131072 padded height / (94596 ms / 11 iterations))
FRI domain length is 2^20

A substantial amount of the savings are from witness generation. This makes sense, as the tables held in memory are now half as large, and we now only generate exactly as much randomness as required for ZK.

Note that the large shift away from the category “hash” is due to now-correct annotations.

[^tad]: i.e., not exactly the “tad” from above, which was talking about the performance improvements of the latest commit, which are less relevant in the grand scheme of this PR

jan-ferdinand commented 2 weeks ago

The runtime performance gain of the caching path that comes with this PR is 10%. :tada:

Profile at branch's base

Prove Fibonacci 10000   time:   [6.5011 s 6.5121 s 6.5221 s]


### Prove Fibonacci 10000                      72.05s    #Reps   Share  Category                                                       
├─Fiat-Shamir: claim                          111.03µs      11   0.00%  (hash –  0.00%)                                                
├─derive additional parameters                121.68µs      11   0.00%  
├─main tables                                  27.90s       11  38.72%  
│ ├─create                                      3.53s       11   4.90%  (gen  – 73.90%)
│ ├─pad                                       489.70ms      11   0.68%  (gen  – 10.25%)
│ │ ├─pad original tables                     237.87ms      11   0.33%  
│ │ └─fill degree-lowering table              251.72ms      11   0.35%  
│ ├─randomize trace                           141.57ms      11   0.20%  (gen  –  2.96%)
│ ├─LDE                                        15.69s       11  21.78%  (LDE  – 51.19%)
│ │ ├─polynomial zero-initialization           35.95µs      11   0.00%  
│ │ ├─interpolation                           970.66ms      11   1.35%  
│ │ ├─resize                                    1.26s       11   1.75%  
│ │ ├─evaluation                               13.29s       11  18.45%  
│ │ └─memoize                                   8.03µs      11   0.00%  
│ ├─Merkle tree                                 7.54s       11  10.47%  (hash – 50.06%)
│ │ ├─leafs                                     6.87s       11   9.53%  
│ │ └─Merkle tree                             669.20ms      11   0.93%  
│ ├─Fiat-Shamir                               334.26µs      11   0.00%  (hash –  0.00%)
│ └─extend                                    499.51ms      11   0.69%  (gen  – 10.45%)
│   ├─initialize master table                 256.70ms      11   0.36%  
│   ├─slice master table                       65.68µs      11   0.00%  
│   ├─all tables                              120.33ms      11   0.17%  
│   └─fill degree lowering table              122.27ms      11   0.17%  
├─aux tables                                   17.84s       11  24.76%  
│ ├─randomize trace                           116.29ms      11   0.16%  (gen  –  2.43%)
│ ├─LDE                                        11.36s       11  15.77%  (LDE  – 37.07%)
│ │ ├─polynomial zero-initialization           21.19µs      11   0.00%  
│ │ ├─interpolation                             1.15s       11   1.59%  
│ │ ├─resize                                  880.49ms      11   1.22%  
│ │ ├─evaluation                                9.34s       11  12.96%  
│ │ └─memoize                                   9.10µs      11   0.00%  
│ ├─Merkle tree                                 6.36s       11   8.83%  (hash – 42.23%)
│ │ ├─leafs                                     5.69s       11   7.90%  
│ │ └─Merkle tree                             667.54ms      11   0.93%  
│ └─Fiat-Shamir                                 1.64ms      11   0.00%  (hash –  0.01%)
├─quotient calculation (cached)                 6.88s       11   9.54%  (CC   – 80.43%)
│ ├─zerofier inverse                          992.21ms      11   1.38%  
│ └─evaluate AIR, compute quotient codeword     5.88s       11   8.17%  
├─quotient LDE                                  3.60s       11   4.99%  (LDE  – 11.74%)
├─hash rows of quotient segments              493.16ms      11   0.68%  (hash –  3.27%)
├─Merkle tree                                 664.39ms      11   0.92%  (hash –  4.41%)
├─out-of-domain rows                            2.74s       11   3.80%  
├─Fiat-Shamir                                   1.02ms      11   0.00%  (hash –  0.01%)
├─linear combination                            2.83s       11   3.92%  
│ ├─main                                      134.71ms      11   0.19%  (CC   –  1.58%)
│ ├─aux                                        93.65ms      11   0.13%  (CC   –  1.10%)
│ └─quotient                                    1.24s       11   1.72%  (CC   – 14.50%)
├─DEEP                                          1.14s       11   1.58%  
│ ├─main&aux curr row                         360.76ms      11   0.50%  
│ ├─main&aux next row                         384.33ms      11   0.53%  
│ └─segmented quotient                        390.91ms      11   0.54%  
├─combined DEEP polynomial                    205.40ms      11   0.29%  
│ └─sum                                       205.36ms      11   0.29%  (CC   –  2.40%)
├─FRI                                           2.40s       11   3.33%  
└─open trace leafs                              7.62ms      11   0.01%  


### Categories
LDE     30.65s  42.55%
hash    15.06s  20.90%
CC       8.55s  11.87%
gen      4.78s   6.63%


Clock frequency is 15269 Hz (100009 clock cycles / (72046 ms / 11 iterations))
Optimal clock frequency is 20012 Hz (131072 padded height / (72046 ms / 11 iterations))
FRI domain length is 2^20

Profile at branch's tip

Prove Fibonacci 10000   time:   [5.8572 s 5.8721 s 5.8835 s]
                        change: [-10.083% -9.8281% -9.5672%] (p = 0.00 < 0.05)
                        Performance has improved.


### Prove Fibonacci 10000                      64.80s    #Reps   Share  Category
├─Fiat-Shamir: claim                          117.95µs      11   0.00%  (hash –  0.00%)
├─derive additional parameters                136.39µs      11   0.00%  
├─main tables                                  23.77s       11  36.68%  
│ ├─create                                    443.93ms      11   0.69%  (gen  – 36.74%)
│ ├─pad                                       437.82ms      11   0.68%  (gen  – 36.24%)
│ │ ├─pad original tables                     238.96ms      11   0.37%  
│ │ └─fill degree-lowering table              198.74ms      11   0.31%  
│ ├─LDE                                        15.05s       11  23.23%  (LDE  – 51.17%)
│ │ ├─polynomial zero-initialization           25.70µs      11   0.00%  
│ │ ├─interpolation                           637.42ms      11   0.98%  
│ │ ├─resize                                    1.25s       11   1.94%  
│ │ ├─evaluation                               13.16s       11  20.31%  
│ │ └─memoize                                   8.25µs      11   0.00%  
│ ├─Merkle tree                                 7.51s       11  11.58%  
│ │ ├─leafs                                     6.84s       11  10.56%  
│ │ │ └─hash rows                               6.84s       11  10.56%  (hash – 45.82%)
│ │ └─Merkle tree                             661.81ms      11   1.02%  (hash –  4.43%)
│ ├─Fiat-Shamir                               351.57µs      11   0.00%  (hash –  0.00%)
│ └─extend                                    326.39ms      11   0.50%  (gen  – 27.02%)
│   ├─initialize master table                 148.76ms      11   0.23%  
│   ├─slice master table                       61.77µs      11   0.00%  
│   ├─all tables                               93.12ms      11   0.14%  
│   └─fill degree lowering table               84.30ms      11   0.13%  
├─aux tables                                   16.89s       11  26.07%  
│ ├─LDE                                        10.61s       11  16.37%  (LDE  – 36.06%)
│ │ ├─polynomial zero-initialization           18.20µs      11   0.00%  
│ │ ├─interpolation                           513.83ms      11   0.79%  
│ │ ├─resize                                  879.87ms      11   1.36%  
│ │ ├─evaluation                                9.21s       11  14.22%  
│ │ └─memoize                                   7.91µs      11   0.00%  
│ ├─Merkle tree                                 6.28s       11   9.69%  
│ │ ├─leafs                                     5.61s       11   8.66%  
│ │ │ └─hash rows                               5.61s       11   8.66%  (hash – 37.59%)
│ │ └─Merkle tree                             665.64ms      11   1.03%  (hash –  4.46%)
│ └─Fiat-Shamir                                 1.70ms      11   0.00%  (hash –  0.01%)
├─quotient calculation (cached)                 6.80s       11  10.50%  (CC   – 79.12%)
│ ├─zerofier inverse                          987.79ms      11   1.52%  
│ └─evaluate AIR, compute quotient codeword     5.81s       11   8.97%  
├─quotient LDE                                  3.76s       11   5.80%  (LDE  – 12.77%)
├─hash rows of quotient segments              495.65ms      11   0.76%  (hash –  3.32%)
├─Merkle tree                                 650.83ms      11   1.00%  (hash –  4.36%)
├─out-of-domain rows                            1.31s       11   2.02%  
├─Fiat-Shamir                                   1.00ms      11   0.00%  (hash –  0.01%)
├─linear combination                            2.73s       11   4.22%  
│ ├─main                                      209.11ms      11   0.32%  (CC   –  2.43%)
│ ├─aux                                       184.97ms      11   0.29%  (CC   –  2.15%)
│ └─quotient                                    1.20s       11   1.85%  (CC   – 13.93%)
├─DEEP                                          1.09s       11   1.69%  
│ ├─main&aux curr row                         350.01ms      11   0.54%  
│ ├─main&aux next row                         370.70ms      11   0.57%  
│ └─segmented quotient                        372.15ms      11   0.57%  
├─combined DEEP polynomial                    202.92ms      11   0.31%  
│ └─sum                                       202.88ms      11   0.31%  (CC   –  2.36%)
├─FRI                                           2.25s       11   3.47%  
└─open trace leafs                              7.42ms      11   0.01%  


### Categories
LDE     29.41s  45.39%
hash    14.93s  23.04%
CC       8.60s  13.26%
gen      1.21s   1.86%


Clock frequency is 16976 Hz (100009 clock cycles / (64800 ms / 11 iterations))
Optimal clock frequency is 22249 Hz (131072 padded height / (64800 ms / 11 iterations))
FRI domain length is 2^20

TritonVM / triton-vm