TritonVM / triton-vm

Triton is a virtual machine that comes with Algebraic Execution Tables (AET) and Arithmetic Intermediate Representations (AIR) for use in combination with a STARK proof system.
https://triton-vm.org
Apache License 2.0
223 stars 35 forks source link

Parallelize Filling of Degree-Lowering Table #284

Closed aszepieniec closed 1 month ago

aszepieniec commented 1 month ago

This PR modifies the auto-generated code for filling the degree-lowering table, essentially upgrading it from sequential to parallel iteration. To achieve this, the master base and extension tables are split into left and right parts at the column index separating original columns from degree-lowering columns. Then parallel iteration over the rows allows filling in the degree-lowering rows left to right. No value in any degree-lowering row depends on values to its right.

A complication arises when filling the degree-lowering columns for transition constraints as the degree-lowering values depend on two rows, current and next. The solution is already implied by the AIR constraints, which ensures that all degree-lowering values live in the current row. Therefore, by parallel iteration over single rows of the degree-lowering part, and overlapping row pairs of the original part, one can fill the former left to right. Note that the overlapping row pairs do not interfere with parallel execution because these are immutable references. Rust disallows multiple mutable references to the same data, but in this case only the right half of the table is mutable, and there we select one row per iteration.

Much of the kudos goes to @jan-ferdinand who came up with the blueprint for the solution.

On my desktop machine, benchmarking prove_halt --

sequential: image

parallel: image

jan-ferdinand commented 1 month ago

I'd like to point out that in the screenshots, the number of iterations is different: 292 in the sequential case, 347 in the parallel. This gives the following mean times & improvements:

mean [ms] old new
pad 1.62 0.29 -82%
fill 1.49 0.16 -89%

Excellent work.

Sword-Smith commented 1 month ago

I benchmarked this on our reference machine, and it gives an additional 2kHz on top of #283 :)

After the #283 branch was rebased against this branch:

### Prove Fibonacci 00000000000000102400       70.35s    #Reps   Share  Category
├─Fiat-Shamir: claim                            4.26µs       1   0.00%  (hash –  0.00%)
├─derive additional parameters                 11.96µs       1   0.00%  
├─base tables                                  30.95s        1  43.99%  
│ ├─create                                      2.36s        1   3.36%  (gen  – 28.76%)
│ ├─pad                                       351.50ms       1   0.50%  (gen  –  4.28%)
│ │ ├─pad original tables                     178.15ms       1   0.25%  
│ │ └─fill degree-lowering table              173.28ms       1   0.25%  
│ ├─randomize trace                            83.18ms       1   0.12%  (gen  –  1.01%)
│ ├─LDE                                        17.56s        1  24.96%  (LDE  – 49.88%)
│ │ ├─polynomial zero-initialization            3.45µs       1   0.00%  
│ │ ├─interpolation                             2.38s        1   3.38%  
│ │ ├─resize                                  772.00ns       1   0.00%  
│ │ ├─memory warmup                           866.28ms       1   1.23%  
│ │ ├─evaluation                               14.32s        1  20.35%  
│ │ └─memoize                                 772.00ns       1   0.00%  
│ ├─Merkle tree                                 5.23s        1   7.44%  (hash – 48.16%)
│ │ ├─leafs                                     4.80s        1   6.83%  
│ │ └─Merkle tree                             406.91ms       1   0.58%  
│ ├─Fiat-Shamir                                24.68µs       1   0.00%  (hash –  0.00%)
│ └─extend                                      5.36s        1   7.62%  (gen  – 65.25%)
├─ext tables                                   15.92s        1  22.64%  
│ ├─randomize trace                            57.53ms       1   0.08%  (gen  –  0.70%)
│ ├─LDE                                        10.98s        1  15.60%  (LDE  – 31.18%)
│ │ ├─polynomial zero-initialization            1.83µs       1   0.00%  
│ │ ├─interpolation                             1.93s        1   2.75%  
│ │ ├─resize                                    1.16µs       1   0.00%  
│ │ ├─memory warmup                           613.01ms       1   0.87%  
│ │ ├─evaluation                                8.36s        1  11.88%  
│ │ └─memoize                                   1.53µs       1   0.00%  
│ ├─Merkle tree                                 4.89s        1   6.95%  (hash – 45.04%)
│ │ ├─leafs                                     4.01s        1   5.71%  
│ │ └─Merkle tree                             756.02ms       1   1.07%  
│ └─Fiat-Shamir                               127.55µs       1   0.00%  (hash –  0.00%)
├─quotient calculation (cached)                 4.74s        1   6.73%  (CC   – 62.59%)
│ ├─zerofier inverse                          864.25ms       1   1.23%  
│ └─evaluate AIR, compute quotient codeword     3.73s        1   5.31%  
├─quotient LDE                                  6.67s        1   9.48%  (LDE  – 18.94%)
├─hash rows of quotient segments              342.65ms       1   0.49%  (hash –  3.15%)
├─Merkle tree                                 396.10ms       1   0.56%  (hash –  3.65%)
├─out-of-domain rows                          605.21ms       1   0.86%  
├─Fiat-Shamir                                  85.84µs       1   0.00%  (hash –  0.00%)
├─linear combination                            2.98s        1   4.23%  
│ ├─base                                      515.97ms       1   0.73%  (CC   –  6.82%)
│ ├─ext                                        75.43ms       1   0.11%  (CC   –  1.00%)
│ └─quotient                                    1.13s        1   1.61%  (CC   – 14.93%)
├─DEEP                                          1.16s        1   1.66%  
│ ├─base&ext curr row                         365.85ms       1   0.52%  
│ ├─base&ext next row                         393.85ms       1   0.56%  
│ └─segmented quotient                        404.62ms       1   0.58%  
├─combined DEEP polynomial                      1.19s        1   1.69%  
│ └─sum                                         1.11s        1   1.58%  (CC   – 14.66%)
├─FRI                                           1.46s        1   2.07%  
└─open trace leafs                              1.56ms       1   0.00%  

### Categories
LDE     35.20s  50.04%
hash    10.86s  15.44%
gen      8.22s  11.68%
CC       7.57s  10.75%

Clock frequency is 14555 Hz (1024009 clock cycles / 70353 ms / 1 iterations)
Optimal clock frequency is 14904 Hz (1048576 padded height / 70353 ms / 1 iterations)
FRI domain length is 2^23