TritonVM / triton-vm

Triton is a virtual machine that comes with Algebraic Execution Tables (AET) and Arithmetic Intermediate Representations (AIR) for use in combination with a STARK proof system.
https://triton-vm.org
Apache License 2.0
223 stars 35 forks source link

perf: Profile and fix slow zero-initialization #283

Closed aszepieniec closed 1 month ago

aszepieniec commented 1 month ago

This fix uses the unsafe keyword, but we are convinced that this is safe because all elements are being written to before read from.

Sword-Smith commented 1 month ago

With the memory warmup, that I added in my latest commit to this branch, this PR gives a speedup of about 2kHz on a canonical parameter set on our benchmark machine.

Current master

### Prove Fibonacci 00000000000000102400       94.61s    #Reps   Share  Category
├─Fiat-Shamir: claim                            3.71µs       1   0.00%  (hash –  0.00%)
├─derive additional parameters                 18.95µs       1   0.00%  
├─base tables                                  50.73s        1  53.61%  
│ ├─create                                      2.38s        1   2.52%  (gen  – 11.82%)
│ ├─pad                                         9.23s        1   9.76%  (gen  – 45.77%)
│ ├─randomize trace                            86.38ms       1   0.09%  (gen  –  0.43%)
│ ├─LDE                                        25.31s        1  26.75%  (LDE  – 52.87%)
│ ├─Merkle tree                                 5.31s        1   5.61%  (hash – 50.46%)
│ │ ├─leafs                                     4.80s        1   5.07%  
│ │ └─Merkle tree                             489.42ms       1   0.52%  
│ ├─Fiat-Shamir                                24.63µs       1   0.00%  (hash –  0.00%)
│ └─extend                                      8.41s        1   8.89%  (gen  – 41.69%)
├─ext tables                                   20.40s        1  21.56%  
│ ├─randomize trace                            58.57ms       1   0.06%  (gen  –  0.29%)
│ ├─LDE                                        15.86s        1  16.76%  (LDE  – 33.13%)
│ ├─Merkle tree                                 4.48s        1   4.73%  (hash – 42.54%)
│ │ ├─leafs                                     4.06s        1   4.29%  
│ │ └─Merkle tree                             398.66ms       1   0.42%  
│ └─Fiat-Shamir                               124.96µs       1   0.00%  (hash –  0.00%)
├─quotient calculation (cached)                 5.43s        1   5.74%  (CC   – 68.68%)
│ ├─zerofier inverse                            1.19s        1   1.25%  
│ └─evaluate AIR, compute quotient codeword     3.75s        1   3.97%  
├─quotient LDE                                  6.70s        1   7.08%  (LDE  – 14.00%)
├─hash rows of quotient segments              341.03ms       1   0.36%  (hash –  3.24%)
├─Merkle tree                                 394.98ms       1   0.42%  (hash –  3.75%)
├─out-of-domain rows                          626.86ms       1   0.66%  
├─Fiat-Shamir                                  85.34µs       1   0.00%  (hash –  0.00%)
├─linear combination                            2.87s        1   3.03%  
│ ├─base                                      416.99ms       1   0.44%  (CC   –  5.27%)
│ ├─ext                                        59.89ms       1   0.06%  (CC   –  0.76%)
│ └─quotient                                    1.14s        1   1.21%  (CC   – 14.45%)
├─DEEP                                          1.17s        1   1.23%  
│ ├─base&ext curr row                         363.59ms       1   0.38%  
│ ├─base&ext next row                         398.33ms       1   0.42%  
│ └─segmented quotient                        404.52ms       1   0.43%  
├─combined DEEP polynomial                    937.28ms       1   0.99%  
│ └─sum                                       857.15ms       1   0.91%  (CC   – 10.84%)
├─FRI                                           1.45s        1   1.53%  
└─open trace leafs                              1.50ms       1   0.00%  

### Categories
LDE     47.87s  50.59%
gen     20.17s  21.32%
hash    10.53s  11.13%
CC       7.91s   8.36%

Clock frequency is 10823 Hz (1024009 clock cycles / 94610 ms / 1 iterations)
Optimal clock frequency is 11083 Hz (1048576 padded height / 94610 ms / 1 iterations)
FRI domain length is 2^23

This PR

### Prove Fibonacci 00000000000000102400       80.96s    #Reps   Share  Category
├─Fiat-Shamir: claim                            3.97µs       1   0.00%  (hash –  0.00%)
├─derive additional parameters                 22.96µs       1   0.00%  
├─base tables                                  42.71s        1  52.76%  
│ ├─create                                      2.37s        1   2.93%  (gen  – 11.70%)
│ ├─pad                                         9.44s        1  11.66%  (gen  – 46.61%)
│ ├─randomize trace                            85.04ms       1   0.11%  (gen  –  0.42%)
│ ├─LDE                                        17.25s        1  21.31%  (LDE  – 50.29%)
│ │ ├─polynomial zero-initialization            3.00µs       1   0.00%  
│ │ ├─interpolation                             2.40s        1   2.97%  
│ │ ├─resize                                    1.06µs       1   0.00%  
│ │ ├─memory warmup                           867.27ms       1   1.07%  
│ │ ├─evaluation                               13.98s        1  17.27%  
│ │ └─memoize                                   1.24µs       1   0.00%  
│ ├─Merkle tree                                 5.26s        1   6.50%  (hash – 49.00%)
│ │ ├─leafs                                     4.82s        1   5.96%  
│ │ └─Merkle tree                             416.41ms       1   0.51%  
│ ├─Fiat-Shamir                                24.56µs       1   0.00%  (hash –  0.00%)
│ └─extend                                      8.30s        1  10.25%  (gen  – 40.98%)
├─ext tables                                   15.35s        1  18.96%  
│ ├─randomize trace                            59.74ms       1   0.07%  (gen  –  0.29%)
│ ├─LDE                                        10.54s        1  13.02%  (LDE  – 30.74%)
│ │ ├─polynomial zero-initialization            2.37µs       1   0.00%  
│ │ ├─interpolation                             1.91s        1   2.36%  
│ │ ├─resize                                    1.24µs       1   0.00%  
│ │ ├─memory warmup                           613.71ms       1   0.76%  
│ │ ├─evaluation                                8.02s        1   9.90%  
│ │ └─memoize                                   1.06µs       1   0.00%  
│ ├─Merkle tree                                 4.74s        1   5.86%  (hash – 44.15%)
│ │ ├─leafs                                     3.90s        1   4.82%  
│ │ └─Merkle tree                             726.15ms       1   0.90%  
│ └─Fiat-Shamir                               129.19µs       1   0.00%  (hash –  0.00%)
├─quotient calculation (cached)                 4.84s        1   5.98%  (CC   – 65.29%)
│ ├─zerofier inverse                          875.15ms       1   1.08%  
│ └─evaluate AIR, compute quotient codeword     3.74s        1   4.62%  
├─quotient LDE                                  6.51s        1   8.04%  (LDE  – 18.98%)
├─hash rows of quotient segments              340.45ms       1   0.42%  (hash –  3.17%)
├─Merkle tree                                 395.03ms       1   0.49%  (hash –  3.68%)
├─out-of-domain rows                          536.75ms       1   0.66%  
├─Fiat-Shamir                                  86.13µs       1   0.00%  (hash –  0.00%)
├─linear combination                            2.89s        1   3.57%  
│ ├─base                                      224.35ms       1   0.28%  (CC   –  3.02%)
│ ├─ext                                       286.31ms       1   0.35%  (CC   –  3.86%)
│ └─quotient                                    1.12s        1   1.38%  (CC   – 15.03%)
├─DEEP                                          1.16s        1   1.44%  
│ ├─base&ext curr row                         366.18ms       1   0.45%  
│ ├─base&ext next row                         398.12ms       1   0.49%  
│ └─segmented quotient                        398.08ms       1   0.49%  
├─combined DEEP polynomial                      1.03s        1   1.27%  
│ └─sum                                       949.48ms       1   1.17%  (CC   – 12.80%)
├─FRI                                           1.49s        1   1.84%  
└─open trace leafs                              1.47ms       1   0.00%  

### Categories
LDE     34.31s  42.37%
gen     20.26s  25.02%
hash    10.74s  13.27%
CC       7.42s   9.16%

Clock frequency is 12648 Hz (1024009 clock cycles / 80958 ms / 1 iterations)
Optimal clock frequency is 12952 Hz (1048576 padded height / 80958 ms / 1 iterations)
FRI domain length is 2^23
jan-ferdinand commented 1 month ago

The line you added also makes it glaringly obvious that every element is in fact written to before it is read from, which I like tremendously.