TritonVM / triton-vm

Triton is a virtual machine that comes with Algebraic Execution Tables (AET) and Arithmetic Intermediate Representations (AIR) for use in combination with a STARK proof system.
https://triton-vm.org
Apache License 2.0
223 stars 35 forks source link

Speed up calls to `ndarray`'s `.zeros()` #274

Closed Sword-Smith closed 1 month ago

Sword-Smith commented 1 month ago

I wrapped the two calls tondarray's zeros in low_degree_extend_all_columns in the timing-profiler to measure how long they use, and found that for a FRI domain length of $2^{20}$ those functions take 5 % of total prover time to execute. This might be a good place to use unsafe code, or do something else. Because we're writing zeros to memory cells that will be overwritten again before being read.

### Prove Fibonacci 00000000000000010000       27.74s    Share  Category
├─Fiat-Shamir: claim                           35.30µs   0.00% (hash –  0.00%)
├─derive additional parameters                 36.96µs   0.00% 
├─base tables                                  13.60s   49.04% 
│ ├─create                                    202.34ms   0.73% (gen  –  9.81%)
│ ├─pad                                       872.07ms   3.14% (gen  – 42.26%)
│ ├─randomize trace                            46.75ms   0.17% (gen  –  2.27%)
│ ├─LDE                                         7.35s   26.50% (LDE  – 30.74%)
│ │ ├─LDE-zeros                               716.63ms   2.58% (LDE  –  3.00%) <-- HERE
│ │ └─LDE-inner                                 6.09s   21.95% (LDE  – 25.46%)
│ ├─Merkle tree                                 4.22s   15.22% (hash – 52.00%)
│ │ ├─leafs                                     4.11s   14.81% 
│ │ └─Merkle tree                             109.22ms   0.39% 
│ ├─Fiat-Shamir                                72.01µs   0.00% (hash –  0.00%)
│ └─extend                                    909.17ms   3.28% (gen  – 44.06%)
├─ext tables                                    8.11s   29.23% 
│ ├─randomize trace                            33.20ms   0.12% (gen  –  1.61%)
│ ├─LDE                                         4.64s   16.73% (LDE  – 19.40%)
│ │ ├─LDE-zeros                               452.08ms   1.63% (LDE  –  1.89%) <-- AND HERE
│ │ └─LDE-inner                                 4.19s   15.10% (LDE  – 17.51%)
│ ├─Merkle tree                                 3.44s   12.39% (hash – 42.33%)
│ │ ├─leafs                                     3.33s   12.00% 
│ │ └─Merkle tree                             105.38ms   0.38% 
│ └─Fiat-Shamir                               134.42µs   0.00% (hash –  0.00%)
├─quotient-domain codewords                   487.00ns   0.00% 
├─compute and combine quotient codewords        2.90s   10.46% (CC   – 87.06%)
│ ├─zerofier inverse                           76.11ms   0.27% 
│ └─evaluate AIR, compute quotient codeword     2.83s   10.19% 
├─commit to quotient codeword segments        936.35ms   3.38% 
│ ├─LDE                                       477.00ms   1.72% (LDE  –  1.99%)
│ ├─hash rows of quotient segments            355.29ms   1.28% (hash –  4.38%)
│ └─Merkle tree                               104.06ms   0.38% (hash –  1.28%)
├─out-of-domain rows                          407.88ms   1.47% 
├─Fiat-Shamir                                  84.56µs   0.00% (hash –  0.00%)
├─linear combination                          354.90ms   1.28% 
│ ├─base                                      194.64ms   0.70% (CC   –  5.84%)
│ ├─ext                                       149.01ms   0.54% (CC   –  4.47%)
│ └─quotient                                    8.04ms   0.03% (CC   –  0.24%)
├─DEEP                                        861.83ms   3.11% 
│ ├─interpolate                               203.75ms   0.73% 
│ ├─base&ext curr row                         209.39ms   0.75% 
│ ├─base&ext next row                         226.05ms   0.81% 
│ └─segmented quotient                        222.62ms   0.80% 
├─combined DEEP polynomial                     81.06ms   0.29% 
│ └─sum                                        79.64ms   0.29% (CC   –  2.39%)
├─FRI                                         242.67ms   0.87% 
└─open trace leafs                              1.12ms   0.00% 

### Categories
LDE     23.91s  86.20%
hash     8.12s  29.26%
CC       3.33s  12.02%
gen      2.06s   7.44%

Clock frequency is 3605 Hz (100009 clock cycles / 27739 ms)
Optimal clock frequency is 4725 Hz (131072 padded height / 27739 ms)
FRI domain length is 2^20

Cf.

        prof_start!(maybe_profiler, "LDE-zeros", "LDE");
        let mut interpolation_polynomials = Array1::zeros(num_columns);
        let mut extended_columns = Array2::zeros([num_rows, num_columns]);
        prof_stop!(maybe_profiler, "LDE-zeros");
        prof_start!(maybe_profiler, "LDE-inner", "LDE");
        Zip::from(extended_columns.axis_iter_mut(Axis(1)))
            .and(self.randomized_trace_table().axis_iter(Axis(1)))
            .and(interpolation_polynomials.axis_iter_mut(Axis(0)))
            .par_for_each(|lde_column, trace_column, poly| {
                let trace_column = trace_column.as_slice().unwrap();
                let interpolation_polynomial = randomized_trace_domain.interpolate(trace_column);
                let lde_codeword = evaluation_domain.evaluate(&interpolation_polynomial);
                Array1::from(lde_codeword).move_into(lde_column);
                Array0::from_elem((), interpolation_polynomial).move_into(poly);
            });
        prof_stop!(maybe_profiler, "LDE-inner");
jan-ferdinand commented 1 month ago

Fixed by f7b13e741f69729f7cfa2b8088785193ed1271dd