Closed aszepieniec closed 1 month ago
I'd like to point out that in the screenshots, the number of iterations is different: 292 in the sequential case, 347 in the parallel. This gives the following mean times & improvements:
mean [ms] | old | new | |
---|---|---|---|
pad | 1.62 | 0.29 | -82% |
fill | 1.49 | 0.16 | -89% |
Excellent work.
I benchmarked this on our reference machine, and it gives an additional 2kHz on top of #283 :)
After the #283 branch was rebased against this branch:
### Prove Fibonacci 00000000000000102400 70.35s #Reps Share Category
├─Fiat-Shamir: claim 4.26µs 1 0.00% (hash – 0.00%)
├─derive additional parameters 11.96µs 1 0.00%
├─base tables 30.95s 1 43.99%
│ ├─create 2.36s 1 3.36% (gen – 28.76%)
│ ├─pad 351.50ms 1 0.50% (gen – 4.28%)
│ │ ├─pad original tables 178.15ms 1 0.25%
│ │ └─fill degree-lowering table 173.28ms 1 0.25%
│ ├─randomize trace 83.18ms 1 0.12% (gen – 1.01%)
│ ├─LDE 17.56s 1 24.96% (LDE – 49.88%)
│ │ ├─polynomial zero-initialization 3.45µs 1 0.00%
│ │ ├─interpolation 2.38s 1 3.38%
│ │ ├─resize 772.00ns 1 0.00%
│ │ ├─memory warmup 866.28ms 1 1.23%
│ │ ├─evaluation 14.32s 1 20.35%
│ │ └─memoize 772.00ns 1 0.00%
│ ├─Merkle tree 5.23s 1 7.44% (hash – 48.16%)
│ │ ├─leafs 4.80s 1 6.83%
│ │ └─Merkle tree 406.91ms 1 0.58%
│ ├─Fiat-Shamir 24.68µs 1 0.00% (hash – 0.00%)
│ └─extend 5.36s 1 7.62% (gen – 65.25%)
├─ext tables 15.92s 1 22.64%
│ ├─randomize trace 57.53ms 1 0.08% (gen – 0.70%)
│ ├─LDE 10.98s 1 15.60% (LDE – 31.18%)
│ │ ├─polynomial zero-initialization 1.83µs 1 0.00%
│ │ ├─interpolation 1.93s 1 2.75%
│ │ ├─resize 1.16µs 1 0.00%
│ │ ├─memory warmup 613.01ms 1 0.87%
│ │ ├─evaluation 8.36s 1 11.88%
│ │ └─memoize 1.53µs 1 0.00%
│ ├─Merkle tree 4.89s 1 6.95% (hash – 45.04%)
│ │ ├─leafs 4.01s 1 5.71%
│ │ └─Merkle tree 756.02ms 1 1.07%
│ └─Fiat-Shamir 127.55µs 1 0.00% (hash – 0.00%)
├─quotient calculation (cached) 4.74s 1 6.73% (CC – 62.59%)
│ ├─zerofier inverse 864.25ms 1 1.23%
│ └─evaluate AIR, compute quotient codeword 3.73s 1 5.31%
├─quotient LDE 6.67s 1 9.48% (LDE – 18.94%)
├─hash rows of quotient segments 342.65ms 1 0.49% (hash – 3.15%)
├─Merkle tree 396.10ms 1 0.56% (hash – 3.65%)
├─out-of-domain rows 605.21ms 1 0.86%
├─Fiat-Shamir 85.84µs 1 0.00% (hash – 0.00%)
├─linear combination 2.98s 1 4.23%
│ ├─base 515.97ms 1 0.73% (CC – 6.82%)
│ ├─ext 75.43ms 1 0.11% (CC – 1.00%)
│ └─quotient 1.13s 1 1.61% (CC – 14.93%)
├─DEEP 1.16s 1 1.66%
│ ├─base&ext curr row 365.85ms 1 0.52%
│ ├─base&ext next row 393.85ms 1 0.56%
│ └─segmented quotient 404.62ms 1 0.58%
├─combined DEEP polynomial 1.19s 1 1.69%
│ └─sum 1.11s 1 1.58% (CC – 14.66%)
├─FRI 1.46s 1 2.07%
└─open trace leafs 1.56ms 1 0.00%
### Categories
LDE 35.20s 50.04%
hash 10.86s 15.44%
gen 8.22s 11.68%
CC 7.57s 10.75%
Clock frequency is 14555 Hz (1024009 clock cycles / 70353 ms / 1 iterations)
Optimal clock frequency is 14904 Hz (1048576 padded height / 70353 ms / 1 iterations)
FRI domain length is 2^23
This PR modifies the auto-generated code for filling the degree-lowering table, essentially upgrading it from sequential to parallel iteration. To achieve this, the master base and extension tables are split into left and right parts at the column index separating original columns from degree-lowering columns. Then parallel iteration over the rows allows filling in the degree-lowering rows left to right. No value in any degree-lowering row depends on values to its right.
A complication arises when filling the degree-lowering columns for transition constraints as the degree-lowering values depend on two rows, current and next. The solution is already implied by the AIR constraints, which ensures that all degree-lowering values live in the current row. Therefore, by parallel iteration over single rows of the degree-lowering part, and overlapping row pairs of the original part, one can fill the former left to right. Note that the overlapping row pairs do not interfere with parallel execution because these are immutable references. Rust disallows multiple mutable references to the same data, but in this case only the right half of the table is mutable, and there we select one row per iteration.
Much of the kudos goes to @jan-ferdinand who came up with the blueprint for the solution.
On my desktop machine, benchmarking
prove_halt
--sequential:![image](https://github.com/TritonVM/triton-vm/assets/1583170/42bf1900-3b4f-4d52-be7d-060b8090a701)
parallel:![image](https://github.com/TritonVM/triton-vm/assets/1583170/d6072176-fb03-4a25-99e1-369765976350)