Closed aszepieniec closed 1 month ago
With the memory warmup, that I added in my latest commit to this branch, this PR gives a speedup of about 2kHz on a canonical parameter set on our benchmark machine.
Current master
### Prove Fibonacci 00000000000000102400 94.61s #Reps Share Category
├─Fiat-Shamir: claim 3.71µs 1 0.00% (hash – 0.00%)
├─derive additional parameters 18.95µs 1 0.00%
├─base tables 50.73s 1 53.61%
│ ├─create 2.38s 1 2.52% (gen – 11.82%)
│ ├─pad 9.23s 1 9.76% (gen – 45.77%)
│ ├─randomize trace 86.38ms 1 0.09% (gen – 0.43%)
│ ├─LDE 25.31s 1 26.75% (LDE – 52.87%)
│ ├─Merkle tree 5.31s 1 5.61% (hash – 50.46%)
│ │ ├─leafs 4.80s 1 5.07%
│ │ └─Merkle tree 489.42ms 1 0.52%
│ ├─Fiat-Shamir 24.63µs 1 0.00% (hash – 0.00%)
│ └─extend 8.41s 1 8.89% (gen – 41.69%)
├─ext tables 20.40s 1 21.56%
│ ├─randomize trace 58.57ms 1 0.06% (gen – 0.29%)
│ ├─LDE 15.86s 1 16.76% (LDE – 33.13%)
│ ├─Merkle tree 4.48s 1 4.73% (hash – 42.54%)
│ │ ├─leafs 4.06s 1 4.29%
│ │ └─Merkle tree 398.66ms 1 0.42%
│ └─Fiat-Shamir 124.96µs 1 0.00% (hash – 0.00%)
├─quotient calculation (cached) 5.43s 1 5.74% (CC – 68.68%)
│ ├─zerofier inverse 1.19s 1 1.25%
│ └─evaluate AIR, compute quotient codeword 3.75s 1 3.97%
├─quotient LDE 6.70s 1 7.08% (LDE – 14.00%)
├─hash rows of quotient segments 341.03ms 1 0.36% (hash – 3.24%)
├─Merkle tree 394.98ms 1 0.42% (hash – 3.75%)
├─out-of-domain rows 626.86ms 1 0.66%
├─Fiat-Shamir 85.34µs 1 0.00% (hash – 0.00%)
├─linear combination 2.87s 1 3.03%
│ ├─base 416.99ms 1 0.44% (CC – 5.27%)
│ ├─ext 59.89ms 1 0.06% (CC – 0.76%)
│ └─quotient 1.14s 1 1.21% (CC – 14.45%)
├─DEEP 1.17s 1 1.23%
│ ├─base&ext curr row 363.59ms 1 0.38%
│ ├─base&ext next row 398.33ms 1 0.42%
│ └─segmented quotient 404.52ms 1 0.43%
├─combined DEEP polynomial 937.28ms 1 0.99%
│ └─sum 857.15ms 1 0.91% (CC – 10.84%)
├─FRI 1.45s 1 1.53%
└─open trace leafs 1.50ms 1 0.00%
### Categories
LDE 47.87s 50.59%
gen 20.17s 21.32%
hash 10.53s 11.13%
CC 7.91s 8.36%
Clock frequency is 10823 Hz (1024009 clock cycles / 94610 ms / 1 iterations)
Optimal clock frequency is 11083 Hz (1048576 padded height / 94610 ms / 1 iterations)
FRI domain length is 2^23
This PR
### Prove Fibonacci 00000000000000102400 80.96s #Reps Share Category
├─Fiat-Shamir: claim 3.97µs 1 0.00% (hash – 0.00%)
├─derive additional parameters 22.96µs 1 0.00%
├─base tables 42.71s 1 52.76%
│ ├─create 2.37s 1 2.93% (gen – 11.70%)
│ ├─pad 9.44s 1 11.66% (gen – 46.61%)
│ ├─randomize trace 85.04ms 1 0.11% (gen – 0.42%)
│ ├─LDE 17.25s 1 21.31% (LDE – 50.29%)
│ │ ├─polynomial zero-initialization 3.00µs 1 0.00%
│ │ ├─interpolation 2.40s 1 2.97%
│ │ ├─resize 1.06µs 1 0.00%
│ │ ├─memory warmup 867.27ms 1 1.07%
│ │ ├─evaluation 13.98s 1 17.27%
│ │ └─memoize 1.24µs 1 0.00%
│ ├─Merkle tree 5.26s 1 6.50% (hash – 49.00%)
│ │ ├─leafs 4.82s 1 5.96%
│ │ └─Merkle tree 416.41ms 1 0.51%
│ ├─Fiat-Shamir 24.56µs 1 0.00% (hash – 0.00%)
│ └─extend 8.30s 1 10.25% (gen – 40.98%)
├─ext tables 15.35s 1 18.96%
│ ├─randomize trace 59.74ms 1 0.07% (gen – 0.29%)
│ ├─LDE 10.54s 1 13.02% (LDE – 30.74%)
│ │ ├─polynomial zero-initialization 2.37µs 1 0.00%
│ │ ├─interpolation 1.91s 1 2.36%
│ │ ├─resize 1.24µs 1 0.00%
│ │ ├─memory warmup 613.71ms 1 0.76%
│ │ ├─evaluation 8.02s 1 9.90%
│ │ └─memoize 1.06µs 1 0.00%
│ ├─Merkle tree 4.74s 1 5.86% (hash – 44.15%)
│ │ ├─leafs 3.90s 1 4.82%
│ │ └─Merkle tree 726.15ms 1 0.90%
│ └─Fiat-Shamir 129.19µs 1 0.00% (hash – 0.00%)
├─quotient calculation (cached) 4.84s 1 5.98% (CC – 65.29%)
│ ├─zerofier inverse 875.15ms 1 1.08%
│ └─evaluate AIR, compute quotient codeword 3.74s 1 4.62%
├─quotient LDE 6.51s 1 8.04% (LDE – 18.98%)
├─hash rows of quotient segments 340.45ms 1 0.42% (hash – 3.17%)
├─Merkle tree 395.03ms 1 0.49% (hash – 3.68%)
├─out-of-domain rows 536.75ms 1 0.66%
├─Fiat-Shamir 86.13µs 1 0.00% (hash – 0.00%)
├─linear combination 2.89s 1 3.57%
│ ├─base 224.35ms 1 0.28% (CC – 3.02%)
│ ├─ext 286.31ms 1 0.35% (CC – 3.86%)
│ └─quotient 1.12s 1 1.38% (CC – 15.03%)
├─DEEP 1.16s 1 1.44%
│ ├─base&ext curr row 366.18ms 1 0.45%
│ ├─base&ext next row 398.12ms 1 0.49%
│ └─segmented quotient 398.08ms 1 0.49%
├─combined DEEP polynomial 1.03s 1 1.27%
│ └─sum 949.48ms 1 1.17% (CC – 12.80%)
├─FRI 1.49s 1 1.84%
└─open trace leafs 1.47ms 1 0.00%
### Categories
LDE 34.31s 42.37%
gen 20.26s 25.02%
hash 10.74s 13.27%
CC 7.42s 9.16%
Clock frequency is 12648 Hz (1024009 clock cycles / 80958 ms / 1 iterations)
Optimal clock frequency is 12952 Hz (1048576 padded height / 80958 ms / 1 iterations)
FRI domain length is 2^23
The line you added also makes it glaringly obvious that every element is in fact written to before it is read from, which I like tremendously.
This fix uses the unsafe keyword, but we are convinced that this is safe because all elements are being written to before read from.