Another example, say I'm implementing a loop where the number of iterations is known at compile time. Is it strictly more performant to unroll this? The `call`, `skiz`, `recurse`, and `return` instructions would be removed, but the program source length would be ~10x longer (number of iterations = 10).
```
push 0        // _ 0                                  (loop counter i)
push 10       // _ 0 10                               (remaining iterations)
call loop_____0
halt          // _ 10

loop_____0:
              // _ i rem
push 1000     // _ i rem 1000                         (the loop "body")
dup 2         // _ i rem 1000 i
push 1        // _ i rem 1000 i 1
add           // _ i rem 1000 (i+1)
swap 3        // _ (i+1) rem 1000 i
pop 1         // _ (i+1) rem 1000
dup 1         // _ (i+1) rem 1000 rem
push -1       // _ (i+1) rem 1000 rem -1
add           // _ (i+1) rem 1000 (rem-1)
dup 0         // _ (i+1) rem 1000 (rem-1) (rem-1)
swap 3        // _ (i+1) (rem-1) 1000 (rem-1) rem
pop 1         // _ (i+1) (rem-1) 1000 (rem-1)
dup 2         // _ (i+1) (rem-1) 1000 (rem-1) (rem-1)
swap 2        // _ (i+1) (rem-1) (rem-1) (rem-1) 1000
pop 1         // _ (i+1) (rem-1) (rem-1) (rem-1)
pop 1         // _ (i+1) (rem-1) (rem-1)
skiz          // _ (i+1) (rem-1)   (skip `recurse` iff rem-1 == 0)
recurse
pop 1         // _ (i+1)
return
```
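For concreteness, a fully unrolled variant would repeat only the per-iteration work and drop all of the counter bookkeeping. A hypothetical sketch, assuming the `triton_program!` macro from the `triton_vm` crate:

```rust
use triton_vm::triton_program;

fn main() {
    // All `call`/`skiz`/`recurse`/`return` bookkeeping is gone; only the
    // per-iteration work (here: `push 1000 pop 1`) is repeated 10 times.
    let _unrolled = triton_program!(
        push 1000 pop 1   // iteration 1
        push 1000 pop 1   // iteration 2
        push 1000 pop 1   // iteration 3
        // ... seven more copies ...
        halt
    );
}
```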
Is there a performance advantage to using functions for repetitive code? E.g., is one of the following programs more performant than the other? From my understanding, they should both have similar program counters at the end of execution, with the exception that the program with the jumps has 4 more instructions (the `return`s and the `call`s).
I implemented both examples of your 1st comment in `tasm-lib` here. The benchmarks show:
no unrolling:

```json
{
  "name": "tasmlib_performance_experiments_no_unrolling",
  "benchmark_result": {
    "clock_cycle_count": 17,
    "hash_table_height": 12,
    "u32_table_height": 0,
    "op_stack_table_height": 10,
    "ram_table_height": 0
  }
}
```
with unrolling:

```json
{
  "name": "tasmlib_performance_experiments_with_unrolling",
  "benchmark_result": {
    "clock_cycle_count": 13,
    "hash_table_height": 18,
    "u32_table_height": 0,
    "op_stack_table_height": 10,
    "ram_table_height": 0
  }
}
```
So you're spot-on with your predicted difference of 4 clock cycles: unrolling saves 4 clock cycles in this case. Notice, however, that the `hash_table_height` is greater in the example with unrolling. That's because the VM first needs to hash the entire program that it is running, and this cost shows up as hash-table rows. So you do pay a price for a bigger program.

What the optimal solution is depends on your concrete problem. In my experience, loop unrolling is very often (almost always?) worth it. Does that conclusion transfer to all practical programs? I don't know.
Notice that `tasm-lib` performs its tests and benchmarks by wrapping everything in a call, which gives rise to two more clock cycles than you might otherwise expect, just in case you're counting and can't make sense of the absolute numbers.
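To make that wrapper overhead concrete, here is a sketch of the shape such a harness produces (hypothetical, not `tasm-lib`'s actual code; it assumes the `triton_program!` macro from the `triton_vm` crate, and `snippet` is a placeholder label):

```rust
use triton_vm::triton_program;

fn main() {
    // The code under test is entered via `call` and left via `return`.
    // Each of those two instructions costs one clock cycle, which is
    // where the two extra cycles in the benchmark numbers come from.
    let _wrapped = triton_program!(
        call snippet
        halt
        snippet:
            push 1000   // placeholder for the code under test
            pop 1
            return
    );
}
```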
> Another example, say I'm implementing a loop where the number of iterations is known at compile time. Is it strictly more performant to unroll this? The `call`, `skiz`, `recurse`, and `return` instructions would be removed, but the program source length would be ~10x longer (number of iterations = 10).
I would say it's usually strictly more performant to unroll a loop, yes. A common pattern is:
```
// _ i num_iterations a b c
loop_label:
    // _ i num_iterations a b c
    dup 4
    dup 4
    eq
    skiz
        return
    // _ i num_iterations a b c
    <loop body>
    // _ i num_iterations e f g

    /* Update `i` */
    swap 4
    push 1
    add
    swap 4
    // _ (i + 1) num_iterations e f g
    recurse
```
If the number of loop iterations is known at compile time, you save 9 clock cycles per loop iteration in the above example: the termination check (`dup 4`, `dup 4`, `eq`, `skiz`), the update of `i` (`swap 4`, `push 1`, `add`, `swap 4`), and `recurse`. Through clever use of pointer values and `recurse_or_return`, that number can often be reduced to 3, and in some cases to only 1; see the sketch below. So whether 1, 3, or 9 is a big number depends on the size of your loop body. If your loop body is 50 or more instructions, maybe it doesn't really matter. But if it's only a few instructions, you certainly want to avoid this bookkeeping if performance is something you care about.
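To illustrate the 1-cycle case, here is a sketch of the `merkle_step` + `recurse_or_return` idiom, based on my reading of the Triton VM ISA (`merkle_step` halves the node index in stack element 5; `recurse_or_return` acts like `recurse` while stack elements 5 and 6 differ, and like `return` once they are equal; `triton_asm!` is the macro from the `triton_vm` crate):

```rust
use triton_vm::triton_asm;

fn main() {
    // Ascending a Merkle tree: the loop body itself updates the loop
    // variable (the node index in ST5), so the only loop bookkeeping
    // left is `recurse_or_return`, a single clock cycle per iteration.
    let _merkle_ascend = triton_asm!(
        // _ 1 node_index [digest; 5]
        merkle_ascend:
            merkle_step          // one tree layer up; ST5 is halved
            recurse_or_return    // loop until ST5 (node index) equals ST6 (i.e., 1)
    );
}
```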
Every 10 words of program size cost 6 hash-table rows. Every instruction that does not take an argument has a size of 1 word; every instruction that does take an argument has a size of 2 words.
Notice that if you're not doing anything related to hashing (using the instructions `hash`, `merkle_step`, `sponge_absorb`, or `sponge_squeeze`), then even in straight-line code with no loops or branches, the clock cycle count will (almost always) dominate the hash-table row count. Imagine your program is 1000 instructions with an average size of 1.5 words per instruction. Then the hash-table row count from hashing the program will be $1000 \cdot 1.5 \cdot 6 / 10 = 900$, whereas the clock cycle count will be $1000$. And since you're only paying for the highest row count (in terms of the time it takes to generate the proof), only the clock cycle count matters in this case.
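Spelling out when that relation could flip (a back-of-the-envelope consequence of the cost rule above, not a claim made in the thread): for a straight-line program of $n$ instructions with an average size of $s$ words, hashing the program out-costs execution only if

$$n \cdot s \cdot \tfrac{6}{10} > n \quad\Longleftrightarrow\quad s > \tfrac{10}{6} \approx 1.67.$$

Since no instruction is larger than 2 words, this requires more than two thirds of all instructions to take an argument.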
TL;DR: Loop-unrolling is almost always worth it in terms of performance.
A benchmark showcasing the impact of program size is also available on `tasm-lib`'s `performance-experiments` branch.
Great explanation, thank you. I was wondering exactly how program length impacts proving time, and the explanation of hash table height is extremely clear.
When looking at the numbers (`clock_cycle_count`, `hash_table_height`, `u32_table_height`, etc.): am I correct in assuming there are significant slowdowns at each power of 2? E.g., if I change a program that was originally using 1000 clock cycles and it's now using 1050 cycles, would that be more than a 5% slowdown, because the NTTs must be twice as large? Would I see much of a slowdown if the increase was only to 1015 cycles, or e.g. from 550 cycles to 1000 cycles? Assuming all other table heights are lower than the cycle count.
> Am I correct in assuming there are significant slowdowns at each power of 2?
Absolutely correct. In fact, power-of-2 boundaries are all that matter in practice; any other slowdown due to increased runtime[^1] pales in comparison.
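To spell out the arithmetic (my restatement; the padded height of 1024 for 1009 clock cycles is visible in the profile below): the tables are padded to the next power of two of the maximum table height,

$$\text{padded height} = 2^{\lceil \log_2(\text{max table height}) \rceil},$$

so $550 \mapsto 1024$, $1000 \mapsto 1024$, and $1015 \mapsto 1024$ all cost the same, while $1050 \mapsto 2048$ roughly doubles the work.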
If you're curious about the time taken by the various steps of proof generation (or verification), you can enable the internal performance profiler. Should that be of interest, take a look at the module `triton_vm::profiler`. :blush:
[^1]: or more generally, max table height
To give you an example of what such a profile could look like, here is the output of `cargo bench --bench prove_fib --no-default-features`:
```
### Prove Fibonacci 100 11.07s #Reps Share Category
├─Fiat-Shamir: claim 715.32µs 163 0.01% (hash – 0.03%)
├─derive additional parameters 2.11ms 163 0.02%
├─base tables 3.25s 163 29.35%
│ ├─create 467.00ms 163 4.22% (gen – 28.92%)
│ ├─pad 267.86ms 163 2.42% (gen – 16.59%)
│ │ ├─pad original tables 132.58ms 163 1.20%
│ │ └─fill degree-lowering table 134.32ms 163 1.21%
│ ├─randomize trace 156.40ms 163 1.41% (gen – 9.69%)
│ ├─LDE 707.90ms 163 6.39% (LDE – 32.91%)
│ │ ├─polynomial zero-initialization 799.89µs 163 0.01%
│ │ ├─interpolation 77.32ms 163 0.70%
│ │ ├─resize 303.61ms 163 2.74%
│ │ ├─evaluation 323.50ms 163 2.92%
│ │ └─memoize 62.94µs 163 0.00%
│ ├─Merkle tree 1.05s 163 9.49% (hash – 40.77%)
│ │ ├─leafs 734.26ms 163 6.63%
│ │ └─Merkle tree 316.27ms 163 2.86%
│ ├─Fiat-Shamir 4.61ms 163 0.04% (hash – 0.18%)
│ └─extend 594.26ms 163 5.37% (gen – 36.81%)
│ ├─initialize master table 213.48ms 163 1.93%
│ ├─slice master table 568.21µs 163 0.01%
│ ├─all tables 283.04ms 163 2.56%
│ └─fill degree lowering table 96.41ms 163 0.87%
├─ext tables 1.62s 163 14.62%
│ ├─randomize trace 129.05ms 163 1.17% (gen – 7.99%)
│ ├─LDE 496.85ms 163 4.49% (LDE – 23.10%)
│ │ ├─polynomial zero-initialization 186.74µs 163 0.00%
│ │ ├─interpolation 66.09ms 163 0.60%
│ │ ├─resize 225.86ms 163 2.04%
│ │ ├─evaluation 202.08ms 163 1.82%
│ │ └─memoize 70.24µs 163 0.00%
│ ├─Merkle tree 968.58ms 163 8.75% (hash – 37.57%)
│ │ ├─leafs 656.25ms 163 5.93%
│ │ └─Merkle tree 311.80ms 163 2.82%
│ └─Fiat-Shamir 24.11ms 163 0.22% (hash – 0.94%)
├─quotient calculation (cached) 832.10ms 163 7.51% (CC – 66.04%)
│ ├─zerofier inverse 260.23ms 163 2.35%
│ └─evaluate AIR, compute quotient codeword 571.21ms 163 5.16%
├─quotient LDE 946.18ms 163 8.54% (LDE – 43.99%)
├─hash rows of quotient segments 191.95ms 163 1.73% (hash – 7.45%)
├─Merkle tree 319.44ms 163 2.88% (hash – 12.39%)
├─out-of-domain rows 320.16ms 163 2.89%
├─Fiat-Shamir 17.64ms 163 0.16% (hash – 0.68%)
├─linear combination 425.32ms 163 3.84%
│ ├─base 100.12ms 163 0.90% (CC – 7.95%)
│ ├─ext 46.36ms 163 0.42% (CC – 3.68%)
│ └─quotient 182.07ms 163 1.64% (CC – 14.45%)
├─DEEP 389.35ms 163 3.52%
│ ├─base&ext curr row 159.92ms 163 1.44%
│ ├─base&ext next row 117.83ms 163 1.06%
│ └─segmented quotient 110.99ms 163 1.00%
├─combined DEEP polynomial 99.66ms 163 0.90%
│ └─sum 99.43ms 163 0.90% (CC – 7.89%)
├─FRI 1.20s 163 10.84%
└─open trace leafs 63.82ms 163 0.58%
### Categories
hash 2.58s 23.28%
LDE 2.15s 19.42%
gen 1.61s 14.58%
CC 1.26s 11.38%
Clock frequency is 14851 Hz (1009 clock cycles / (11074 ms / 163 iterations))
Optimal clock frequency is 15072 Hz (1024 padded height / (11074 ms / 163 iterations))
FRI domain length is 2^13
```
As you can see, witness generation accounts for roughly 15% of the total time, but low-degree extension ("LDE"), which is where the NTTs happen, and hashing dominate.
These timings look really good. Can you recommend any specific crates offhand for writing granular timings like this?
We're rolling our own solution for Triton VM. I'll make sure to forward your compliment to @aszepieniec. :blush:
We used to publish the profiler as its own crate[^1] but merged it back into the Triton VM crate to simplify our engineering efforts. We might publish it again at some point, but honestly, it's not really a priority… You are, of course, free to take the code from `triton_vm::profiler` and adapt it to your needs.
I've also seen rather pretty results from using the `tracing` ecosystem. When experimenting with it for a bit, I couldn't figure out how to accumulate the runtimes of function calls in loops, but that might just be a lack of understanding on my side.
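For reference, here is a minimal sketch of the kind of span timing I mean, using `tracing` plus `tracing-subscriber` (crate versions and the exact output format are assumptions on my part; `FmtSpan::CLOSE` makes the subscriber log each span's duration when it closes):

```rust
use tracing::info_span;
use tracing_subscriber::fmt::format::FmtSpan;

fn main() {
    // Log an event every time a span closes; the event carries the span's
    // busy/idle times, giving per-call timings similar to a profile report.
    tracing_subscriber::fmt()
        .with_span_events(FmtSpan::CLOSE)
        .init();

    {
        let _span = info_span!("lde").entered();
        // ... work to be timed ...
    } // <- span closes here and its duration is logged

    for i in 0..3 {
        // One span per iteration: each call is timed individually. To my
        // knowledge there is no built-in accumulation across iterations,
        // which is the limitation mentioned above.
        let _span = info_span!("iteration", i).entered();
        // ... loop body ...
    }
}
```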
[^1]: the readme there is not correct; somehow I had never noticed that