Another example, say I'm implementing a loop where the number of iterations is known at compile time. Is it strictly more performant to unroll this? The `call`, `skiz`, `recurse`, and `return` instructions would be removed, but the program source length would be ~10x longer (number of iterations = 10).
```
push 0        // _ 0                                  (loop counter i)
push 10       // _ 0 10                               (remaining iterations)
call loop_____0
halt          // _ 10

loop_____0:
              // _ i rem
push 1000     // _ i rem 1000                         (the loop "body")
dup 2         // _ i rem 1000 i
push 1        // _ i rem 1000 i 1
add           // _ i rem 1000 (i+1)
swap 3        // _ (i+1) rem 1000 i
pop 1         // _ (i+1) rem 1000
dup 1         // _ (i+1) rem 1000 rem
push -1       // _ (i+1) rem 1000 rem -1
add           // _ (i+1) rem 1000 (rem-1)
dup 0         // _ (i+1) rem 1000 (rem-1) (rem-1)
swap 3        // _ (i+1) (rem-1) 1000 (rem-1) rem
pop 1         // _ (i+1) (rem-1) 1000 (rem-1)
dup 2         // _ (i+1) (rem-1) 1000 (rem-1) (rem-1)
swap 2        // _ (i+1) (rem-1) (rem-1) (rem-1) 1000
pop 1         // _ (i+1) (rem-1) (rem-1) (rem-1)
pop 1         // _ (i+1) (rem-1) (rem-1)
skiz          // _ (i+1) (rem-1)   (skip `recurse` iff rem-1 == 0)
recurse
pop 1         // _ (i+1)
return
```
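For concreteness, a fully unrolled variant would repeat only the per-iteration work and drop all of the counter bookkeeping. A hypothetical sketch, assuming the `triton_program!` macro from the `triton_vm` crate:

```rust
use triton_vm::triton_program;

fn main() {
    // All `call`/`skiz`/`recurse`/`return` bookkeeping is gone; only the
    // per-iteration work (here: `push 1000 pop 1`) is repeated 10 times.
    let _unrolled = triton_program!(
        push 1000 pop 1   // iteration 1
        push 1000 pop 1   // iteration 2
        push 1000 pop 1   // iteration 3
        // ... seven more copies ...
        halt
    );
}
```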
Is there a performance advantage to using functions for repetitive code? E.g., is one of the following programs more performant than the other? From my understanding, they should both have similar program counters at the end of execution, with the exception that the program with the jumps has 4 more instructions (the `return`s and the `call`s).
I implemented both examples of your 1st comment in `tasm-lib` here. The benchmarks show:
no unrolling:

```json
{
  "name": "tasmlib_performance_experiments_no_unrolling",
  "benchmark_result": {
    "clock_cycle_count": 17,
    "hash_table_height": 12,
    "u32_table_height": 0,
    "op_stack_table_height": 10,
    "ram_table_height": 0
  }
}
```
with unrolling:

```json
{
  "name": "tasmlib_performance_experiments_with_unrolling",
  "benchmark_result": {
    "clock_cycle_count": 13,
    "hash_table_height": 18,
    "u32_table_height": 0,
    "op_stack_table_height": 10,
    "ram_table_height": 0
  }
}
```
So you're spot-on with your predicted difference of 4 clock cycles: unrolling saves 4 clock cycles in this case. Notice, however, that the `hash_table_height` is greater in the example with unrolling. That's because the VM first needs to hash the entire program that it is running, and this cost shows up as hash-table rows. So you do pay a price for a bigger program.

What the optimal solution is depends on your concrete problem. In my experience, loop unrolling is very often (almost always?) worth it. Does that conclusion transfer to all practical programs? I don't know.
Notice that `tasm-lib` performs its tests and benchmarks by wrapping everything in a call, which gives rise to two more clock cycles than you might otherwise expect, just in case you're counting and can't make sense of the absolute numbers.
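To make that wrapper overhead concrete, here is a sketch of the shape such a harness produces (hypothetical, not `tasm-lib`'s actual code; it assumes the `triton_program!` macro from the `triton_vm` crate, and `snippet` is a placeholder label):

```rust
use triton_vm::triton_program;

fn main() {
    // The code under test is entered via `call` and left via `return`.
    // Each of those two instructions costs one clock cycle, which is
    // where the two extra cycles in the benchmark numbers come from.
    let _wrapped = triton_program!(
        call snippet
        halt
        snippet:
            push 1000   // placeholder for the code under test
            pop 1
            return
    );
}
```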
> Another example, say I'm implementing a loop where the number of iterations is known at compile time. Is it strictly more performant to unroll this? The `call`, `skiz`, `recurse`, and `return` instructions would be removed, but the program source length would be ~10x longer (number of iterations = 10).
I would say it's usually strictly more performant to unroll a loop, yes. A common pattern is:
```
// _ i num_iterations a b c
loop_label:
    // _ i num_iterations a b c
    dup 4
    dup 4
    eq
    skiz
        return
    // _ i num_iterations a b c
    <loop body>
    // _ i num_iterations e f g

    /* Update `i` */
    swap 4
    push 1
    add
    swap 4
    // _ (i + 1) num_iterations e f g
    recurse
```
If the number of loop iterations is known at compile time, you save 9 clock cycles per loop iteration in the above example: the termination check (`dup 4`, `dup 4`, `eq`, `skiz`), the update of `i` (`swap 4`, `push 1`, `add`, `swap 4`), and `recurse`. Through clever use of pointer values and `recurse_or_return`, that number can often be reduced to 3, and in some cases to only 1; see the sketch below. So whether 1, 3, or 9 is a big number depends on the size of your loop body. If your loop body is 50 or more instructions, maybe it doesn't really matter. But if it's only a few instructions, you certainly want to avoid this bookkeeping if performance is something you care about.
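To illustrate the 1-cycle case, here is a sketch of the `merkle_step` + `recurse_or_return` idiom, based on my reading of the Triton VM ISA (`merkle_step` halves the node index in stack element 5; `recurse_or_return` acts like `recurse` while stack elements 5 and 6 differ, and like `return` once they are equal; `triton_asm!` is the macro from the `triton_vm` crate):

```rust
use triton_vm::triton_asm;

fn main() {
    // Ascending a Merkle tree: the loop body itself updates the loop
    // variable (the node index in ST5), so the only loop bookkeeping
    // left is `recurse_or_return`, a single clock cycle per iteration.
    let _merkle_ascend = triton_asm!(
        // _ 1 node_index [digest; 5]
        merkle_ascend:
            merkle_step          // one tree layer up; ST5 is halved
            recurse_or_return    // loop until ST5 (node index) equals ST6 (i.e., 1)
    );
}
```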
Every 10 words of program size cost 6 hash-table rows. Every instruction that does not take an argument has a size of 1 word; every instruction that does take an argument has a size of 2 words.
Notice that if you're not doing anything related to hashing (using the instructions `hash`, `merkle_step`, `sponge_absorb`, or `sponge_squeeze`), then even in straight-line code with no loops or branches, the clock cycle count will (almost always) dominate the hash-table row count. Imagine your program is 1000 instructions with an average size of 1.5 words per instruction. Then the hash-table row count from hashing the program will be $1000 \cdot 1.5 \cdot 6 / 10 = 900$, whereas the clock cycle count will be $1000$. And since you're only paying for the highest row count (in terms of the time it takes to generate the proof), only the clock cycle count matters in this case.
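Spelling out when that relation could flip (a back-of-the-envelope consequence of the cost rule above, not a claim made in the thread): for a straight-line program of $n$ instructions with an average size of $s$ words, hashing the program out-costs execution only if

$$n \cdot s \cdot \tfrac{6}{10} > n \quad\Longleftrightarrow\quad s > \tfrac{10}{6} \approx 1.67.$$

Since no instruction is larger than 2 words, this requires more than two thirds of all instructions to take an argument.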
TL;DR: Loop-unrolling is almost always worth it in terms of performance.
A benchmark showcasing the impact of program size is also available on `tasm-lib`'s `performance-experiments` branch.
Great explanation, thank you. I was wondering exactly how program length impacts proving time, and the explanation of hash table height is extremely clear.
When looking at the numbers (`clock_cycle_count`, `hash_table_height`, `u32_table_height`, etc.): am I correct in assuming there are significant slowdowns at each power of 2? E.g., if I change a program that was originally using 1000 clock cycles and it's now using 1050 cycles, would that be more than a 5% slowdown, because the NTTs must be twice as large? Would I see much of a slowdown if the increase was only to 1015 cycles, or e.g. from 550 cycles to 1000 cycles? Assuming all other table heights are lower than the cycle count.
> Am I correct in assuming there are significant slowdowns at each power of 2?
Absolutely correct. In fact, power-of-2 boundaries are all that matter in practice; any other slowdown due to increased runtime[^1] pales in comparison.
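To spell out the arithmetic (my restatement; the padded height of 1024 for 1009 clock cycles is visible in the profile below): the tables are padded to the next power of two of the maximum table height,

$$\text{padded height} = 2^{\lceil \log_2(\text{max table height}) \rceil},$$

so $550 \mapsto 1024$, $1000 \mapsto 1024$, and $1015 \mapsto 1024$ all cost the same, while $1050 \mapsto 2048$ roughly doubles the work.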
If you're curious about the time taken by the various steps of proof generation (or verification), you can enable the internal performance profiler. Should that be of interest, take a look at the module `triton_vm::profiler`. :blush:
[^1]: or more generally, max table height
To give you an example of what such a profile could look like, here is the output of `cargo bench --bench prove_fib --no-default-features`:
```
### Prove Fibonacci 100 11.07s #Reps Share Category
├─Fiat-Shamir: claim 715.32µs 163 0.01% (hash – 0.03%)
├─derive additional parameters 2.11ms 163 0.02%
├─base tables 3.25s 163 29.35%
│ ├─create 467.00ms 163 4.22% (gen – 28.92%)
│ ├─pad 267.86ms 163 2.42% (gen – 16.59%)
│ │ ├─pad original tables 132.58ms 163 1.20%
│ │ └─fill degree-lowering table 134.32ms 163 1.21%
│ ├─randomize trace 156.40ms 163 1.41% (gen – 9.69%)
│ ├─LDE 707.90ms 163 6.39% (LDE – 32.91%)
│ │ ├─polynomial zero-initialization 799.89µs 163 0.01%
│ │ ├─interpolation 77.32ms 163 0.70%
│ │ ├─resize 303.61ms 163 2.74%
│ │ ├─evaluation 323.50ms 163 2.92%
│ │ └─memoize 62.94µs 163 0.00%
│ ├─Merkle tree 1.05s 163 9.49% (hash – 40.77%)
│ │ ├─leafs 734.26ms 163 6.63%
│ │ └─Merkle tree 316.27ms 163 2.86%
│ ├─Fiat-Shamir 4.61ms 163 0.04% (hash – 0.18%)
│ └─extend 594.26ms 163 5.37% (gen – 36.81%)
│ ├─initialize master table 213.48ms 163 1.93%
│ ├─slice master table 568.21µs 163 0.01%
│ ├─all tables 283.04ms 163 2.56%
│ └─fill degree lowering table 96.41ms 163 0.87%
├─ext tables 1.62s 163 14.62%
│ ├─randomize trace 129.05ms 163 1.17% (gen – 7.99%)
│ ├─LDE 496.85ms 163 4.49% (LDE – 23.10%)
│ │ ├─polynomial zero-initialization 186.74µs 163 0.00%
│ │ ├─interpolation 66.09ms 163 0.60%
│ │ ├─resize 225.86ms 163 2.04%
│ │ ├─evaluation 202.08ms 163 1.82%
│ │ └─memoize 70.24µs 163 0.00%
│ ├─Merkle tree 968.58ms 163 8.75% (hash – 37.57%)
│ │ ├─leafs 656.25ms 163 5.93%
│ │ └─Merkle tree 311.80ms 163 2.82%
│ └─Fiat-Shamir 24.11ms 163 0.22% (hash – 0.94%)
├─quotient calculation (cached) 832.10ms 163 7.51% (CC – 66.04%)
│ ├─zerofier inverse 260.23ms 163 2.35%
│ └─evaluate AIR, compute quotient codeword 571.21ms 163 5.16%
├─quotient LDE 946.18ms 163 8.54% (LDE – 43.99%)
├─hash rows of quotient segments 191.95ms 163 1.73% (hash – 7.45%)
├─Merkle tree 319.44ms 163 2.88% (hash – 12.39%)
├─out-of-domain rows 320.16ms 163 2.89%
├─Fiat-Shamir 17.64ms 163 0.16% (hash – 0.68%)
├─linear combination 425.32ms 163 3.84%
│ ├─base 100.12ms 163 0.90% (CC – 7.95%)
│ ├─ext 46.36ms 163 0.42% (CC – 3.68%)
│ └─quotient 182.07ms 163 1.64% (CC – 14.45%)
├─DEEP 389.35ms 163 3.52%
│ ├─base&ext curr row 159.92ms 163 1.44%
│ ├─base&ext next row 117.83ms 163 1.06%
│ └─segmented quotient 110.99ms 163 1.00%
├─combined DEEP polynomial 99.66ms 163 0.90%
│ └─sum 99.43ms 163 0.90% (CC – 7.89%)
├─FRI 1.20s 163 10.84%
└─open trace leafs 63.82ms 163 0.58%
### Categories
hash 2.58s 23.28%
LDE 2.15s 19.42%
gen 1.61s 14.58%
CC 1.26s 11.38%
Clock frequency is 14851 Hz (1009 clock cycles / (11074 ms / 163 iterations))
Optimal clock frequency is 15072 Hz (1024 padded height / (11074 ms / 163 iterations))
FRI domain length is 2^13
```
As you can see, witness generation accounts for roughly 15% of the total time, but low-degree extension ("LDE"), which is where the NTTs happen, and hashing dominate.
These timings look really good. Can you recommend any specific crates offhand for writing granular timings like this?
We're rolling our own solution for Triton VM. I'll make sure to forward your compliment to @aszepieniec. :blush:
We used to publish the profiler as its own crate[^1] but merged it back into the Triton VM crate to simplify our engineering efforts. We might publish it again at some point, but honestly, it's not really a priority… You are, of course, free to take the code from `triton_vm::profiler` and adapt it to your needs.
I've also seen rather pretty results from using the `tracing` ecosystem. When experimenting with it for a bit, I couldn't figure out how to accumulate the runtimes of function calls in loops, but that might just be a lack of understanding on my side.
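For reference, here is a minimal sketch of the kind of span timing I mean, using `tracing` plus `tracing-subscriber` (crate versions and the exact output format are assumptions on my part; `FmtSpan::CLOSE` makes the subscriber log each span's duration when it closes):

```rust
use tracing::info_span;
use tracing_subscriber::fmt::format::FmtSpan;

fn main() {
    // Log an event every time a span closes; the event carries the span's
    // busy/idle times, giving per-call timings similar to a profile report.
    tracing_subscriber::fmt()
        .with_span_events(FmtSpan::CLOSE)
        .init();

    {
        let _span = info_span!("lde").entered();
        // ... work to be timed ...
    } // <- span closes here and its duration is logged

    for i in 0..3 {
        // One span per iteration: each call is timed individually. To my
        // knowledge there is no built-in accumulation across iterations,
        // which is the limitation mentioned above.
        let _span = info_span!("iteration", i).entered();
        // ... loop body ...
    }
}
```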
[^1]: the readme there is not correct; somehow I had never noticed that