TritonVM / triton-vm

Triton is a virtual machine that comes with Algebraic Execution Tables (AET) and Arithmetic Intermediate Representations (AIR) for use in combination with a STARK proof system.
https://triton-vm.org
Apache License 2.0
247 stars 37 forks source link

Introduce instructions `pick` + `i` and `place` + `i` #297

Closed jan-ferdinand closed 2 months ago

jan-ferdinand commented 5 months ago

The two new instructions simplify manipulation of the operational stack. Instruction pick + i moves the indicated stack element to the top of the stack. Instruction place + i is its dual, moving the top of the stack to the indicated position.

old op stack new op stack
pick + i e.g., _ d x c b a e.g., _ d c b a x
place + i e.g., _ d c b a x e.g., _ d x c b a
jan-ferdinand commented 3 months ago

Arguments in favor of introducing the new instructions:

Arguments against introducing the new instructions:

[^1]: Replacing the two mentioned patterns with pick & place would save an estimated 1.6% the number of executed instructions. [^2]: yes, actually

aszepieniec commented 3 months ago

To make the strongest possible argument in favor of this PR, one has to argue that the following cost metrics either don't increase or are worth it:

Do we have numbers or arguments relating to these cost metrics?

jan-ferdinand commented 3 months ago

The arithmetization overview can answer the first two questions:

The arithmetization overview also tells us that evaluating the new constraints would introduce 2382 additional rows in the processor table when evaluating the AIR in assembly.

The question regarding opcode space is slightly more involved because of instruction categories. The relevant categories are currently (without pick and place) filled as follows:

u32? shrink? size num
non-u32 no shrink single word 13
non-u32 no shrink double word 7
non-u32 shrink stack single word 11
non-u32 shrink stack double word 3
u32 no shrink single word 4
u32 no shrink double word 0
u32 shrink stack single word 4
u32 shrink stack double word 0

Within each category, the maximum is 16. The instructions pick and place both fall in the category (non-u32, no-shrink, double word), increasing its size from 7 to 9.

aszepieniec commented 3 months ago

To merge or not to merge? Here is an argument in favor worth considering.

The 1.6% fewer cycles translates to rows in the processor table. Assume that translates to a smaller table height before padding. (And even if this assumption is wrong the argument still kinda holds.) We get this reduced row count in exchange for 0.7% (=1 - 607/611) performance drop in two big steps, namely NTT and row hashing. While this PR does not move the row count across the next power-of-two threshold, it can achieve that in combination with other similar improvements.

I propose to merge this and future similar improvements. And then, at a later stage when we crossed the next power of two boundary, we can figure out which instructions to drop without crossing it back. In the mean time we can

Sword-Smith commented 3 months ago

To merge or not to merge? Here is an argument in favor worth considering.

The 1.6% fewer cycles translates to rows in the processor table. Assume that translates to a smaller table height before padding. (And even if this assumption is wrong the argument still kinda holds.) We get this reduced row count in exchange for 0.7% (=1 - 607/611) performance drop in two big steps, namely NTT and row hashing. While this PR does not move the row count across the next power-of-two threshold, it can achieve that in combination with other similar improvements.

How did @jan-ferdinand get to the 1.6 % fewer cycles? Do they include the added cost of the AIR-constraint evaluation? Because of this AIR-constraint evaluation cost, this PR adds 2 % to the opstack table 0.8 % to the processor table, and 0.6 % to the memory table. With the above-mentioned 0.7 % of additional prover time because of extra columns, I would say merging this is neutral from a performance perspective (assuming I didn't misunderstand anything).

I'm in favor of advancing on the consensus programs and only after those have been written, reconsider these instruction. But it's not a strong preference. I'm happy to let the majority, or just @jan-ferdinand, decide.

Sword-Smith commented 3 months ago

For reference, here is a table row count for the current verifier, run on Triton-VM 0.42 alpha 6.

[
  {
    "name": "tasmlib_verifier_stark_verify_inner_padded_height_524288_fri_exp_4",
    "benchmark_result": {
      "clock_cycle_count": 285183,
      "hash_table_height": 234025,
      "u32_table_height": 211197,
      "op_stack_table_height": 235124,
      "ram_table_height": 281599
    },
    "case": "CommonCase"
  }
]
jan-ferdinand commented 3 months ago

Please note that yesterday's comment was not meant to try to conclude the question regarding merging this PR; I mostly wanted to jot down some findings wherever they fit best. Some of this is here. However, more of the findings are in this comment on #259, where I try to analyse the recufier's execution trace in order to identify common patterns that might be worth introducing additional instructions for. The method is somewhat crude right now, and might not be the right tool for all questions in this line.

jan-ferdinand commented 3 months ago

How did @jan-ferdinand get to the 1.6 % fewer cycles?

I analysed how often a swap pattern that could be replaced by a pick & place pattern occured when executing the recufier, then used the respective patterns lengths as well as the total number of executed instructions.

Do the [1.6% reduced number of instructions] include the added cost of the AIR-constraint evaluation?

No. That's part of the (admittedly overly short) con “arithmetic overhead.”