Instruction Maybe-Wishlist for Recursion and Consensus

aszepieniec commented 3 months ago

This is a tracking issue. We add imagined instructions that could make recursion (or consensus programs) faster.

`read_mem_forward` ❌

BEFORE: *ptr
AFTER: (*ptr+n) [element]
With this instruction, 7 instructions inside the inner loop of (a static variant of) HashVarlen becomes 3.
Deprecated:
- hash_var_len is only performance critical when the number of to-be-hashed elements is known a-priori, making other approaches feasible
- dot_step (see below) eliminates the second next important use for read_mem_forward

`dot_step` ✅

BEFORE: _ acc2 acc1 acc0 *lhs *rhs
AFTER: _ acc2' acc1' acc0' (*lhs+3) (*rhs+3)
With this instruction, 8 instructions inside the inner loop of InnerProductOfThreeRowsWithWeights becomes 1. Stands to reduce `1M to ~65000 cycles.
272

`merkle_step` ✅

BEFORE: _ merkle_node_idx [Digest; 5]
AFTER: _ (merkle_node_idx // 2) [Digest'; 5]
Moves up one layer in a Merkle tree. A combination of divine_sibling and hash into one instruction.
Replaces instruction divine_sibling. Instruction hash remains available as-is.
Does not change the height of the op stack, in contrast to divine_sibling and hash, which change the height of the stack by 5 elements each.
Done: 01c04e52df88bd7277c12473c001467dd8369ccf

`get_colinear_y`

BEFORE: _ ax [ay] bx [by] [cx] (possibly different order)
AFTER: _ [cy]
Where $c_y = \frac{a_y-b_y}{a_x-b_x} \cdot (c_x - a_x) + a_y$
This instruction can reduce the inner loop of compute_c_values from 74 instructions to 49. Total cycle count reduction: ~26000.

`recurse_or_return` ✅

BEFORE: _ a b
AFTER: _ a b
Acts like recurse if $a \neq b$. Else, acts like return.
Replaces checks of the form dup <m> dup <n> eq skiz return <loop_body> recurse, reducing the op stack delta of loop maintenanance.
288

`absorb_from_mem` ✅

BEFORE: _ mem_ptr [garbage; 3]
AFTER: _ (mem_ptr - 10) [garbage; 3]
Reads 10 elements from RAM, adds them to the Sponge state, and applies the Tip5 permutation to the Sponge state.
Can be seen as replacement for read_mem 5 read_mem 5 sponge_absorb, albeit not a drop-in replacement due to the [garbage; 3], which is needed for arithmetization reasons.
Reduces the op stack delta from 10 to 0 compared to the instruction sequence it is intended to replace.
Done: 6dd9b54aa5558c369a9525fb294454d929c13a9f

Sword-Smith commented 3 months ago

It turns out the the opstack table height is the bottle-neck in the verifier, not the clock cycle count (which is recorded by the processor table). So counting number of clock cycles that a new instruction would shave off might not be the relevant metric.

I was considering a recurse instruction that acted as the current recurse if the top to stack elements are different and acts as return if the two top stack elements are the same. That would probably shave off a good amount of both opstack table rows since the start of almost all our loops looks like this:

loop_label:
    dup <m>
    dup <n>
    eq
    skiz return

    <loop body>
    recurse

This new recurse instruction would save 4 opstack table rows and 4 clock cycles for each loop iteration.

Sword-Smith commented 1 month ago

I guess my view has changed a bit.

The recurse_or_return I think would work best is one that returns if ST[5] == 1. Otherwise it recurses. That could be used in combination with merkle_step to reduce the loop body to 2 instructions.

A loop that walks up a Merkle tree would then be:

auth_path_loop:
    merkle_step
    recurse_or_return

I understand that this creates a problem for Merkle trees of height 0, with only one node. But I think that's not a practical problem.

edit: but let's see where we stand after having added the dot_steps and the merkle_step.

TritonVM / triton-vm