Open bobbinth opened 1 year ago
This is great! I think that this is definitely the way to go but I have a few remarks:
To simplify notation, we can write
$$ \sum_{i=0}^k{\left(\alpha_i \cdot \frac{T_i(x) - T_i(z)}{x - z} +\beta_i \cdot \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
as
$$ \sum_{i=0}^k{\left(\alpha_i \cdot A_i +\beta_i \cdot B_i \right)} $$
Now, if we decide to go with one random value, call it $\alpha$, to compute the whole sum we can write the above as
$$ \sum_{i=0}^k{\left(\alpha^{2i} \cdot A_i +\alpha^{2i+1} \cdot B_i \right)} $$
which is equal to
$$ \sum_{i=0}^k{\alpha^{2i} \cdot A_i} +\alpha\left(\sum_{i=0}^k{\alpha^{2i} \cdot B_i}\right) $$
and can be rewritten as
$$ \sum_{i=0}^k{\zeta^{i} \cdot A_i} +\alpha\left(\sum_{i=0}^k{\zeta^{i} \cdot B_i}\right) $$
where $\zeta := \alpha^2$. Now we see that we have two polynomials evaluated at $\zeta$ and we can use Horner's rule to make the instructions depend only on $\alpha$. Indeed, the update equations can be changed to $$p' = p \cdot \zeta + A_i$$ and $$r' = r \cdot \zeta + B_i$$
and the final step $$\frac{p}{x - z} + \alpha\frac{r}{x - z \cdot g}$$
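As a quick numerical sanity check of these update equations, the following plain-Python sketch (over Miden's Goldilocks base field rather than the extension field, with arbitrary values) verifies that the two Horner accumulators reproduce the batched sum when the $A_i$, $B_i$ are consumed in reverse order:

```python
# Sanity check: with zeta = alpha^2 and updates p' = p*zeta + A_i,
# r' = r*zeta + B_i (values consumed in REVERSE order), the final
# p + alpha*r equals sum_i (alpha^(2i)*A_i + alpha^(2i+1)*B_i).
P = 2**64 - 2**32 + 1  # Goldilocks prime (Miden's base field)

A = [3, 1, 4, 1, 5, 9, 2, 6]   # the A_i, arbitrary
B = [2, 7, 1, 8, 2, 8, 1, 8]   # the B_i, arbitrary
alpha = 123456789
zeta = alpha * alpha % P

direct = sum(pow(alpha, 2 * i, P) * A[i] + pow(alpha, 2 * i + 1, P) * B[i]
             for i in range(len(A))) % P

p = r = 0
for a_i, b_i in zip(reversed(A), reversed(B)):  # reverse order!
    p = (p * zeta + a_i) % P
    r = (r * zeta + b_i) % P
horner = (p + alpha * r) % P
assert direct == horner
```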
One cumbersome thing about the above is the fact that we need the values to be in reverse order, hence I am not sure how this would fit into the whole thing. However, there is a nice solution, which is to generate the randomness in the reverse order. Unfortunately, this will require changing things at the level of the prover too.
- The above relies heavily on
MSTREAM
and the ability to load from memory into helper registers using a pointer on the top of the stack. I don't remember having seen these operations before. Do we have them documented somewhere?
It actually doesn't seem like we have it documented anywhere besides the code itself (i.e., here). It is part of the mem_stream assembly instruction, which is basically MSTREAM HPERM.
- The way the random challenge $\alpha$ is updated should be changed to $\alpha_n = \alpha_o \cdot \alpha^2$, where $\alpha_o$ is the old value and $\alpha_n$ is the new value. This also explains why the powers of $\alpha$ should not grow exponentially. One potential way to solve this would be to keep the (one and only) $\alpha$ in the helper registers. Since we use only one $\alpha$, this should be done only once, but I am not sure if this can be done in the first place.
This would be somewhat tricky to do. We can't really put $\alpha$ in the helper registers because these don't persist from one cycle to the next and we don't have anything on the main part of the stack that could "bind" helper register values to the value of $\alpha$ (the reason we can do this with $T_i(z)$ values is because z_addr
is on the stack).
One way we can do it is by modifying the RCOMB1 operation to look as follows:
Here, we put both $\alpha$ and $\alpha_i$ onto the stack. This allows us to compute the next value of $\alpha_i$ as:
$$ \alpha_i' = \alpha_i \cdot \alpha^2 $$
But this comes at the expense of reducing the number of $T_i(x)$ values on the stack from 8 to 6. The other 2 values would need to be handled manually (i.e., by saving them to local memory or something like that). This feels a bit "hacky" and also adds extra cycles - i.e., we'd probably go from 7K cycles to 10K cycles.
Another approach is to use your observation about Horner evaluation and use the same $\zeta$ for both running sums. For example, we could pre-compute even powers of $\alpha$ and save them to memory. Then, instead of putting $\alpha$ in the main stack, we'd put memory address for it onto the main stack and the actual value into helper registers. This modified operation would look like so:
Having to have the value in reverse order is indeed a problem as I'm not sure how to do that efficiently.
Another approach is to use your observation about Horner evaluation and use the same ζ for both running sums. For example, we could pre-compute even powers of α and save them to memory. Then, instead of putting α in the main stack, we'd put memory address for it onto the main stack and the actual value into helper registers. This modified operation would look like so:
Why not put $\zeta$ (or in fact just $\alpha$, at the cost of a higher-degree constraint) on the stack instead of a_addr? That's actually the point of using Horner.
Putting the pointer to a series of powers of $\alpha$ would in fact allow us to avoid the problem of reading the values in reverse order, since we won't need to use Horner.
One more thing which is both simplifying and good for the soundness of FRI: according to equation (11) here, @ulrich-haboeck shows that one can simplify the problem to
$$ \sum_{i=0}^k{\alpha^i \left( \frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
This reduces the degree of the constraints as well.
Why not put $\zeta$ (or in fact just $\alpha$, at the cost of a higher-degree constraint) on the stack instead of a_addr? That's actually the point of using Horner.
If we put something else in place of a_addr
then we can't put the $a_i$ value in the helper registers - something on the main stack should define which values go into the helper registers.
Putting the pointer to a series of powers of $\alpha$ would in fact allow us to avoid the problem of reading the values in reverse order, since we won't need to use Horner.
Yeah, as I think about it more, reading $\alpha$ values from memory seems to be the best approach. So, the instruction would look like this:
If we compute the random linear combination as:
$$ \sum_{i=0}^k{\alpha_i \left( \frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
We can actually use independently drawn random values and don't need to give up any bits of security.
Or, even if we have to use different values for the left and right terms, we can use just a single power like so:
$$ \sum_{i=0}^k{\left( \alpha_i \frac{T_i(x) - T_i(z)}{x - z} + \alpha_i^2 \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
In the above, all $\alpha_i$ would still be independently drawn - but we use a square of $\alpha_i$ for the second term. Security impact of this should be negligible, I believe (but would be good to confirm).
The only drawback of going with this approach that I can think of is that we need to perform an extra memory read (2 reads total) per cycle. But we already do this for MSTREAM
- so, most of the machinery for this has already been built.
If we put something else in place of
a_addr
then we can't put the $a_i$ value in the helper registers - something on the main stack should define which values go into the helper registers.
I am not sure I follow; with Horner there is no need for $a_i$ in the first place. Only the original random value needs to be preserved on the stack. This is because the update equations, as mentioned above, are:
$$ p = \frac{T_i(x) - T_i(z)}{x - z} + p\cdot\alpha $$
and
$$ r = \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g}+ r\cdot\alpha $$
Ah yes! This does mean that if we go with Horner evaluation we'd be able to use a single $\alpha$ on the stack. This would simplify things. The only challenge remaining with Horner approach is that column values need to be in reverse order - or can we avoid it somehow?
Unfortunately, I don't see a way to avoid that in a clean way. Moreover, using powers of a single random value for FRI batching degrades the soundness of the commit phase in a non-negligible way. This might not be a problem when using the cubic extension field but is a significant issue for the quadratic extension case (in the list-decoding regime). Thus I think that keeping a pointer to a list of random values $\alpha_i$, one for each term of the random linear combination, is the way to go. The last remaining question is then which of the two ways of using $\alpha_i$ we should use. If the second option (the one without $\alpha_i^2$) works, then I think it should be the preferred option. I have run some initial calculations and it seems that the soundness of
$$ \sum_{i=0}^k{\alpha_i \left( \frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
and
$$ \sum_{i=0}^k{\alpha_i \left( \frac{T_i(x) - T_i(z)}{x - z} \right) + \beta_i \left(\frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
is indistinguishable in our setting provided the argument for using the first form is valid.
One small issue relating to the random elements $\alpha_i$ in
$$ \sum_{i=0}^k{\alpha_i \left( \frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
relates to the fact that for each column $i$ we are using one $\alpha_i$, but the $\alpha_i$'s are usually stored as pairs [ɑ0, ɑ1, ɑ0', ɑ1'].
This can be overcome by adding an additional input (the last available slot on the stack) to the candidate instruction that is a boolean variable $b$ that alternates between $0$ and $1$ to indicate which of the two alphas to choose.
One small issue relating to the random elements $\alpha_i$
Ah - indeed! Thank you for catching it!
This can be overcome by adding an additional input (the last available slot on the stack) to the candidate instruction that is a boolean variable $b$ that alternates between $0$ and $1$ to indicate which of the two alphas to choose.
Unfortunately, I'm not sure we'll be able to do it this way. The main reason is that to prove consistency of memory reads we need to provide values for all 4 elements of the word (so, we need to put $\alpha_0$, $\alpha_1$, $\alpha_0'$, $\alpha_1'$ somewhere on the stack or into helper registers). There are two potential ways to handle this:
- We could save the random values to memory as [ɑ0, ɑ1, 0, 0]. This will work for the memory consistency check (as we can hardcode zeros into the constraints) but will make generating/saving random values more expensive (probably doubling the cost).
- If we are willing to "hard-code" the width of the trace, then we can drop either z_addr or a_addr, by laying out the randomness right after the OOD frame, and use an offset instead. Then we would have two available slots on the stack which we can use for keeping the other random element ($\beta$). In this case, the sum that makes most sense is
$$ \sum_{i=0}^k{\left(\alpha_i \cdot \frac{T_i(x) - T_i(z)}{x - z} +\beta_i \cdot \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
This, unfortunately, will mean that we will have to double the number of random elements.
If we are willing to roughly double the cost of generating randomness for this part (which I think in our case would result in about 500 extra cycles), I'd probably look into creating a version of this procedure to save randomness to memory as [ɑ0, ɑ1, 0, 0]. For example, we could change this line to something like:
# save R1 to mem[dest] and mem[dest + 1]
swapw movup.3 movup.3 push.0.0 dup.15 mem_storew
movup.5 movup.5 movup.3 movup.3 dup.15 add.1 mem_storew drop drop
The above adds 11 cycles to the loop (and doesn't handle everything) - but maybe there is a better way to do something like that.
I agree; this also has the advantage of not needing any "hardcoding". All in all, I think this concludes the last remaining question/issue related to the design of this new op.
Some different variations on the above problem are gathered in what follows. This is done in order to inform the design of the proposed instruction in the hope of making it as general/useful as possible.
Fix an $x$ in the LDE domain and let
$$ \mathsf{S} := \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)}. $$
Then
$$
\begin{aligned}
\mathsf{S} &= \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z)}{x - z} \right)} + \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} \\
&= \sum_{i=0}^k {\left(\frac{\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z)}{x - z} \right)} + \sum_{i=0}^k{\left(\frac{\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z \cdot g)}{x - z \cdot g} \right)} \\
&= \frac{1}{x - z}\sum_{i=0}^k \left(\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z) \right) +\frac{1}{x - z \cdot g} \sum_{i=0}^k\left(\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z \cdot g) \right) \\
&= \frac{1}{x - z} \left(\mathsf{ip}_x - \mathsf{ip}_z \right) +\frac{1}{x - z \cdot g} \left(\mathsf{ip}_x - \mathsf{ip}_{gz} \right)
\end{aligned}
$$

where

$$ \mathsf{ip}_x := \langle\alpha_{\cdot}, T_{\cdot}(x)\rangle = \sum_{i=0}^k\alpha_i\cdot T_i(x) $$

$$ \mathsf{ip}_z := \langle\alpha_{\cdot}, T_{\cdot}(z)\rangle = \sum_{i=0}^k\alpha_i\cdot T_i(z) $$

$$ \mathsf{ip}_{gz} := \langle\alpha_{\cdot}, T_{\cdot}(z \cdot g)\rangle = \sum_{i=0}^k\alpha_i\cdot T_i(z \cdot g) $$

Notice the following:
- $\mathsf{ip}_z$ and $\mathsf{ip}_{gz}$ are independent of $x$ and as such they can be computed once and for all FRI queries.
- The only quantities depending on $x$ are $\mathsf{ip}_x$, $\frac{1}{x - z}$ and $\frac{1}{x - z \cdot g}$.
- Computing $\mathsf{ip}_x$, $\mathsf{ip}_z$ or $\mathsf{ip}_{gz}$ amounts to the same as computing an inner product of two vectors.

Instead of working with

$$ \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$

we can work instead with

$$ \sum_{i=0}^k\alpha_i {\frac{T_i(x) - p_i(x)}{(x - z)(x - z \cdot g)}} $$
where $p_i(x) = a_i + b_i\cdot x$ is the line interpolating $\{\left(z, T_i(z)\right), \left(z \cdot g, T_i(z \cdot g)\right)\}$.
As above, this can be written as
$$ \frac{1}{(x - z)(x - z \cdot g)} \left(\mathsf{ip}_x - \left(\mathsf{ip}_a + x\mathsf{ip}_b\right)\right) $$
Again:
- $\mathsf{ip}_a$ and $\mathsf{ip}_{b}$ are independent of $x$ and as such they can be computed once and for all FRI queries.
- The only quantities depending on $x$ are $\mathsf{ip}_x$, $\frac{1}{(x - z)(x - z \cdot g)}$ and the evaluation $\left(\mathsf{ip}_a + x\mathsf{ip}_b\right)$.
- Computing $\mathsf{ip}_x$, $\mathsf{ip}_a$ or $\mathsf{ip}_{b}$ amounts to the same as computing an inner product of two vectors.

The advantage of the one-quotient method over the previous method is the fact that it uses only one quotient, i.e., one division. This comes at the cost of needing to compute $a_i$ and $b_i$ from $\{\left(z, T_i(z)\right), \left(z \cdot g, T_i(z \cdot g)\right)\}$, but only once for each $i$, and one extension-field multiply-add operation to compute $\left(\mathsf{ip}_a + x\mathsf{ip}_b\right)$ at each $x$.
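Both rewritings can be checked numerically. The following plain-Python sketch (Goldilocks base field, arbitrary random columns; the VM actually works in an extension field) verifies the two-quotient inner-product identity and the one-quotient line-interpolation identity:

```python
# Check both rewritings over the Goldilocks prime:
#   (1) sum_i a_i [(T_i(x)-T_i(z))/(x-z) + (T_i(x)-T_i(zg))/(x-zg)]
#          == (ip_x - ip_z)/(x-z) + (ip_x - ip_gz)/(x-zg)
#   (2) sum_i a_i (T_i(x) - p_i(x)) / ((x-z)(x-zg))
#          == (ip_x - (ip_a + x*ip_b)) / ((x-z)(x-zg))
P = 2**64 - 2**32 + 1  # Goldilocks prime
inv = lambda v: pow(v, P - 2, P)
import random
random.seed(0)

k, x, z, g = 5, 11, 7, 3
zg = z * g % P
ev = lambda c, pt: sum(ci * pow(pt, d, P) for d, ci in enumerate(c)) % P
polys = [[random.randrange(P) for _ in range(4)] for _ in range(k)]  # random "columns"
Tx = [ev(c, x) for c in polys]
Tz = [ev(c, z) for c in polys]
Tzg = [ev(c, zg) for c in polys]
alphas = [random.randrange(P) for _ in range(k)]

# (1) two quotients vs. inner-product form
lhs1 = sum(al * ((Tx[i] - Tz[i]) * inv(x - z) + (Tx[i] - Tzg[i]) * inv(x - zg))
           for i, al in enumerate(alphas)) % P
ipx = sum(al * t for al, t in zip(alphas, Tx)) % P
ipz = sum(al * t for al, t in zip(alphas, Tz)) % P
ipgz = sum(al * t for al, t in zip(alphas, Tzg)) % P
rhs1 = ((ipx - ipz) * inv(x - z) + (ipx - ipgz) * inv(x - zg)) % P
assert lhs1 == rhs1

# (2) one quotient with interpolating lines p_i(x) = a_i + b_i * x
b = [(Tz[i] - Tzg[i]) * inv(z - zg) % P for i in range(k)]  # slopes
a = [(Tz[i] - b[i] * z) % P for i in range(k)]              # intercepts
D = (x - z) * (x - zg) % P
lhs2 = sum(al * (Tx[i] - (a[i] + b[i] * x)) for i, al in enumerate(alphas)) * inv(D) % P
ipa = sum(al * ai for al, ai in zip(alphas, a)) % P
ipb = sum(al * bi for al, bi in zip(alphas, b)) % P
rhs2 = (ipx - (ipa + x * ipb)) * inv(D) % P
assert lhs2 == rhs2
```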
The case of $\geq 3$ evaluation points

Sometimes we might be interested in evaluations of some of the $T_i$'s at more than just $2$ points. For the sake of concreteness, assume $T_k$ is evaluated at $4$ points. This means that the last term in the above sum will be

$$ \alpha_k \left(\frac{T_k(x) - T_k(z_0)}{x - z_0} + \frac{T_k(x) - T_k(z_1)}{x - z_1} + \frac{T_k(x) - T_k(z_2)}{x - z_2} + \frac{T_k(x) - T_k(z_3)}{x - z_3} \right) $$
If we want to avoid the four divisions, we can instead work with
$$ \frac{\alpha_k}{\left(x - z_0\right)\cdot\left(x - z_1\right)\cdot\left(x - z_2\right)\cdot\left(x - z_3\right)} \left({T_k(x) - p_k(x)} \right) $$
where $p_k$ is the polynomial of degree at most $3$ interpolating the points $(z_i, T_k(z_i))$. Again $p_k$ needs to be computed only once for all $x$'s.
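The divisibility claim behind the single quotient can be checked directly. In this plain-Python sketch (Goldilocks base field, arbitrary values), $p_k$ is obtained by Lagrange interpolation and $T_k(X) - p_k(X)$ is divided by each $(X - z_i)$ in turn, leaving a zero remainder every time:

```python
# p_k interpolates T_k at the opening points, so T_k(X) - p_k(X) is
# divisible by prod_i (X - z_i): dividing by each root via synthetic
# division must leave a zero remainder every time.
P = 2**64 - 2**32 + 1  # Goldilocks prime
inv = lambda v: pow(v, P - 2, P)
import random
random.seed(1)

def lagrange(points):
    """Coefficients (low to high) of the interpolant through `points`."""
    m = len(points)
    coeffs = [0] * m
    for i, (xi, yi) in enumerate(points):
        basis, denom = [1], 1            # prod_{j != i} (X - x_j) / (x_i - x_j)
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            basis = [(sh - xj * cur) % P for sh, cur in zip([0] + basis, basis + [0])]
            denom = denom * (xi - xj) % P
        scale = yi * inv(denom) % P
        for d, bd in enumerate(basis):
            coeffs[d] = (coeffs[d] + scale * bd) % P
    return coeffs

ev = lambda c, pt: sum(ci * pow(pt, d, P) for d, ci in enumerate(c)) % P
T = [random.randrange(P) for _ in range(8)]   # T_k as a random degree-7 polynomial
zs = [5, 10, 20, 40]                          # the 4 opening points z_0..z_3
p = lagrange([(zi, ev(T, zi)) for zi in zs])
assert all(ev(p, zi) == ev(T, zi) for zi in zs)

num = [(t - q) % P for t, q in zip(T, p + [0] * (len(T) - len(p)))]
for zi in zs:
    assert ev(num, zi) == 0                   # remainder of division by (X - zi)
    acc, out = 0, []
    for coef in reversed(num):                # synthetic division by (X - zi)
        acc = (acc * zi + coef) % P
        out.append(acc)
    num = list(reversed(out[:-1]))            # quotient, low to high
assert len(num) == len(T) - len(zs)           # degree dropped by 4
```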
One problem with the above is obtaining $p_k$ from the evaluations $(z_i, T_k(z_i))$ when the number of evaluations is relatively large. Here are two potential solutions to this:
Regarding the question of computing the coefficients of the $p_k$ polynomial non-deterministically, the following is an attempt at solving it. Assume that the following procedure costs one cycle:
#! +-------+-------+-------+-------+-------+-------+-------+-------+------+------+------+------+------+------+------+---+
#! | C01 | C00 | C11 | C10 | C21 | C20 | C31 | C30 | x1 | x0 | acc1 | acc0 |c_addr| ctr | - | - |
#! +-------+-------+-------+-------+-------+-------+-------+-------+------+------+------+------+------+------+------+---+
#!
#! ||
#! \/
#!
#! +---+-------+-------+-------+-------+-------+-------+-------+-------+------+------+-------+-------+------+-----+-----+
#! | b | C01 | C00 | C11 | C10 | C21 | C20 | C31 | C30 | x1 | x0 | acc1' | acc0' |c_addr| ctr'| - |
#! +---+-------+-------+-------+-------+-------+-------+-------+-------+------+------+-------+-------+------+-----+-----+
#!
#! where:
#!
#! 1. acc' = (acc0', acc1') := ((((acc * x + c3) * x + c2) * x + c1) * x) + c0
#! 2. ctr' = ctr + 4
#! 3. b = 0 if ctr' == 0 else 1
#!
#! Cycles: 1
export.horner_eval
end
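To clarify the intended semantics, here is a plain-Python model of this hypothetical horner_eval step (the function name and the batch-of-4 convention follow the comment above; values are arbitrary):

```python
# Model of the hypothetical `horner_eval` step: fold one batch of 4
# coefficients (highest-degree first) into the Horner accumulator and
# advance the counter; b flags whether more batches remain.
P = 2**64 - 2**32 + 1  # Goldilocks prime

def horner_eval_step(batch, x, acc, ctr):
    """batch = [c3, c2, c1, c0], higher-degree coefficients first."""
    c3, c2, c1, c0 = batch
    acc = ((((acc * x + c3) * x + c2) * x + c1) * x + c0) % P
    ctr += 4
    b = 0 if ctr == 0 else 1
    return acc, ctr, b

coeffs = [9, 8, 7, 6, 5, 4, 3, 2]   # low to high: 9 + 8x + ... + 2x^7
x = 3
hi_to_lo = list(reversed(coeffs))   # batches are consumed highest first
acc, ctr = 0, -len(coeffs)          # ctr starts at -(number of coefficients)
for i in range(0, len(coeffs), 4):
    acc, ctr, b = horner_eval_step(hi_to_lo[i:i + 4], x, acc, ctr)
assert b == 0 and ctr == 0          # all batches consumed
expected = sum(c * pow(x, d, P) for d, c in enumerate(coeffs)) % P
assert acc == expected
```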
Then, for a trace of length $2^{20}$ i.e., $21$ evaluation points, we can do as follows:
#! Input: [COEFF_PTR, EVALS_PTR, z1, z0, ...]
#! Output: [...]
#! Cycles: 1350
export.check_transformation_21
# 0) Push the number of coefficients for later Horner evaluation
push.24 neg
dup.1
#=> [COEFF_PTR, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...]
# 1) Use `adv_pipe` to load the coefficients
## a) Prepare stack
dup padw padw padw
## b) Load the coefficients from the advice stack
# The coefficients are loaded in reverse order and they are padded with 0-coefficients
# for the highest monomials in order to make the number of coefficients divisible by 4
repeat.6
adv_pipe hperm
end
#=> [Y, y, y, t1, t0, Y, y, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...]
# 2) Evaluate the polynomial using the coefficients at (t0, t1)
## a) Reset the pointer to point to the coefficients
swapw.3 drop dup.1
#=> [Y, y, y, t1, t0, Y, COEFF_PTR, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...]
## b) Prepare the stack for Horner evaluation by creating an accumulator to hold the evaluation
swapw drop drop push.0 movdn.2 push.0 movdn.2
#=> [t1, t0, 0, 0, Y, Y, COEFF_PTR, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...]
swapw.2
#=> [Y, Y, t1, t0, 0, 0, COEFF_PTR, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...]
## c) Evaluate
push.1
while.true
mem_stream horner_eval
end
#=> [Y, Y, t1, t0, res1, res0, COEFF_PTR, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...] where (res0, res1) is the evaluation of the polynomial
# at (t0, t1) using the coefficients.
## d) Clean up the stack
dropw dropw
#=> [t1, t0, res1, res0, COEFF_PTR, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...]
swapw drop drop swap
#=> [EVALS_PTR, COEFF_PTR, t1, t0, res1, res0, z1, z0, ...]
# 3) Evaluate the polynomial using the Barycentric (2nd) formula at t/z where t is the random challenge
# and z is the OOD point. We evaluate at t/z in order to have the evaluation points of the Barycentric
# formula be independent of z, and hence we can hardcode the evaluation points and the barycentric weights
# for increased performance.
## a) Compute t/z
movup.3 movup.3 movup.7 movup.7
#=> [z1, z0, t1, t0, EVALS_PTR, COEFF_PTR, res1, res0, ...]
ext2div ext2mul
#=> [zeta1, zeta0, EVALS_PTR, COEFF_PTR, res1, res0, ...]
## b) Create accumulators
padw
#=> [q1, q0, p1, p0, zeta1, zeta0, EVALS_PTR, COEFF_PTR, res1, res0, ...] where p holds the numerator and q holds the denominator
## c) Load evaluations i.e. OOD values
padw dup.10 add.1 swap.11
#=> [v1, v0, u1, u0, q1, q0, p1, p0, zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
## d) Compute the 0-th term
### i) move up zeta to top
dup.9 dup.9
#=> [zeta1, zeta0, v1, v0, u1, u0, q1, q0, p1, p0, zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
### ii) Push the 0-th evaluation point xj
swap push.1 neg add swap
#=> [zeta1, zeta0 - xj0, v1, v0, u1, u0, q1, q0, p1, p0, zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
ext2inv
#=> [den1, den0, v1, v0, u1, u0, q1, q0, p1, p0, zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...] where (den0, den1) = (zeta0, zeta1) - (xj, 0)
### iii) Push the 0-th barycentric weight fj
push.8475095067679657783 push.0 ext2mul
#=> [fj1, fj0, v1, v0, u1, u0, q1, q0, p1, p0, zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...] where fj = wj/(zeta - xj)
### iv) add fj to p
movup.9 movup.9 dup.3 dup.3 ext2add movdn.9 movdn.9
#=> [fj1, fj0, v1, v0, u1, u0, q1, q0, p1', p0', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
### v) multiply fj by u and add to q
movup.5 movup.5 ext2mul movup.5 movup.5 ext2add movdn.3 movdn.3
#=> [v1, v0, q1', q0', p1', p0', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
## e) Compute the 1-st term
### i) Move up zeta to top
dup.7 dup.7
#=> [zeta1, zeta0, v1, v0, q1', q0', p1', p0', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
### ii) Push the 1-st evaluation point xj
swap push.1971462654193939361 neg add swap
#=> [zeta1, zeta0 - xj0, v1, v0, q1', q0', p1', p0', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
ext2inv
#=> [den1, den0, v1, v0, q1', q0', p1', p0', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...] where (den0, den1) = (zeta0, zeta1) - (xj, 0)
### iii) Push the 1-st barycentric weight wj
push.1813978751105035889 push.0 ext2mul
#=> [fj1, fj0, v1, v0, q1', q0', p1', p0', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...] where fj = wj/(zeta - xj)
### iv) Add fj to p
movup.7 movup.7 dup.3 dup.3 ext2add movdn.7 movdn.7
#=> [fj1, fj0, v1, v0, q1', q0', p1'', p0'', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
### v) multiply fj by v and add to q
ext2mul ext2add
#=> [q1'', q0'', p1'', p0'', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
### vi) update pointer
# Repeat c) d) and e) 10 times and finish with c) and d) each time with the appropriate set of constants
#=> [q1'', q0'', p1'', p0'', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
# 4) Compute the barycentric evaluation i.e., p / q
movup.3 movup.3 ext2inv ext2mul
#=> [res_bar1, res_bar0, zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
# 5) Assert equality
movup.6 assert_eq movup.5 assert_eq
# 6) Clean up the stack
dropw
#=> []
end
The above uses the fact that, for the barycentric evaluation formula, we can factor out the OOD point z and thus make the procedure, for a given trace length, parametric only over z. The following code illustrates this idea.
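As a rough illustration of the factorization (plain Python, not MASM; the nodes $x_j$ and weights $w_j$ stand in for the hardcoded constants, and all values are arbitrary): since $q(y) := p(z \cdot y)$ interpolates $(x_j, v_j)$ whenever $p$ interpolates $(z \cdot x_j, v_j)$, evaluating $p$ at $t$ reduces to a barycentric evaluation of $q$ at $\zeta = t/z$ over fixed nodes:

```python
# Evaluating the interpolant p of (z*x_j, v_j) at t via the second
# barycentric formula over the FIXED nodes x_j: q(y) := p(z*y)
# interpolates (x_j, v_j), so p(t) = q(t/z), and the nodes x_j and
# weights w_j = 1/prod_{m != j}(x_j - x_m) are independent of z.
P = 2**64 - 2**32 + 1  # Goldilocks prime
inv = lambda v: pow(v, P - 2, P)
import random
random.seed(2)

xs = [1, 2, 3, 4, 5]                  # fixed nodes x_j (hardcodable)
ws = []                               # barycentric weights (hardcodable)
for j, xj in enumerate(xs):
    d = 1
    for m, xm in enumerate(xs):
        if m != j:
            d = d * (xj - xm) % P
    ws.append(inv(d))

vs = [random.randrange(P) for _ in xs]   # values at the points z*x_j
z, t = 97, 12345
zeta = t * inv(z) % P                    # evaluate q at zeta = t/z
num = sum(w * inv(zeta - xj) % P * v for w, xj, v in zip(ws, xs, vs)) % P
den = sum(w * inv(zeta - xj) % P for w, xj in zip(ws, xs)) % P
p_t = num * inv(den) % P

def lagrange_eval(pts, at):
    """Direct Lagrange evaluation, as a cross-check."""
    total = 0
    for i, (xi, yi) in enumerate(pts):
        li = yi
        for j, (xj, _) in enumerate(pts):
            if j != i:
                li = li * (at - xj) % P * inv(xi - xj) % P
        total = (total + li) % P
    return total

assert p_t == lagrange_eval([(z * xj % P, v) for xj, v in zip(xs, vs)], t)
```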
One other point that was worth investigating relates to doing away with RCOMB2
. The motivation behind this is the potential simplifications to the auxiliary trace that might bring the number of auxiliary columns to just 2 with one of these columns having OOD evaluations at only $z$ and $g\cdot z$.
The following is an attempt at computing the coefficients of the polynomial corresponding to the auxiliary column that is being opened at $z$ and $g\cdot z$. This method is then used (see below) in computing the terms in the random linear combination associated with this column.
#! Compute the slope and offset of the line for a typical column in auxiliary trace
#!
#! we know:
#! 1. the pointer to OOD frame containing `t_i(z)` and `t_i(gz)` `ood_trace_ptr`
#! 2. the pointer to the current trace row `current_trace_row_ptr`
#! 3. the pointer to the OOD point `z` `z_ptr`
#! 4. the pointer to the trace domain generator `g` `trace_domain_gen_ptr`
#!
#! The equation of the line is: y = a.x + b with:
#!
#! 1. a = (a0, a1) := (t_i(z) - t_i(gz)) / (z - gz)
#! 2. b = (b0, b1) := (gt(z) - t(gz)) / (g - 1)
#!
#! Input: [offset, ...]
#! Output: [a1, a0, b1, b0, ...]
#!
#! Cycles: 64
export.compute_slope_bias
# 1) Load OOD evaluations
padw
movup.4 push.ood_trace_ptr add
mem_loadw
#=> [T11, T10, T01, T00, ...] where (T00, T01) = t_i(z) and (T10, T11) = t_i(gz)
# 2) Compute b
# Load g
push.trace_domain_gen_ptr
#=> [g, T11, T10, T01, T00, ...]
# Compute numerator of b
dup.4 dup.1 mul
#=> [g.T00, g, T11, T10, T01, T00, ...]
dup.4 dup.2 mul
#=> [g.T01, g.T00, g, T11, T10, T01, T00, ...]
dup.4 neg dup.4 neg
#=> [-T11, -T10, g.T01, g.T00, g, T11, T10, T01, T00, ...]
ext2add
#=> [num1, num0, g, T11, T10, T01, T00, ...] where num = (num0, num1) := gt(z) - t(gz)
movup.2 sub.1 inv mul
#=> [b1, b0, T11, T10, T01, T00, ...] where b = (b0, b1) := (gt(z) - t(gz)) / (g - 1)
# 3) Compute a
# Compute gz - z
padw push.Z_PTR
#=> [gz1, gz0, z1, z0, b1, b0, T11, T10, T01, T00, ...]
ext2sub
#=> [a_deno_1, a_deno_0, b1, b0, T11, T10, T01, T00, ...] where (a_deno_0, a_deno_1) := z - gz
# Compute t_i(z) - t_i(gz)
swapw ext2sub
#=> [a_numer_1, a_numer_0, a_deno_1, a_deno_0, b1, b0, ...] where (a_numer_0, a_numer_1) := t_i(z) - t_i(gz)
movup.3 movup.3 ext2inv ext2mul
#=> [a1, a0, b1, b0, ...] where (a0, a1) := (t_i(z) - t_i(gz)) / (z - gz)
end
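The two closed forms in the comment can be sanity-checked in plain Python (base field, arbitrary values; g stands in for the trace-domain generator):

```python
# Check the closed forms for the line y = a*x + b through (z, t(z))
# and (g*z, t(g*z)):
#   a = (t(z) - t(gz)) / (z - gz),  b = (g*t(z) - t(gz)) / (g - 1)
P = 2**64 - 2**32 + 1  # Goldilocks prime
inv = lambda v: pow(v, P - 2, P)

z, g = 1234567, 7        # g stands in for the trace-domain generator
tz, tgz = 111, 222       # t(z) and t(g*z), arbitrary
gz = g * z % P
a = (tz - tgz) * inv(z - gz) % P
b = (g * tz - tgz) % P * inv(g - 1) % P
assert (a * z + b) % P == tz     # line passes through (z, t(z))
assert (a * gz + b) % P == tgz   # line passes through (g*z, t(g*z))
```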
In addition to the terms related to the aforementioned 2 auxiliary trace columns, the terms related to the columns of the constraint composition polynomial are also currently computed using RCOMB2
. The following is a proposal for computing these terms without the helper instruction:
#! Computes the random linear combination of the terms related to the constraint composition polynomials.
#!
#! Input: [x, x_addr, c_addr, a_addr, ...]
#! Output: [x, final_result1, final_result0, ...]
#!
#! Cycles: ~340
export.rand_combine_constraint
movdn.3
#=> [x_addr, c_addr, a_addr, x, ...]
push.0 push.0
#=> [acc1, acc0, x_addr, c_addr, a_addr, x, ...]
repeat.4
padw dup.6 add.1 swap.7 mem_loadw
#=> [t11, t10, t01, t00, acc1, acc0, x_addr, c_addr, a_addr, x, ...] where (t00, t01) = C_i(x) and (t10, t11) = C_{i+1}(x)
padw dup.11 add.1 swap.12 mem_loadw
#=> [w11, w10, w01, w00, t11, t10, t01, t00, acc1, acc0, x_addr, c_addr, a_addr, x, ...] where (w00, w01) = C_i(z) and (w10, w11) = C_{i+1}(z)
movup.7 movup.7 movup.5 movup.5
#=> [w01, w00, t01, t00, w11, w10, t11, t10, acc1, acc0, x_addr, c_addr, a_addr, x, ...]
ext2sub
#=> [u1, u0, w11, w10, t11, t10, acc1, acc0, x_addr, c_addr, a_addr, x, ...] where (u0, u1) = C_i(x) - C_i(z)
movup.5 movup.5 movup.5 movup.5
#=> [w11, w10, t11, t10, u1, u0, acc1, acc0, x_addr, c_addr, a_addr, x, ...]
ext2sub
#=> [v1, v0, u1, u0, acc1, acc0, x_addr, c_addr, a_addr, x, ...] where (v0, v1) = C_{i+1}(x) - C_{i+1}(z)
padw dup.12 add.1 swap.13 mem_loadw
#=> [b1, b0, a1, a0, v1, v0, u1, u0, acc1, acc0, x_addr, c_addr, a_addr, x, ...] where (a0, a1) = alpha_i and (b0, b1) = alpha_{i+1}
movup.5 movup.5 ext2mul
#=> [(v*b)1, (v*b)0, a1, a0, u1, u0, acc1, acc0, x_addr, c_addr, a_addr, x, ...]
movdn.5 movdn.5
#=> [a1, a0, u1, u0, (v*b)1, (v*b)0, acc1, acc0, x_addr, c_addr, a_addr, x, ...]
ext2mul
#=> [(a*u)1, (a*u)0, (v*b)1, (v*b)0, acc1, acc0, x_addr, c_addr, a_addr, x, ...]
ext2add ext2add
#=> [acc1', acc0', x_addr, c_addr, a_addr, x, ...]
end
#=> [acc1', acc0', x_addr, c_addr, a_addr, x, ...]
movdn.4 movdn.4
#=> [x_addr, c_addr, a_addr, acc1', acc0', x, ...]
push.0 push.Z_PTR mem_loadw drop drop
#=> [z1, z0, acc1', acc0', x, ...]
neg swap neg dup.4 add
#=> [-z1, x-z0, acc1', acc0', x, ...]
ext2inv ext2mul movup.2
#=> [final_result1, final_result0, x, ...]
end
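For reference, the accumulation performed by the loop can be modeled in plain Python (base field instead of the quadratic extension, arbitrary values) as:

```python
# Model of the accumulation: acc = sum_i alpha_i * (C_i(x) - C_i(z)),
# processed two columns per iteration as in the repeat.4 loop, then
# divided by (x - z) at the end.
P = 2**64 - 2**32 + 1  # Goldilocks prime
inv = lambda v: pow(v, P - 2, P)
import random
random.seed(3)

k = 8                                          # 8 columns -> 4 loop iterations
Cx = [random.randrange(P) for _ in range(k)]   # C_i(x)
Cz = [random.randrange(P) for _ in range(k)]   # C_i(z)
al = [random.randrange(P) for _ in range(k)]   # alpha_i

x, z = 1001, 57
acc = 0
for i in range(0, k, 2):                   # two columns per iteration
    u = (Cx[i] - Cz[i]) % P                # C_i(x) - C_i(z)
    v = (Cx[i + 1] - Cz[i + 1]) % P        # C_{i+1}(x) - C_{i+1}(z)
    acc = (acc + al[i] * u + al[i + 1] * v) % P
result = acc * inv(x - z) % P

expected = sum(a * (tx - tz) for a, tx, tz in zip(al, Cx, Cz)) * inv(x - z) % P
assert result == expected
```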
In the case of the so-called Lagrange kernel, and assuming that we have already loaded and checked the coefficients of the polynomial $p_k$ for this column, the following is a proposal for how to combine the terms related to the first auxiliary column (opened at $z$ and $g\cdot z$) and the second auxiliary column (i.e., the Lagrange kernel column, opened at $z$, $g\cdot z$, $g^2\cdot z$, ..., $g^{\log_2(n) - 1}\cdot z$):
#! Input: [x, x_addr, c_addr, a_addr, ...]
#! Output: [res1, res0, ...]
#! where (res0, res1) = alpha_k * (t_k(x) - (a * x + b)) / ((x - z)*(x - gz))
#! + alpha_{k+1} * (t_{k+1}(x) - (sum_0^{nu + 1} c_i * x^i))/((x - z)*...*(x - zg^(2^(nu-1))))
#!
#! Cycles: 210 + log(trace_len) * 26
export.rand_combine_aux_multi_openings.2
# 1) Store x and the pointers (4 cycles)
loc_storew.0
# 2) Compute alpha_k *(t_k(x) - (a * x + b)) (64 cycles)
## a) setup the stack
### i) `ctr`: The number of coefficients is 4 since horner_eval works with batches of 4 coefficients
### and the next multiple of 4 from 2 is 4. `ctr` is not used until the second aux column as the number
### of evaluation points is dynamic and equal to log(trace_length)
push.TRACE_LENGTH_LOG_PTR mem_load push.4 add neg
#=> [ctr, x, x_addr, c_addr, a_addr, ...]
### ii) `c_addr`
movup.3
#=> [c_addr, ctr, x, x_addr, a_addr, ...]
### iii) `acc`
push.0.0
#=> [0, 0, c_addr, ctr, x, x_addr, a_addr, ...]
### iv) `x`
movup.4
#=> [x, 0, 0, c_addr, ctr, x_addr, a_addr, ...]
### v) prepare for `mstream`
movup.6 movup.6 push.0.0 padw
#=> [Y, Y, x, 0, 0, c_addr, ctr, ...] where Y are values that can be overwritten by mstream
## b) compute a * x + b
mstream horner_eval drop
#=> [Y, Y, x, acc1, acc0, c_addr + 2, ctr + 4, ...] where Y are values that can be overwritten by mstream
## c) compute t_k(x) - (a * x + b)
loc_loadw.0
#=> [x, x_addr, c_addr, a_addr, Y, x, acc1, acc0, c_addr + 2, ctr + 4, ...]
### i) load t_k(x)
push.0.0 movup.3 mem_loadw drop drop
#=> [t_k(x)1, t_k(x)0, a_addr, Y, x, acc1, acc0, c_addr + 2, ctr + 4, ...]
### ii) reset acc and move current acc value to top of stack
push.0 swap.7 push.0 swap.7
#=> [acc1, acc0, t_k(x)1, t_k(x)0, a_addr, Y, x, 0, 0, c_addr + 2, ctr + 4, ...]
### iii) perform subtraction
ext2sub
#=> [res1, res0, a_addr, Y, x, 0, 0, c_addr + 2, ctr + 4, ...] where (res0, res1) = t_k(x) - (a * x + b)
## d) multiply by alpha_k
### i) load alpha_k
padw movup.6 mem_loadw push.0.0 swapw
#=> [alpha_k1, alpha_k0, res1, res0, 0, 0, y, y, Y, x, 0, 0, c_addr + 2, ctr + 4, ...]
### ii) multiply and store result
ext2mul loc_storew.1
#=> [out_k1, out_k0, 0, 0, y, y, Y, x, 0, 0, c_addr + 2, ctr + 4, ...] where (out_k0, out_k1) = alpha_k *(t_k(x) - (a * x + b))
### iii) clean up stack for next term in random linear combination
drop drop
#=> [Y, Y, x, 0, 0, c_addr + 2, ctr + 4, ...]
# 3) Compute alpha_{k+1} * (t_{k+1}(x) - (sum_0^{nu - 1} c_i * x^i)) (45 + log(trace_len)/4)
## a) compute sum_0^{nu + 1} c_i * x^i
push.1
while.true
mstream horner_eval
end
#=> [Y, Y, x, acc1, acc0, c_addr + nu, 0, ...]
## b) compute t_{k+1}(x) - (sum_0^{nu + 1} c_i * x^i)
### i) get pointer to t_{k+1}(x) and store c_addr + nu
dropw loc_loadw.0
#=> [x, x_addr, c_addr, a_addr, x, acc1, acc0, c_addr + nu, 0, ...]
movup.7 swap.3 movdn.7 loc_storew.0
#=> [x, x_addr, c_addr + nu, a_addr, x, acc1, acc0, c_addr, 0, ...]
### ii) load t_k(x)
push.0.0 movup.3 mem_loadw
#=> [t_{k+1}(x)1, t_{k+1}(x)0, y, y, a_addr, x, acc1, acc0, c_addr, 0, ...]
### iii) subtract
movup.7 movup.7
#=> [acc1, acc0, t_{k+1}(x)1, t_{k+1}(x)0, y, y, a_addr, x, c_addr, 0, ...]
ext2sub
#=> [res1, res0, y, y, a_addr, x, c_addr, 0, ...] where (res0, res1) = t_{k+1}(x) - (sum_0^{nu + 1} c_i * x^i)
## c) compute alpha_{k+1} * res
### i) load alpha
swapw push.0 swap mem_loadw
#=> [alpha1, alpha0, y, y, res1, res0, y, y, ...] where (res0, res1) = t_{k+1}(x) - (sum_0^{nu + 1} c_i * x^i)
### ii) multiply
movup.5 movup.5 ext2mul
#=> [outkk1, outkk0, y, y, y, y, ...] where (outkk0, outkk1) = alpha_{k+1} *(t_{k+1}(x) - (sum_0^{nu + 1} c_i * x^i))
# 4) Compute prod_1^{nu - 1} (x - zg^(2^i)) (18 cycles + 26 * log(trace_len))
## a) load g (the trace domain generator) and compute g^2
push.trace_domain_gen_ptr mem_load dup mul
#=> [g^2, outkk1, outkk0, y, y, y, y, ...]
## b) load x and z
loc_load.0 swapw push.Z_PTR drop drop
#=> [z1, z0, x, g^2, outkk1, outkk0, ...]
## c) compute the counter for the while loop (TODO: make it dynamic)
## we have points g^(2^i) for i=1..(nu - 1) where nu = log(n). Taking nu=21, implies ctr is 20.
## TODO: make sure the nu - 1 is a multiple of 2
push.0 push.TRACE_LENGTH_LOG_PTR sub movdn.4
#=> [z1, z0, x, g^2, ctr, outkk1, outkk0, ...]
## d) create product accumulator
push.1 push.0
#=> [0, 1, z1, z0, x, g^2, ctr, outkk1, outkk0, ...]
## e) compute the product
push.1
while.true
#=> [0, 1, z1, z0, x, g^2, ctr]
dup.5 dup mul swap.6
#=> [g^2, 0, 1, z1, z0, x, g^4, ctr]
dup.4 dup.1 mul
#=> [z0g^2, g^2, 0, 1, z1, z0, x, g^4, ctr]
neg dup.6 add
#=> [x-z0g^2, g^2, 0, 1, z1, z0, x, g^4, ctr]
dup.4 movup.2 mul
#=> [z1g^2, x-z0g^2, 0, 1, z1, z0, x, g^4, ctr]
neg
#=> [-z1g^2, x-z0g^2, 0, 1, z1, z0, x, g^4, ctr]
ext2mul
#=> [acc1, acc0, z1, z0, x, g^4, ctr]
dup.5 add.1 swap.6 neq.0
end
#=> [acc1, acc0, z1, z0, x, g^(2^(nu-1)), 0, outkk1, outkk0, ...]
# 5) Compute (v0, v1) := alpha_{k+1} * (t_{k+1}(x) - (sum_0^{nu + 1} c_i * x^i))/((x - zg^2)*...*(x - zg^(2^(nu-1)))) (15 cycles)
ext2inv movup.8 movup.8 ext2mul
#=> [v1, v0, z1, z0, x, g^(2^(nu-1)), 0, ...]
# 6) Compute (x - z) * (x - gz) (26 cycles)
## a) compute x - z
dup.4 dup.4 sub
#=> [x - z0, v1, v0, z1, z0, x, g^(2^(nu-1)), 0, ...]
dup.3 neg
#=> [-z1, x - z0, v1, v0, z1, z0, x, g^(2^(nu-1)), 0, ...]
## b) compute x - gz
dup.6 dup.6 mem_load.trace_domain_gen_ptr mul sub
#=> [x - z0g, -z1, x - z0, v1, v0, z1, z0, x, g^(2^(nu-1)), 0, ...]
dup.5 mem_load.trace_domain_gen_ptr mul neg
#=> [-gz1, x - gz0, -z1, x - z0, v1, v0, z1, z0, x, g^(2^(nu-1)), 0, ...]
## c) compute product
ext2mul ext2inv
#=> [tmp1, tmp0, v1, v0, z1, z0, x, g^(2^(nu-1)), 0, ...] where (tmp0, tmp1) = ((x - z) * (x - gz))^(-1)
# 7) Compute the final result (34 cycles)
## a) load (out_k0, out_k1) = alpha_k *(t_k(x) - (a * x + b))
swapw loc_loadw.1
#=> [out_k1, out_k0, 0, 0, tmp1, tmp0, v1, v0, 0, ...]
## b) add terms before dividing by common divisor
movup.7 movup.7 ext2add
#=> [numr1, numr0, 0, 0, tmp1, tmp0, 0, ...]
## c) compute final result
movup.5 movup.5 ext2mul
#=> [final_res1, final_res0, 0, 0, 0, ...]
## d) load pointers, update and return
movdn.4 movdn.4 push.0 loc_loadw.0
#=> [x, x_addr, c_addr, a_addr, final_res1, final_res0, ...]
swap add.1 swap
#=> [x, x_addr + 1, c_addr, a_addr, final_res1, final_res0, ...]
swap.3 add.1 swap.3
#=> [x, x_addr + 1, c_addr, a_addr + 1, final_res1, final_res0, ...]
end
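The while loop in step 4 above computes $\prod_{i=1}^{\nu - 1}\left(x - z\cdot g^{2^i}\right)$ by repeatedly squaring the generator. As a sanity check, here is a hedged base-field Python sketch of the same recurrence (the toy modulus and all values are arbitrary illustrations; the MASM code works over the quadratic extension field):

```python
# Sketch of step 4: acc = prod_{i=1}^{nu-1} (x - z * g^(2^i)) mod p.
# A running square of g plays the role of the `dup.5 dup mul swap.6` step.
P = 2**31 - 1  # toy prime modulus (Miden itself uses 2^64 - 2^32 + 1)

def denominator_product(x, z, g, nu, p=P):
    acc = 1
    g_pow = g * g % p              # start at g^(2^1)
    for _ in range(1, nu):         # i = 1 .. nu - 1
        acc = acc * (x - z * g_pow) % p
        g_pow = g_pow * g_pow % p  # g^(2^i) -> g^(2^(i+1))
    return acc

# Cross-check against the direct formula.
x, z, g, nu = 12345, 6789, 7, 8
direct = 1
for i in range(1, nu):
    direct = direct * (x - z * pow(g, 2**i, P)) % P
assert denominator_product(x, z, g, nu) == direct
```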
Putting everything together, and assuming the non-deterministic computation of the coefficients of the polynomial associated to the Lagrange kernel happens before the call to `compute_deep_composition_polynomial_queries`, the procedure can be implemented as follows:
#! Compute the DEEP composition polynomial FRI queries.
#!
#! Input: [query_ptr, ...]
#! Output: [...]
#! Cycles: 6 + num_queries * (1000 + 26 * log(trace_len))
export.compute_deep_composition_polynomial_queries.1
exec.constants::fri_com_ptr
dup.1
#=>[query_ptr, query_end_ptr, ...]
push.1
while.true
# I)
#
# Load the (main, aux, constraint)-traces rows associated with the current query and get
# the index of the query.
#
# Cycles: 200
exec.load_query_row
#=>[index, query_ptr, query_end_ptr, ...]
# II)
#
# Compute x := offset * domain_gen^index and denominators (x - z) and (x - gz)
#
# Cycles: 71
exec.compute_denominators
loc_storew.0
#=> [Z, x, index, query_ptr, query_end_ptr, ...] where Z := [-gz1, x - gz0, -z1, x - z0]
# III)
#
# Prepare to compute the sum \sum_{i=0}^k{\left(\alpha_i \cdot \frac{T_i(x) - T_i(z)}{x - z}
# + \alpha_i \cdot \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g}
# We can factorize (x - z) and (x - gz) and divide the two sums only once and at the end.
# The two sums are stored in [Acc3, Acc2] and [Acc1, Acc0] respectively.
## a) Push pointers
##
## Cycles: 4
push.0
exec.constants::deep_rand_coef_ptr
exec.constants::ood_trace_ptr
exec.constants::current_trace_row_ptr
#=> [P, Z, x, index, query_ptr, query_end_ptr, ...]
# where P := [CURRENT_TRACE_ROW_PTR, OOD_TRACE_PTR, DEEP_RAND_CC_PTR, 0]
## b) Push the accumulators
##
## Cycles: 4
padw
#=> [Acc, P, Z, x, index, query_ptr, query_end_ptr, ...]
#=> where Acc =: [Acc3, Acc2, Acc1, Acc0]
## c) This will be used to mstream the elements T_i(x)
##
## Cycles: 11
padw movupw.3
#=> [Z, Y, Acc, P, x, index, query_ptr, query_end_ptr, ...]
## d) Compute the random linear combination for main trace columns
##
## Cycles: 81
exec.combine_main_trace_columns
#=> [Y, Y, Acc, P, x, index, query_ptr, query_end_ptr, ...]
## e) Compute the random linear combination for aux trace columns
##
## Cycles: 225 + log(trace_len) * 26
dropw dropw swapw swap drop push.COEFF_PTR swap movup.8
#=> [x, P, Acc, index, query_ptr, query_end_ptr, ...]
# where P := [CURRENT_TRACE_ROW_PTR, COEFF_PTR, DEEP_RAND_CC_PTR, 0]
exec.rand_combine_aux_multi_openings
#=> [x, P, aux_res1, aux_res0, Acc, index, query_ptr, query_end_ptr, ...]
## f) Compute the random linear combination for constraint composition columns and add to
## the term related to auxiliary trace columns.
##
## Cycles: 340
exec.rand_combine_constraint
#=> [x, const_comp_res1, const_comp_res0, aux_res1, aux_res0, Acc, index, query_ptr, query_end_ptr, ...]
movdn.8
ext2add
#=> [tmp_res1, tmp_res0, Acc, x, index, query_ptr, query_end_ptr, ...]
## g) Load the denominators (x - z) and (x - gz)
## Cycles: 10
movdn.5 movdn.5
padw loc_loadw swapw
#=> [Acc, Z, tmp_res1, tmp_res0, x, index, query_ptr, query_end_ptr, ...]
## h) Divide by denominators and sum to get final result
##
## Cycles: 38
exec.divide_by_denominators_and_sum
#=> [eval1, eval0, tmp_res1, tmp_res0, x, index, query_ptr, query_end_ptr, ...]
## i) Compute the final result
##
## Cycles: 5
ext2add
#=> [res1, res0, x, index, query_ptr, query_end_ptr, ...]
# IV)
#
# Store [poe, index, eval_1, eval_0] where poe := g^index = x / offset and prepare stack
# for next iteration.
## a) Compute poe
##
## Cycles: 4
movup.3 movup.3
exec.constants::domain_offset_inv mul
#=> [poe, index, eval1, eval0, query_ptr, query_end_ptr, ...]
## b) Store [eval0, eval1, index, poe]
##
## Cycles: 5
dup.4 add.1 swap.5
mem_storew
#=> [poe, index, eval1, eval0, query_ptr+1, query_end_ptr, ...]
## c) Prepare stack for next iteration
##
## Cycles: 8
dropw
dup.1 dup.1
neq
#=> [?, query_ptr+1, query_end_ptr, ...]
end
drop drop
end
To give a concrete example of the increase in costs observed using the above implementation, compare the cost of `compute_deep_composition_polynomial_queries` on its own with the cost of `compute_deep_composition_polynomial_queries` + `compute_slope_bias` + `check_transformation_21`.
Again, in the second bullet point, we do away with `RCOMB2` but introduce a 1-cycle instruction to do Horner evaluations.
Another potential thing that should be investigated is whether we can use `RCOMB2` (or a slight modification of it) instead of `RCOMB1` and do away with `RCOMB1` instead. This might incur some additional cost for main trace columns, but that might be justified by the savings we might get in relation to auxiliary and constraint composition columns (TBD).
The following is an attempt at improving the above and it relies on two ingredients:

- `RCOMB2` can be modified in order to be useful not only for terms coming from the auxiliary trace but also for the ones associated to the constraint composition polynomial.
- The denominator product appearing in the Lagrange kernel term can be computed more cheaply using Horner evaluation (described further below).
The `RCOMB2` instruction currently does the following:
$$(r, p) = (r, p) + \left(\alpha_i\cdot\left({T_i(x) - T_i(z)}\right), \alpha_i\cdot\left({T_i(x) - T_i(z \cdot g)} \right)\right)$$
Using `RCOMB2` for constraint composition columns is currently not straightforward and entails a non-negligible number of stack manipulations.
In order to make `RCOMB2` work for constraint composition columns without friction, we can update the instruction to compute instead
$$(r, p) = (r, p) + \left(\alpha_i\cdot\left({T_i(x) - T_i(z)}\right), (1 - b)\cdot\alpha_i\cdot\left({T_i(x) - T_i(z \cdot g)} \right)\right)$$
where $b$ is a binary flag that is set to $0$ for auxiliary trace columns and is set to $1$ for constraint composition columns.
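As a hedged illustration of the update rule above, here is a base-field Python sketch (the actual instruction operates on extension-field elements; the modulus and values below are arbitrary):

```python
# Sketch of the modified RCOMB2 update over a toy prime field.
# b = 0: auxiliary-trace column (both accumulators are updated).
# b = 1: constraint-composition column (second term is zeroed out).
P = 2**31 - 1  # toy modulus; the real instruction works over an extension field

def rcomb2(r, p, t_x, t_z, t_zg, alpha, b, modulus=P):
    r_new = (r + alpha * (t_x - t_z)) % modulus
    p_new = (p + (1 - b) * alpha * (t_x - t_zg)) % modulus
    return r_new, p_new

# With b = 1 the second accumulator is left untouched.
r1, p1 = rcomb2(0, 0, t_x=10, t_z=3, t_zg=4, alpha=5, b=1)
assert (r1, p1) == (35, 0)
```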
The price to pay for this adaptation is an increase in the degree of the constraint to $\geq 3$.

In the above code, we computed $$(x - z)\cdot\prod_{i=0}^{\log(n) - 1}\left(x - z\cdot g^{2^i} \right)$$
This required a `while` loop, and it was the main bottleneck when computing the random linear combination for a given $x$.
Here is how we can improve on that solution:
The above expression can be written as follows: $$z^{\log(n) + 1}(\frac{x}{z} - 1)\cdot\prod_{i=0}^{\log(n) - 1}\left(\frac{x}{z} - g^{2^i} \right)$$ which can be written more compactly as $$z^{\log(n) + 1}m\left(\frac{x}{z} \right)$$
where $$m(x) = (x - 1)\cdot\prod_{i=0}^{\log(n) - 1}\left(x - g^{2^i} \right)$$ Note that $m$ is a polynomial of degree $\log(n) + 1$ and thus can be written as $m(x) = \sum_{i=0}^{\log(n) + 1} c_i \cdot x^i$. Importantly, its coefficients are independent of $z$. This suggests the following procedure to compute our denominator:
- Load the coefficients $c_i$ of $m$ into a designated region of memory using `adv_pipe`. At the same time, compute the hash of these coefficients and compare this hash with the expected hash for the given $n$. This is done only once.

In conclusion, at the cost of a slight modification to `RCOMB1` and using a new, but versatile, instruction to do Horner evaluations, we can implement `compute_deep_composition_polynomial_queries` in approximately 610 cycles per FRI query, in addition to some modest computation that is query independent.
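The factorization underlying this procedure can be sanity-checked numerically. A hedged Python sketch over a toy prime field (values are arbitrary), verifying that $(x - z)\prod_{i=0}^{\nu - 1}\left(x - z g^{2^i}\right) = z^{\nu + 1}\, m(x/z)$:

```python
# Check: (x - z) * prod_{i=0}^{nu-1} (x - z*g^(2^i)) == z^(nu+1) * m(x/z),
# where m(t) = (t - 1) * prod_{i=0}^{nu-1} (t - g^(2^i)).
P = 2**31 - 1  # toy prime field; values below are arbitrary
x, z, g, nu = 11111, 2222, 3, 5

lhs = (x - z) % P
for i in range(nu):
    lhs = lhs * (x - z * pow(g, 2**i, P)) % P

t = x * pow(z, P - 2, P) % P  # x / z in the field
m = (t - 1) % P
for i in range(nu):
    m = m * (t - pow(g, 2**i, P)) % P
rhs = pow(z, nu + 1, P) * m % P

assert lhs == rhs  # the identity holds for any invertible z
```

The identity holds because $z^{\nu+1}$ distributes one factor of $z$ onto each of the $\nu + 1$ factors of $m(x/z)$.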
Here is a proposal for how to adapt `compute_deep_composition_polynomial_queries` in light of the most recent proposal:
#! Compute the DEEP composition polynomial FRI queries.
#!
#! Input: [query_ptr, ...]
#! Output: [...]
#! Cycles: 6 + num_queries * 610
export.compute_deep_composition_polynomial_queries
exec.constants::fri_com_ptr
dup.1
#=>[query_ptr, query_end_ptr, ...]
push.1
while.true
# I)
#
# Load the (main, aux, constraint)-traces rows associated with the current query and get
# the index of the query.
#
# Cycles: 200
exec.load_query_row
#=>[index, query_ptr, query_end_ptr, ...]
# II)
#
# Compute x := offset * domain_gen^index and denominators (x - z) and (x - gz)
#
# Cycles: 71
exec.compute_denominators
push.Z_MINUS_X_PTR mem_storew
#=> [Z, x, index, query_ptr, query_end_ptr, ...] where Z := [-gz1, x - gz0, -z1, x - z0]
# III)
#
# Prepare to compute the sum \sum_{i=0}^k{\left(\alpha_i \cdot \frac{T_i(x) - T_i(z)}{x - z}
# + \alpha_i \cdot \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g}
# We can factorize (x - z) and (x - gz) and divide the two sums only once and at the end.
# The two sums are stored in [Acc3, Acc2] and [Acc1, Acc0] respectively.
## a) Push pointers
##
## Cycles: 4
dup.4
exec.constants::deep_rand_coef_ptr
exec.constants::ood_trace_ptr
exec.constants::current_trace_row_ptr
#=> [P, Z, x, index, query_ptr, query_end_ptr, ...]
# where P := [CURRENT_TRACE_ROW_PTR, OOD_TRACE_PTR, DEEP_RAND_CC_PTR, x]
## b) Push the accumulators
##
## Cycles: 4
padw
#=> [Acc, P, Z, x, index, query_ptr, query_end_ptr, ...]
#=> where Acc =: [Acc3, Acc2, Acc1, Acc0]
## c) This will be used to mstream the elements T_i(x)
##
## Cycles: 11
padw movupw.3
#=> [Z, Y, Acc, P, x, index, query_ptr, query_end_ptr, ...]
## d) Compute the random linear combination for main trace columns
##
## Cycles: 81
exec.combine_main_trace_columns
#=> [Y, Y, Acc, P, x, index, query_ptr, query_end_ptr, ...]
## e) Compute the random linear combination for aux trace columns
##
## Cycles: 150
exec.combine_aux_trace_columns
#=> [Y, Y, Acc, P, x, index, query_ptr, query_end_ptr, ...]
## f) Constraint composition polys terms
##
## Cycles: 10
repeat.2
mstream
repeat.4
exec.combine_aux
end
end
#=> [Y, Y, Acc, P, x, index, query_ptr, query_end_ptr, ...]
## g) load k_th term
##
## Cycles: 3
push.K_TH_TERM_PTR mem_loadw swapw.3
#=> [P, Y, Acc, k_th_term1, k_th_term0, y, y, x, index, query_ptr, query_end_ptr, ...]
## h) load Z
##
## Cycles: 7
dropw
push.Z_MINUS_X_PTR mem_loadw
swapw
#=> [Acc, Z, k_th_term1, k_th_term0, y, y, x, index, query_ptr, query_end_ptr, ...]
## i) Divide by denominators and sum to get final result
##
## Cycles: 45
exec.divide_by_denominators_and_sum
#=> [eval_tmp1, eval_tmp0, k_th_term1, k_th_term0, y, y, x, index, query_ptr, query_end_ptr, ...]
ext2add
#=> [eval1, eval0, y, y, x, index, query_ptr, query_end_ptr, ...]
movup.3 drop
movup.3 drop
#=> [eval1, eval0, x, index, query_ptr, query_end_ptr, ...]
# IV)
#
# Store [poe, index, eval_1, eval_0] where poe := g^index = x / offset and prepare stack
# for next iteration.
## a) Compute poe
##
## Cycles: 4
movup.3 movup.3
exec.constants::domain_offset_inv mul
#=> [poe, index, eval1, eval0, query_ptr, query_end_ptr, ...]
## b) Store [eval0, eval1, index, poe]
##
## Cycles: 5
dup.4 add.1 swap.5
mem_storew
#=> [poe, index, eval1, eval0, query_ptr+1, query_end_ptr, ...]
## c) Prepare stack for next iteration
##
## Cycles: 8
dropw
dup.1 dup.1
neq
#=> [?, query_ptr+1, query_end_ptr, ...]
end
drop drop
end
# Input: [Y, Y, Acc, P, ...]
# Output: [Y, Y, Acc, P, ...]
# Cycles: 133 for 2^20 trace length and at most 150
export.combine_aux_trace_columns
mstream exec.random_combinate_2
#=> [Y, y, y, tk1, tk0, Acc, P, ...]
# where (tk0, tk1) is the value of the column which will be opened at >= 3 points
swapw drop drop
#=> [tk1, tk0, Y, Acc, P, ...]
### prepare stack for Horner evaluation of p_k at x
#### load the negative of the number of coefficients padded to the next multiple of 4
push.TRACE_LENGTH_PLUS_ONE_NEXT_MUL_4_LOG_PTR mem_load neg
push.P_K_COEF_PTR
#=> [c_addr, ctr, tk1, tk0, Y, Acc, P, ...]
#### push accumulator and set eval point
dup.15
push.0.0
swap.2
push.0
#=> [x1, x0, acc1, acc0, c_addr, ctr, tk1, tk0, Y, Acc, P, ...]
# where (x0, x1) = (x, 0)
#### pad for mem_stream
movupw.2 padw
#=> [Y, Y, x1, x0, acc1, acc0, c_addr, ctr, tk1, tk0, Acc, P, ...]
### evaluate
push.1
while.true
mem_stream exec.horner_eval
end
#=> [Y, Y, x1, x0, acc1, acc0, c_addr, ctr', tk1, tk0, Acc, P, ...]
### compute tk - p_k(x)
swapdw
#=> [x1, x0, acc1, acc0, c_addr, ctr', tk1, tk0, Y, Y, Acc, P, ...]
movup.7 neg movup.7 neg
#=> [-tk1, -tk0, x1, x0, acc1, acc0, c_addr, ctr', Y, Y, Acc, P, ...]
movup.5 movup.5
ext2add
#=> [numer1, numer0, x1, x0, c_addr, ctr', Y, Y, Acc, P, ...]
push.0.0
#=> [0, 0, numer1, numer0, x1, x0, c_addr, ctr', Y, Y, Acc, P, ...]
swapw
push.TRACE_LENGTH_PLUS_ONE_NEXT_MUL_4_LOG_PTR mem_load neg
swap.4 drop
#=> [x1, x0, c_addr, ctr, 0, 0, numer1, numer0, Y, Y, Acc, P, ...]
movup.5 movdn.2
movup.5 movdn.2
#=> [x1, x0, 0, 0, c_addr, ctr, numer1, numer0, Y, Y, Acc, P, ...]
### multiply x by z^(-1)
padw push.Z_INV_PTR mem_loadw drop drop
#=> [z_inv1, z_inv0, x1, x0, 0, 0, c_addr, ctr, numer1, numer0, Y, Y, Acc, P, ...]
ext2mul
#=> [x'1, x'0, 0, 0, c_addr, ctr, numer1, numer0, Y, Y, Acc, P, ...]
# where x' = x / z
### prepare stack for evaluating product
swapdw
#=> [Y, Y, x'1, x'0, 0, 0, c_addr, ctr, numer1, numer0, Acc, P, ...]
### evaluate
push.1
while.true
mem_stream exec.horner_eval
end
#=> [Y, Y, x'1, x'0, tmp_deno1, tmp_deno0, c_addr, ctr, numer1, numer0, Acc, P, ...]
swapdw
#=> [x'1, x'0, tmp_deno1, tmp_deno0, c_addr, ctr, numer1, numer0, Y, Y, Acc, P, ...]
movup.5 movup.5
#=> [c_addr, ctr, x'1, x'0, tmp_deno1, tmp_deno0, numer1, numer0, Y, Y, Acc, P, ...]
push.Z_POWER_LOG_TRACE_PLUS_1 mem_loadw drop drop
#=> [z_pow_1, z_pow_0, tmp_deno1, tmp_deno0, numer1, numer0, Y, Y, Acc, P, ...]
### compute denominator
ext2mul
#=> [deno1, deno0, numer1, numer0, Y, Y, Acc, P, ...]
### compute full term of k-th column
ext2div
#=> [k_th_term_tmp1, k_th_term_tmp0, Y, Y, Acc, P, ...]
### multiply with alpha_k
movup.3 drop movup.3 drop
#=> [k_th_term_tmp1, k_th_term_tmp0, y, y, Y, Acc, P, ...]
swapw
#=> [Y, k_th_term_tmp1, k_th_term_tmp0, y, y, Acc, P, ...]
swapw.3
#=> [P, k_th_term_tmp1, k_th_term_tmp0, y, y, Acc, Y, ...]
# where P := [CURRENT_TRACE_ROW_PTR, OOD_TRACE_PTR, DEEP_RAND_CC_PTR, 0]
### update pointers and set flag
add.1 movdn.3
push.TRACE_LENGTH_PLUS_ONE_NEXT_DIV_BY_TWO_LOG_PTR mem_load add movdn.3
add.1 movdn.3
drop push.1 movdn.3 # flag associated to constraint composition terms
#=> [P, k_th_term_tmp1, k_th_term_tmp0, y, y, Acc, Y, ...]
swapw.3
#=> [Y, k_th_term_tmp1, k_th_term_tmp0, y, y, Acc, P, ...]
### load alpha_k
dup.14 sub.1 mem_loadw
#=> [y, y, alpha_k1, alpha_k0, k_th_term_tmp1, k_th_term_tmp0, y, y, Acc, P, ...]
movdn.5 movdn.5
#=> [alpha_k1, alpha_k0, k_th_term_tmp1, k_th_term_tmp0, Y, Acc, P, ...]
ext2mul
#=> [k_th_term1, k_th_term0, Y, Acc, P, ...]
### store k_th term for later
push.K_TH_TERM_PTR mem_storew
#=> [k_th_term1, k_th_term0, Y, Acc, P, ...]
### setup stack to correct shape
push.0.0
#=> [Y, Y, Acc, P, ...]
end
Looks great! I didn't go through the procedures line-by-line - but the overall approach makes sense. To summarize:
The new expression for the DEEP composition polynomial looks something like this:
$$ Y(x) = \sum_{i=0}^k{\alpha_i \cdot \left(\frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g}\right)} + \beta \cdot \sum_{i=0}^{\log(n)}{\frac{T_l(x) - T_l(z_i)}{x - z_i}} + \sum_{i=0}^m{\gamma_i \cdot \frac{H_i(x) - H_i(z)}{x - z}} $$
where $T_l(x)$ is the Lagrange kernel column (in the auxiliary trace), and the $z_i$'s come from the following sequence:
$$ z, z \cdot g, z \cdot g^2, z \cdot g^4, \dots, z \cdot g^{2^{\log(n) - 1}} $$
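For concreteness, this sequence can be generated by repeated squaring of $g$; a small Python sketch with arbitrary toy values:

```python
# Opening points for the Lagrange kernel column: z, z*g, z*g^2, z*g^4, ...
# i.e. z followed by z * g^(2^i) for i = 0 .. log(n) - 1 (log(n) + 1 points).
P = 2**31 - 1  # toy prime modulus

def lagrange_opening_points(z, g, log_n, p=P):
    points = [z % p]
    g_pow = g % p                   # g^(2^0)
    for _ in range(log_n):
        points.append(z * g_pow % p)
        g_pow = g_pow * g_pow % p   # square: g^(2^i) -> g^(2^(i+1))
    return points

assert lagrange_opening_points(z=7, g=3, log_n=3) == [7, 21, 63, 567]
```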
Adding the term for the Lagrange kernel column adds complexity (the fact that the total number of auxiliary columns goes down to 2 doesn't help us much here). The cost of this complexity is:
- A new 1-cycle Horner evaluation instruction (e.g., `heval` or maybe `horneval`). But we'd also need this instruction to help with sumcheck evaluation.

I think it may be possible to reduce the cost per FRI query to $\approx 500$ by introducing one more specialized instruction (specifically for computing the Lagrange kernel column term), but this is an optimization we can do later.
As things stand now, the overall cost of computing DEEP composition polynomial goes up by slightly more than 4K cycles (assuming 28 FRI queries), which I think is acceptable.
A few questions:

- What would be the semantics of the `horneval` instruction? In the above procedures, we use them in `while` loops, and I'm wondering if we can replace that with `repeat` loops.
- I like the modification to the `rcomb2` instruction - but as you noted, the degree goes up by 1. What would be the overall degree now?

What would be the semantics of the `horneval` instruction? In the above procedures, we use them in `while` loops, and I'm wondering if we can replace that with `repeat` loops.
Here is the proposal described here:

#! +-------+-------+-------+-------+-------+-------+-------+-------+------+------+------+------+------+------+------+---+
#! |  C01  |  C00  |  C11  |  C10  |  C21  |  C20  |  C31  |  C30  |  x1  |  x0  | acc1 | acc0 |c_addr| ctr  |  -   | - |
#! +-------+-------+-------+-------+-------+-------+-------+-------+------+------+------+------+------+------+------+---+
#!
#!                                                       ||
#!                                                       \/
#!
#! +---+-------+-------+-------+-------+-------+-------+-------+-------+------+------+-------+-------+------+-----+-----+
#! | b |  C01  |  C00  |  C11  |  C10  |  C21  |  C20  |  C31  |  C30  |  x1  |  x0  | acc1' | acc0' |c_addr| ctr'|  -  |
#! +---+-------+-------+-------+-------+-------+-------+-------+-------+------+------+-------+-------+------+-----+-----+
#!
#! where:
#!
#! 1. acc' = (acc1', acc0') := ((((acc * x + c3) * x + c2) * x + c1) * x) + c0
#! 2. ctr' = ctr + 4
#! 3. b = 0 if ctr' == 0 else 1
#!
#! Cycles: 1
export.horner_eval
end
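In base-field Python terms (the instruction itself works with extension-field limb pairs), one `horner_eval` step absorbs four coefficients, highest order first, into the running accumulator:

```python
# Sketch of one horner_eval step over a toy prime field: absorb four
# coefficients (c3 is highest order) into a running Horner accumulator.
P = 2**31 - 1  # toy modulus

def horner_step(acc, c3, c2, c1, c0, x, p=P):
    # acc' = ((((acc*x + c3)*x + c2)*x + c1)*x) + c0
    for c in (c3, c2, c1, c0):
        acc = (acc * x + c) % p
    return acc

# Absorbing all coefficients of 5x^3 + 4x^2 + 3x + 2 evaluates the polynomial.
x = 10
assert horner_step(0, 5, 4, 3, 2, x) == 5 * x**3 + 4 * x**2 + 3 * x + 2
```

Chaining these steps over successive memory words evaluates an arbitrary-degree polynomial four coefficients at a time.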
Regarding using a repeat instead of a while loop, it should be doable, as Horner's method can accommodate evaluating polynomials of degree less than an upper bound in a pretty straightforward manner. The only thing needed is to pad the high-order coefficients with zeros. This shouldn't complicate things too much as far as I can see. However, having an instruction that can support an arbitrary degree bound is preferable in my opinion. The Falcon signature, for example, would benefit from such an instruction.
I like the modification to the `rcomb2` instruction - but as you noted, the degree goes up by 1. What would be the overall degree now?
The modified instruction does the following $$(r, p) = (r, p) + \left(\alpha_i\cdot\left({T_i(x) - T_i(z)}\right), (1 - b)\cdot\alpha_i\cdot\left({T_i(x) - T_i(z \cdot g)} \right)\right)$$ I am counting degree-$3$ constraints here: $2$ (extension) field multiplications and $1$ (base) field multiplication by $1 - b$. The other thing that is missing is the memory consistency check between the values in the helper registers and the pointers sitting on the stack. As far as I can see, these shouldn't affect the count.
I think the last open question is how many extra stack cycles would the sumcheck (and other GKR-related) operations take. Do we have an estimate here?
We should be able to zoom in on a good estimate of that soon.
However, having an instruction that can support an arbitrary degree bound is preferable in my opinion. Falcon signature would benefit from such an instruction as an example.
I think we can accommodate arbitrary degree bounds with a `repeat` loop as well, as long as the bound is known at compile time (which I think is the case for Falcon signature?).
The main benefit is that as long as the actual values are not too far from the bound, using `repeat` should be quite a bit more efficient. For example:
repeat.8
mem_stream
horneval
end
Would be exactly 16 cycles, while:
while.true
mem_stream
horneval
end
assuming it terminates after 8 iterations, would be 32 cycles (as `while.true` and `end` instructions each require a cycle) and would also require 8 hashes to be computed in the hasher chiplet.
Btw, I noticed that as written the degree of the instruction would be 5, which is quite high. But we can reduce it by using some of the output slots for temp values.
However, having an instruction that can support an arbitrary degree bound is preferable in my opinion. Falcon signature would benefit from such an instruction as an example.

I think we can accommodate arbitrary degree bounds with a `repeat` loop as well, as long as the bound is known at compile time (which I think is the case for Falcon signature?).

Yes, indeed. It is actually much easier in the context of Falcon than in the context of random linear combinations. The degree is always 512.
The main benefit is that as long as the actual values are not too far from the bound, using `repeat` should be quite a bit more efficient. For example:

repeat.8
mem_stream
horneval
end
Would be exactly 16 cycles, while:
while.true
mem_stream
horneval
end

assuming it terminates after 8 iterations, would be 32 cycles (as `while.true` and `end` instructions each require a cycle) and would also require 8 hashes to be computed in the hasher chiplet.
Makes sense! Then for all practical trace lengths, using a repeat based on an upper bound should be more performant than the while-loop based solution.
Btw, I noticed that as written the degree of the instruction would be 5, which is quite high. But we can reduce it by using some of the output slots for temp values.
That's correct. I was thinking that we could do something like we did with the folding instruction where we use some of the stack values for degree reduction, as you just described.
Actually, I forgot to mention that for Falcon, we might gain some performance benefits from using Horner evaluation, as it will save around 10k cycles that are spent computing powers of $\tau$ and setting a region of memory to the all-zero polynomial.
One thing in relation to Falcon is that operations are in the base field. But I guess we can just pad words in the memory layout?
As far as I can see, the three polynomials are loaded from the advice tape and hence we can just load them as extension field elements. This will mean that some of the `repeat` blocks will double in length, but this is still more than compensated for by the gains described above.
Say we are working with a trace of length $2^{\nu}$ and we have $2^{\mu}$ fractions of the form $\frac{\mathsf{p}(i_0,\dots,i_{\mu - 1}, s_1(x_0,\dots,x_{\nu - 1}),\dots,s_n(x_0,\dots,x_{\nu - 1}))}{\mathsf{q}(i_0,\dots,i_{\mu - 1}, s_1(x_0,\dots,x_{\nu - 1}),\dots,s_n(x_0,\dots,x_{\nu - 1}))}$. Then we will have $$\rho := \mu + \nu$$ sum-check rounds in the final sum-check for the final GKR layer. We will also have $\rho$ rounds in total for all the sum-checks in the remaining GKR layers.
During each of the sum-check rounds in the last GKR layer, the (non-interactive) verifier receives the coefficients of a (univariate) polynomial of degree at most $$d := \max_{(i_0,\dots,i_{\mu - 1})\in\{0,1\}^{\mu}}\left(\mathsf{deg}_{x_i}(\mathsf{p}(i_0,\dots,i_{\mu - 1}, s_1(x_0,\dots,x_{\nu - 1}),\dots,s_n(x_0,\dots,x_{\nu - 1}))) + \mathsf{deg}_{x_i}(\mathsf{q}(i_0,\dots,i_{\mu - 1}, s_1(x_0,\dots,x_{\nu - 1}),\dots,s_n(x_0,\dots,x_{\nu - 1})))\right) + 1$$
The verifier then evaluates the received polynomial at the round challenge (using `mstream` and `horneval`).

For the remaining $\rho$ sum-check rounds, the verifier does exactly the same as above but works instead with polynomials of degree at most $3$.
Thus the amount of work in cycles that the verifier does in the rounds of sum-check leading to the final GKR check is $$\rho\cdot\left(10 + \frac{d}{2}\right) + \rho\cdot\left(10 + 1\right)$$
At the end of all the sum-check rounds, the verifier would have sampled $(r_0^{'},\dots,r_{\mu - 1}^{'}, r_0,\dots,r_{\nu - 1})$ during the last sum-check. The verifier now:
- checks that $\sum_{i=0}^{n - 1} \lambda^i s_i(x_0,\dots,x_{\nu - 1})$ opens to $v$ at $(r_0,\dots,r_{\nu - 1})$.

The cost of this step should be comparable to the current cost of constraint evaluation involving auxiliary columns.
During the sum-checks for the layers before the final layer, the verifier essentially receives $4$ values and reduces these to $1$ value using randomness.
The cost of these steps is basically one call to the hash function per GKR-layer (i.e., $\rho$ layers) and a few extension field operations (2 multiplications and 2 additions), say $20$ cycles.
The total estimated cost of the above, excluding the final GKR check, is approximately $$\rho\cdot\left(21 + \frac{d}{2}\right) + \rho\cdot\left(20 + 1\right)$$ i.e., $$\left(\mu + \nu \right)\cdot\left(42 + \frac{d}{2}\right)$$
For example, if $\mu = 6$, $\nu = 20$ and $d = 18$, we get a count of $1326$ cycles using our approximate estimate and without the cost of the final GKR check.
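The arithmetic behind this estimate can be checked directly:

```python
# Estimated sum-check verifier cost, excluding the final GKR check:
# (mu + nu) * (42 + d / 2), per the formula above.
mu, nu, d = 6, 20, 18
cycles = (mu + nu) * (42 + d // 2)
assert cycles == 1326
```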
Note that, for the final sum-check, some of the polynomials (the last $\mu$ ones) sent to the verifier are of degree at most $3$ and thus the core work performed by the verifier (as well as the prover) is concentrated at the last $\nu$ steps, in addition to the final evaluation check.
As a part of recursive proof verification, we need to compute random linear combinations of all trace columns for all queried values. Mathematically, this is described as:
$$ \sum_{i=0}^k{\left(\alpha_i \cdot \frac{T_i(x) - T_i(z)}{x - z} +\beta_i \cdot \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
Where $T_i$ is a trace polynomial, $x$ is a query location, $z$ is the out-of-domain point, and $\alpha_i$ and $\beta_i$ are random coefficients.
Currently, we have 72 base field columns and 9 extension field columns. Thus, for each query we need to perform almost 200 subtractions, 200 multiplications, and 100 additions (I don't count divisions as they can be done just twice per query). If we do this naively, this will probably require a couple thousand cycles per query, and since we need about 27 queries, this computation alone could require 60K+ cycles.
However, we can introduce two new VM operations to speed this computation up dramatically. Let's say our goal is to do this in about 10K cycles. We'll need two operations: one to work with main trace columns (in the base field) and the other one to work with the auxiliary trace columns (in the extension field). We'll also need to modify the way in which we generate random values (more on this later). Below, I'm going to describe only the first operation (the one working with base field columns), as the second operation would work very similarly.
The basic idea is that we'd like to combine this new operation (let's call it `RCOMB1`) with the already existing `MSTREAM` operation. As a reminder, `MSTREAM` loads values $v_0, ..., v_7$, located in memory starting at the address $a$, onto the stack. Said another way, `MSTREAM` loads two words from memory starting at address $a$ and increments $a$ by two to point at the next word pair.
RCOMB1
could do is follow theMSTREAM
operation to compute the numerators of the above expression in a single cycle and add them into a running sum. Specifically, denoting running sum for the first term as $p$ and running sum of the second term as $r$, we want it to do the following:$$ p' = p + \alpha_i \cdot (T_i(x) - T_i(z)) $$
$$ r' = r + \beta_i \cdot (T_i(x) - T_i(z \cdot g)) $$
Once we do the above for all $i$, we can get the final result by computing:
$$ \frac{p}{x - z} + \frac{r}{x - z \cdot g} $$
And this can be computed using regular instructions.
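As a hedged base-field Python sketch of the whole scheme (the real computation is over the extension field; all values below are arbitrary), the running sums and the single final division per denominator can be checked against the direct sum:

```python
# Toy-field sketch of the RCOMB1 accumulation and the final division.
# T_x[i], T_z[i], T_zg[i] stand for T_i(x), T_i(z), T_i(z*g); alphas/betas
# are the random coefficients. All values are arbitrary illustrations.
P = 2**31 - 1
inv = lambda a: pow(a, P - 2, P)  # field inverse via Fermat's little theorem

def deep_query_value(T_x, T_z, T_zg, alphas, betas, x, z, zg, p=P):
    acc_p = acc_r = 0
    for tx, tz, tzg, a, b in zip(T_x, T_z, T_zg, alphas, betas):
        acc_p = (acc_p + a * (tx - tz)) % p   # p' = p + alpha_i*(T_i(x) - T_i(z))
        acc_r = (acc_r + b * (tx - tzg)) % p  # r' = r + beta_i*(T_i(x) - T_i(z*g))
    # Divide by each denominator only once, at the very end.
    return (acc_p * inv((x - z) % p) + acc_r * inv((x - zg) % p)) % p

T_x, T_z, T_zg = [5, 6], [1, 2], [3, 4]
alphas, betas = [7, 8], [9, 10]
x, z, zg = 100, 50, 60
direct = sum(
    (a * (tx - tz) * inv(x - z) + b * (tx - tzg) * inv(x - zg)) % P
    for tx, tz, tzg, a, b in zip(T_x, T_z, T_zg, alphas, betas)
) % P
assert deep_query_value(T_x, T_z, T_zg, alphas, betas, x, z, zg) == direct
```

Deferring the divisions is what makes the per-column update cheap enough to fit in a single cycle.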
Operation description
Let's assume that we use the `MSTREAM` operation to load 8 values of $T_i(x)$ onto the stack. We'd like to follow it up with 8 invocations of `RCOMB1` to "absorb" all 8 values into the running sums, and then we can repeat the process.

To make the above work, here is how we can define the `RCOMB1` operation. In its stack layout:
- `T0, ..., T7` are the $T_i(x)$ values loaded by calling `MSTREAM` prior to the `RCOMB1` invocations.
- `p0, p1` is the running sum $p$ (we need two elements since $p$ is in the extension field).
- `r0, r1` is the running sum $r$ (it is also in the extension field).
- `x_addr` is the memory address from which we are loading $T_i(x)$ values.
- `z_addr` is the memory address from which we are loading $T_i(z)$ and $T_i(z \cdot g)$ values.
- `Tz0, Tz1` and `Tzg0, Tzg1` are the $T_i(z)$ and $T_i(z \cdot g)$ values respectively.
- `a0, a1` is a random coefficient $\alpha$.
z_addr
into helper registers. This assumes that these values are stored in a single word. This is not how it works currently, but it shouldn't be too difficult to change this.z_addr
so that it points to the next word for the subsequent calls toRCOMB1
.T0
in the picture above). Notice that we use $\alpha_i^2$ here instead of $\beta_i$. Again, this is not how it works currently, but we should be able to change this as well.RCOMB1
it will be $3^8$. We should understand if this is an issue.T1
moves into the 8th stack slot to prepare for the next invocation ofRCOMB1
Given the above expressions, the max transition constraint degree for this operation should be 3.
Cycle estimate
Assuming the above works, we can expect the following costs per query:
Main trace: 72 columns can be processed in 9 batches of 8 columns each, where each batch takes 9 cycles (1 `MSTREAM` and 8 invocations of `RCOMB1`).

Overall, we get about 150 cycles per query, or about 4K cycles for 27 queries.
Auxiliary trace: Let's round up the number of auxiliary trace columns from 9 to 12 to be conservative. The costs would be 3 batches of 4 columns each, where each batch takes 5 cycles (1 `MSTREAM` and 4 invocations of `RCOMB2`).

Summing the above together for 27 queries, we get about 2.3K cycles.
So, for both main and auxiliary trace columns, we should be able to compute random linear combinations in under 7K cycles.