Open bobbinth opened 1 year ago
This is great! I think that this is definitely the way to go but I have a few remarks:
To simplify notation, we can write
$$ \sum_{i=0}^k{\left(\alpha_i \cdot \frac{T_i(x) - T_i(z)}{x - z} +\beta_i \cdot \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
as
$$ \sum_{i=0}^k{\left(\alpha_i \cdot A_i +\beta_i \cdot B_i \right)} $$
Now, if we decide to go with one random value, call it $\alpha$, to compute the whole sum we can write the above as
$$ \sum_{i=0}^k{\left(\alpha^{2i} \cdot A_i +\alpha^{2i+1} \cdot B_i \right)} $$
which is equal to
$$ \sum_{i=0}^k{\alpha^{2i} \cdot A_i} +\alpha\left(\sum_{i=0}^k{\alpha^{2i} \cdot B_i}\right) $$
and can be rewritten as
$$ \sum_{i=0}^k{\zeta^{i} \cdot A_i} +\alpha\left(\sum_{i=0}^k{\zeta^{i} \cdot B_i}\right) $$
where $\zeta := \alpha^2$. Now we see that we have two polynomials evaluated at $\zeta$ and we can use Horner's rule to make the instructions depend only on $\alpha$. Indeed, the update equations can be changed to $$p' = p \cdot \zeta + A_i$$ and $$r' = r \cdot \zeta + B_i$$
and the final step $$\frac{p}{x - z} + \alpha\frac{r}{x - z \cdot g}$$
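As a quick numerical sanity check of these update equations, the following plain-Python sketch (over Miden's Goldilocks base field rather than the extension field, with arbitrary values) verifies that the two Horner accumulators reproduce the batched sum when the $A_i$, $B_i$ are consumed in reverse order:

```python
# Sanity check: with zeta = alpha^2 and updates p' = p*zeta + A_i,
# r' = r*zeta + B_i (values consumed in REVERSE order), the final
# p + alpha*r equals sum_i (alpha^(2i)*A_i + alpha^(2i+1)*B_i).
P = 2**64 - 2**32 + 1  # Goldilocks prime (Miden's base field)

A = [3, 1, 4, 1, 5, 9, 2, 6]   # the A_i, arbitrary
B = [2, 7, 1, 8, 2, 8, 1, 8]   # the B_i, arbitrary
alpha = 123456789
zeta = alpha * alpha % P

direct = sum(pow(alpha, 2 * i, P) * A[i] + pow(alpha, 2 * i + 1, P) * B[i]
             for i in range(len(A))) % P

p = r = 0
for a_i, b_i in zip(reversed(A), reversed(B)):  # reverse order!
    p = (p * zeta + a_i) % P
    r = (r * zeta + b_i) % P
horner = (p + alpha * r) % P
assert direct == horner
```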
One cumbersome thing about the above is the fact that we need the values to be in reverse order, hence I am not sure how this would fit into the whole thing. However, there is a nice solution, which is to generate the randomness in the reverse order. Unfortunately, this will require changing things at the level of the prover too.
- The above relies heavily on
MSTREAM
and the ability to load from memory into helper registers using a pointer on the top of the stack. I don't remember having seen these operations before. Do we have them documented somewhere?
It actually doesn't seem like we have it documented anywhere besides the code itself (i.e., here). It is part of the mem_stream assembly instruction, which is basically MSTREAM HPERM.
- The way the random challenge $\alpha$ is updated should be changed to $\alpha_n = \alpha_o \cdot \alpha^2$, where $\alpha_o$ is the old value and $\alpha_n$ is the new value. This also explains why the powers of $\alpha$ should not grow exponentially. One potential way to solve this would be to keep the (one and only) $\alpha$ in the helper registers. Since we use only one $\alpha$, this should be done only once, but I am not sure if this can be done in the first place.
This would be somewhat tricky to do. We can't really put $\alpha$ in the helper registers because these don't persist from one cycle to the next and we don't have anything on the main part of the stack that could "bind" helper register values to the value of $\alpha$ (the reason we can do this with $T_i(z)$ values is because z_addr
is on the stack).
One way we can do it is by modifying the RCOMB1 operation to look as follows:
Here, we put both $\alpha$ and $\alpha_i$ onto the stack. This allows us to compute the next value of $\alpha_i$ as:
$$ \alpha_i' = \alpha_i \cdot \alpha^2 $$
But this comes at the expense of reducing the number of $T_i(x)$ values on the stack from 8 to 6. The other 2 values would need to be handled manually (i.e., by saving them to local memory or something like that). This feels a bit "hacky" and also adds extra cycles - i.e., we'd probably go from 7K cycles to 10K cycles.
Another approach is to use your observation about Horner evaluation and use the same $\zeta$ for both running sums. For example, we could pre-compute even powers of $\alpha$ and save them to memory. Then, instead of putting $\alpha$ in the main stack, we'd put memory address for it onto the main stack and the actual value into helper registers. This modified operation would look like so:
Having to have the value in reverse order is indeed a problem as I'm not sure how to do that efficiently.
Another approach is to use your observation about Horner evaluation and use the same ζ for both running sums. For example, we could pre-compute even powers of α and save them to memory. Then, instead of putting α in the main stack, we'd put memory address for it onto the main stack and the actual value into helper registers. This modified operation would look like so:
Why not put $\zeta$ (or in fact just $\alpha$, at the cost of a higher-degree constraint) on the stack instead of a_addr? That's actually the point of using Horner.
Putting the pointer to a series of powers of $\alpha$ would in fact allow us to avoid the problem of reading the values in reverse order, since we won't need to use Horner.
One more thing which is both simplifying and good for the soundness of FRI: according to equation (11) here, @ulrich-haboeck shows that one can simplify the problem to
$$ \sum_{i=0}^k{\alpha^i \left( \frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
This reduces the degree of the constraints as well.
Why not put $\zeta$ (or in fact just $\alpha$, at the cost of a higher-degree constraint) on the stack instead of a_addr? That's actually the point of using Horner.
If we put something else in place of a_addr
then we can't put the $a_i$ value in the helper registers - something on the main stack should define which values go into the helper registers.
Putting the pointer to a series of powers of $\alpha$ would in fact allow us to avoid the problem of reading the values in reverse order, since we won't need to use Horner.
Yeah, as I think about it more, reading $\alpha$ values from memory seems to be the best approach. So, the instruction would look like this:
If we compute the random linear combination as:
$$ \sum_{i=0}^k{\alpha_i \left( \frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
We can actually use independently drawn random values and don't need to give up any bits of security.
Or, even if we have to use different values for the left and right terms, we can use just a single power like so:
$$ \sum_{i=0}^k{\left( \alpha_i \frac{T_i(x) - T_i(z)}{x - z} + \alpha_i^2 \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
In the above, all $\alpha_i$ would still be independently drawn - but we use a square of $\alpha_i$ for the second term. Security impact of this should be negligible, I believe (but would be good to confirm).
The only drawback of going with this approach that I can think of is that we need to perform an extra memory read (2 reads total) per cycle. But we already do this for MSTREAM
- so, most of the machinery for this has already been built.
If we put something else in place of
a_addr
then we can't put the $a_i$ value in the helper registers - something on the main stack should define which values go into the helper registers.
I am not sure I follow; with Horner there is no need for $a_i$ in the first place. Only the original random value needs to be preserved on the stack. This is because the update equations, as mentioned above, are:
$$ p = \frac{T_i(x) - T_i(z)}{x - z} + p\cdot\alpha $$
and
$$ r = \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g}+ r\cdot\alpha $$
Ah yes! This does mean that if we go with Horner evaluation we'd be able to use a single $\alpha$ on the stack. This would simplify things. The only challenge remaining with Horner approach is that column values need to be in reverse order - or can we avoid it somehow?
Unfortunately, I don't see a way to avoid that in a clean way. Moreover, using powers of a single random value for FRI batching degrades the soundness of the commit phase in a non-negligible way. This might not be a problem when using the cubic extension field but is a significant issue for the quadratic extension case (in the list-decoding regime). Thus I think that keeping a pointer to a list of random values $\alpha_i$, one for each term of the random linear combination, is the way to go. The last remaining question is then which of the two ways of using $\alpha_i$ we should use. If the second option (the one without $\alpha_i^2$) works, then I think it should be the preferred option. I have run some initial calculations and it seems that the soundness of
$$ \sum_{i=0}^k{\alpha_i \left( \frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
and
$$ \sum_{i=0}^k{\alpha_i \left( \frac{T_i(x) - T_i(z)}{x - z} \right) + \beta_i \left(\frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
is indistinguishable in our setting provided the argument for using the first form is valid.
One small issue relating to the random elements $\alpha_i$ in
$$ \sum_{i=0}^k{\alpha_i \left( \frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
relates to the fact that for each column $i$ we are using one $\alpha_i$, but the $\alpha_i$'s are usually stored as pairs [ɑ0, ɑ1, ɑ0', ɑ1'].
This can be overcome by adding an additional input (the last available slot on the stack) to the candidate instruction that is a boolean variable $b$ that alternates between $0$ and $1$ to indicate which of the two alphas to choose.
One small issue relating to the random elements $\alpha_i$
Ah - indeed! Thank you for catching it!
This can be overcome by adding an additional input (the last available slot on the stack) to the candidate instruction that is a boolean variable $b$ that alternates between $0$ and $1$ to indicate which of the two alphas to choose.
Unfortunately, I'm not sure we'll be able to do it this way. The main reason is that to prove consistency of memory reads we need to provide values for all 4 elements of the word (so, we need to put $\alpha_0$, $\alpha_1$, $\alpha_0'$, $\alpha_1'$ somewhere on the stack or into helper registers). There are two potential ways to handle this:
- We could save the random values to memory as [ɑ0, ɑ1, 0, 0]. This will work for the memory consistency check (as we can hardcode zeros into the constraints) but will make generating/saving random values more expensive (probably doubling the cost).
- If we are willing to "hard-code" the width of the trace, then we can drop either z_addr or a_addr, by laying out the randomness right after the OOD frame, and use an offset instead. Then we would have two available slots on the stack which we can use for keeping the other random element ($\beta$). In this case, the sum that makes most sense is
$$ \sum_{i=0}^k{\left(\alpha_i \cdot \frac{T_i(x) - T_i(z)}{x - z} +\beta_i \cdot \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
This, unfortunately, will mean that we will have to double the number of random elements.
If we are willing to roughly double the cost of generating randomness for this part (which I think in our case would result in about 500 extra cycles), I'd probably look into creating a version of this procedure to save randomness to memory as [ɑ0, ɑ1, 0, 0]. For example, we could change this line to something like:
# save R1 to mem[dest] and mem[dest + 1]
swapw movup.3 movup.3 push.0.0 dup.15 mem_storew
movup.5 movup.5 movup.3 movup.3 dup.15 add.1 mem_storew drop drop
The above adds 11 cycles to the loop (and doesn't handle everything) - but maybe there is a better way to do something like that.
I agree; this also has the advantage of not needing any "hardcoding". All in all, I think this concludes the last remaining question/issue related to the design of this new op.
Some different variations on the above problem are gathered in what follows. This is done in order to inform the design of the proposed instruction in the hope of making it as general/useful as possible.
Fix an $x$ in the LDE domain and let
$$ \mathsf{S} := \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)}. $$
Then
$$
\begin{aligned}
\mathsf{S} &= \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z)}{x - z} \right)} + \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} \\
&= \sum_{i=0}^k {\left(\frac{\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z)}{x - z} \right)} + \sum_{i=0}^k{\left(\frac{\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z \cdot g)}{x - z \cdot g} \right)} \\
&= \frac{1}{x - z}\sum_{i=0}^k \left(\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z) \right) +\frac{1}{x - z \cdot g} \sum_{i=0}^k\left(\alpha_i\cdot T_i(x) - \alpha_i\cdot T_i(z \cdot g) \right) \\
&= \frac{1}{x - z} \left(\mathsf{ip}_x - \mathsf{ip}_z \right) +\frac{1}{x - z \cdot g} \left(\mathsf{ip}_x - \mathsf{ip}_{gz} \right)
\end{aligned}
$$

where

$$ \mathsf{ip}_x := \langle\alpha_{\cdot}, T_{\cdot}(x)\rangle = \sum_{i=0}^k\alpha_i\cdot T_i(x) $$

$$ \mathsf{ip}_z := \langle\alpha_{\cdot}, T_{\cdot}(z)\rangle = \sum_{i=0}^k\alpha_i\cdot T_i(z) $$

$$ \mathsf{ip}_{gz} := \langle\alpha_{\cdot}, T_{\cdot}(z \cdot g)\rangle = \sum_{i=0}^k\alpha_i\cdot T_i(z \cdot g) $$

Notice the following:
- $\mathsf{ip}_z$ and $\mathsf{ip}_{gz}$ are independent of $x$ and as such they can be computed once and for all FRI queries.
- The only quantities depending on $x$ are $\mathsf{ip}_x$, $\frac{1}{x - z}$ and $\frac{1}{x - z \cdot g}$.
- Computing $\mathsf{ip}_x$, $\mathsf{ip}_z$ or $\mathsf{ip}_{gz}$ amounts to the same as computing an inner product of two vectors.

Instead of working with

$$ \sum_{i=0}^k\alpha_i {\left(\frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$

we can work instead with

$$ \sum_{i=0}^k\alpha_i {\frac{T_i(x) - p_i(x)}{(x - z)(x - z \cdot g)}} $$
where $p_i(x) = a_i + b_i\cdot x$ is the line interpolating $\{\left(z, T_i(z)\right), \left(z \cdot g, T_i(z \cdot g)\right)\}$.
As above, this can be written as
$$ \frac{1}{(x - z)(x - z \cdot g)} \left(\mathsf{ip}_x - \left(\mathsf{ip}_a + x\mathsf{ip}_b\right)\right) $$
Again:
- $\mathsf{ip}_a$ and $\mathsf{ip}_{b}$ are independent of $x$ and as such they can be computed once and for all FRI queries.
- The only quantities depending on $x$ are $\mathsf{ip}_x$, $\frac{1}{(x - z)(x - z \cdot g)}$ and the evaluation $\left(\mathsf{ip}_a + x\mathsf{ip}_b\right)$.
- Computing $\mathsf{ip}_x$, $\mathsf{ip}_a$ or $\mathsf{ip}_{b}$ amounts to the same as computing an inner product of two vectors.

The advantage of the one-quotient method over the previous method is the fact that it uses only one quotient, i.e., one division. This comes at the cost of needing to compute $a_i$ and $b_i$ from $\{\left(z, T_i(z)\right), \left(z \cdot g, T_i(z \cdot g)\right)\}$, but only once for each $i$, and one extension-field multiply-add operation to compute $\left(\mathsf{ip}_a + x\mathsf{ip}_b\right)$ at each $x$.
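Both rewritings can be checked numerically. The following plain-Python sketch (Goldilocks base field, arbitrary random columns; the VM actually works in an extension field) verifies the two-quotient inner-product identity and the one-quotient line-interpolation identity:

```python
# Check both rewritings over the Goldilocks prime:
#   (1) sum_i a_i [(T_i(x)-T_i(z))/(x-z) + (T_i(x)-T_i(zg))/(x-zg)]
#          == (ip_x - ip_z)/(x-z) + (ip_x - ip_gz)/(x-zg)
#   (2) sum_i a_i (T_i(x) - p_i(x)) / ((x-z)(x-zg))
#          == (ip_x - (ip_a + x*ip_b)) / ((x-z)(x-zg))
P = 2**64 - 2**32 + 1  # Goldilocks prime
inv = lambda v: pow(v, P - 2, P)
import random
random.seed(0)

k, x, z, g = 5, 11, 7, 3
zg = z * g % P
ev = lambda c, pt: sum(ci * pow(pt, d, P) for d, ci in enumerate(c)) % P
polys = [[random.randrange(P) for _ in range(4)] for _ in range(k)]  # random "columns"
Tx = [ev(c, x) for c in polys]
Tz = [ev(c, z) for c in polys]
Tzg = [ev(c, zg) for c in polys]
alphas = [random.randrange(P) for _ in range(k)]

# (1) two quotients vs. inner-product form
lhs1 = sum(al * ((Tx[i] - Tz[i]) * inv(x - z) + (Tx[i] - Tzg[i]) * inv(x - zg))
           for i, al in enumerate(alphas)) % P
ipx = sum(al * t for al, t in zip(alphas, Tx)) % P
ipz = sum(al * t for al, t in zip(alphas, Tz)) % P
ipgz = sum(al * t for al, t in zip(alphas, Tzg)) % P
rhs1 = ((ipx - ipz) * inv(x - z) + (ipx - ipgz) * inv(x - zg)) % P
assert lhs1 == rhs1

# (2) one quotient with interpolating lines p_i(x) = a_i + b_i * x
b = [(Tz[i] - Tzg[i]) * inv(z - zg) % P for i in range(k)]  # slopes
a = [(Tz[i] - b[i] * z) % P for i in range(k)]              # intercepts
D = (x - z) * (x - zg) % P
lhs2 = sum(al * (Tx[i] - (a[i] + b[i] * x)) for i, al in enumerate(alphas)) * inv(D) % P
ipa = sum(al * ai for al, ai in zip(alphas, a)) % P
ipb = sum(al * bi for al, bi in zip(alphas, b)) % P
rhs2 = (ipx - (ipa + x * ipb)) * inv(D) % P
assert lhs2 == rhs2
```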
The case of $\geq 3$ evaluation points

Sometimes we might be interested in evaluations of some of the $T_i$'s at more than just $2$ points. For the sake of concreteness, assume $T_k$ is evaluated at $4$ points. This means that the last term in the above sum will be

$$ \alpha_k \left(\frac{T_k(x) - T_k(z_0)}{x - z_0} + \frac{T_k(x) - T_k(z_1)}{x - z_1} + \frac{T_k(x) - T_k(z_2)}{x - z_2} + \frac{T_k(x) - T_k(z_3)}{x - z_3} \right) $$
If we want to avoid the four divisions, we can instead work with
$$ \frac{\alpha_k}{\left(x - z_0\right)\cdot\left(x - z_1\right)\cdot\left(x - z_2\right)\cdot\left(x - z_3\right)} \left({T_k(x) - p_k(x)} \right) $$
where $p_k$ is the polynomial of degree at most $3$ interpolating the points $(z_i, T_k(z_i))$. Again $p_k$ needs to be computed only once for all $x$'s.
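The divisibility claim behind the single quotient can be checked directly. In this plain-Python sketch (Goldilocks base field, arbitrary values), $p_k$ is obtained by Lagrange interpolation and $T_k(X) - p_k(X)$ is divided by each $(X - z_i)$ in turn, leaving a zero remainder every time:

```python
# p_k interpolates T_k at the opening points, so T_k(X) - p_k(X) is
# divisible by prod_i (X - z_i): dividing by each root via synthetic
# division must leave a zero remainder every time.
P = 2**64 - 2**32 + 1  # Goldilocks prime
inv = lambda v: pow(v, P - 2, P)
import random
random.seed(1)

def lagrange(points):
    """Coefficients (low to high) of the interpolant through `points`."""
    m = len(points)
    coeffs = [0] * m
    for i, (xi, yi) in enumerate(points):
        basis, denom = [1], 1            # prod_{j != i} (X - x_j) / (x_i - x_j)
        for j, (xj, _) in enumerate(points):
            if j == i:
                continue
            basis = [(sh - xj * cur) % P for sh, cur in zip([0] + basis, basis + [0])]
            denom = denom * (xi - xj) % P
        scale = yi * inv(denom) % P
        for d, bd in enumerate(basis):
            coeffs[d] = (coeffs[d] + scale * bd) % P
    return coeffs

ev = lambda c, pt: sum(ci * pow(pt, d, P) for d, ci in enumerate(c)) % P
T = [random.randrange(P) for _ in range(8)]   # T_k as a random degree-7 polynomial
zs = [5, 10, 20, 40]                          # the 4 opening points z_0..z_3
p = lagrange([(zi, ev(T, zi)) for zi in zs])
assert all(ev(p, zi) == ev(T, zi) for zi in zs)

num = [(t - q) % P for t, q in zip(T, p + [0] * (len(T) - len(p)))]
for zi in zs:
    assert ev(num, zi) == 0                   # remainder of division by (X - zi)
    acc, out = 0, []
    for coef in reversed(num):                # synthetic division by (X - zi)
        acc = (acc * zi + coef) % P
        out.append(acc)
    num = list(reversed(out[:-1]))            # quotient, low to high
assert len(num) == len(T) - len(zs)           # degree dropped by 4
```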
One problem with the above is obtaining $p_k$ from the evaluations $(z_i, T_k(z_i))$ when the number of evaluations is relatively large. Here are two potential solutions to this:
Regarding the question of computing the coefficients of the $p_k$ polynomial non-deterministically, the following is an attempt at solving it. Assume that the following procedure costs one cycle:
#! +-------+-------+-------+-------+-------+-------+-------+-------+------+------+------+------+------+------+------+---+
#! | C01 | C00 | C11 | C10 | C21 | C20 | C31 | C30 | x1 | x0 | acc1 | acc0 |c_addr| ctr | - | - |
#! +-------+-------+-------+-------+-------+-------+-------+-------+------+------+------+------+------+------+------+---+
#!
#! ||
#! \/
#!
#! +---+-------+-------+-------+-------+-------+-------+-------+-------+------+------+-------+-------+------+-----+-----+
#! | b | C01 | C00 | C11 | C10 | C21 | C20 | C31 | C30 | x1 | x0 | acc1' | acc0' |c_addr| ctr'| - |
#! +---+-------+-------+-------+-------+-------+-------+-------+-------+------+------+-------+-------+------+-----+-----+
#!
#! where:
#!
#! 1. acc' = (acc0', acc1') := ((((acc * x + c3) * x + c2) * x + c1) * x) + c0
#! 2. ctr' = ctr + 4
#! 3. b = 0 if ctr' == 0 else 1
#!
#! Cycles: 1
export.horner_eval
end
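To clarify the intended semantics, here is a plain-Python model of this hypothetical horner_eval step (the function name and the batch-of-4 convention follow the comment above; values are arbitrary):

```python
# Model of the hypothetical `horner_eval` step: fold one batch of 4
# coefficients (highest-degree first) into the Horner accumulator and
# advance the counter; b flags whether more batches remain.
P = 2**64 - 2**32 + 1  # Goldilocks prime

def horner_eval_step(batch, x, acc, ctr):
    """batch = [c3, c2, c1, c0], higher-degree coefficients first."""
    c3, c2, c1, c0 = batch
    acc = ((((acc * x + c3) * x + c2) * x + c1) * x + c0) % P
    ctr += 4
    b = 0 if ctr == 0 else 1
    return acc, ctr, b

coeffs = [9, 8, 7, 6, 5, 4, 3, 2]   # low to high: 9 + 8x + ... + 2x^7
x = 3
hi_to_lo = list(reversed(coeffs))   # batches are consumed highest first
acc, ctr = 0, -len(coeffs)          # ctr starts at -(number of coefficients)
for i in range(0, len(coeffs), 4):
    acc, ctr, b = horner_eval_step(hi_to_lo[i:i + 4], x, acc, ctr)
assert b == 0 and ctr == 0          # all batches consumed
expected = sum(c * pow(x, d, P) for d, c in enumerate(coeffs)) % P
assert acc == expected
```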
Then, for a trace of length $2^{20}$ i.e., $21$ evaluation points, we can do as follows:
#! Input: [COEFF_PTR, EVALS_PTR, z1, z0, ...]
#! Output: [...]
#! Cycles: 1350
export.check_transformation_21
# 0) Push the number of coefficients for later Horner evaluation
push.24 neg
dup.1
#=> [COEFF_PTR, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...]
# 1) Use `adv_pipe` to load the coefficients
## a) Prepare stack
dup padw padw padw
## b) Load the coefficients from the advice stack
# The coefficients are loaded in reverse order and they are padded with 0-coefficients
# for the highest monomials in order to make the number of coefficients divisible by 4
repeat.6
adv_pipe hperm
end
#=> [Y, y, y, t1, t0, Y, y, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...]
# 2) Evaluate the polynomial using the coefficients at (t0, t1)
## a) Reset the pointer to point to the coefficients
swapw.3 drop dup.1
#=> [Y, y, y, t1, t0, Y, COEFF_PTR, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...]
## b) Prepare the stack for Horner evaluation by creating an accumulator to hold the evaluation
swapw drop drop push.0 movdn.2 push.0 movdn.2
#=> [t1, t0, 0, 0, Y, Y, COEFF_PTR, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...]
swapw.2
#=> [Y, Y, t1, t0, 0, 0, COEFF_PTR, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...]
## c) Evaluate
push.1
while.true
mem_stream horner_eval
end
#=> [Y, Y, t1, t0, res1, res0, COEFF_PTR, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...] where (res0, res1) is the evaluation of the polynomial
# at (t0, t1) using the coefficients.
## d) Clean up the stack
dropw dropw
#=> [t1, t0, res1, res0, COEFF_PTR, ctr, COEFF_PTR, EVALS_PTR, z1, z0, ...]
swapw drop drop swap
#=> [EVALS_PTR, COEFF_PTR, t1, t0, res1, res0, z1, z0, ...]
# 3) Evaluate the polynomial using the Barycentric (2nd) formula at t/z where t is the random challenge
# and z is the OOD point. We evaluate at t/z in order to have the evaluation points of the Barycentric
# formula be independent of z, and hence we can hardcode the evaluation points and the barycentric weights
# for increased performance.
## a) Compute t/z
movup.3 movup.3 movup.7 movup.7
#=> [z1, z0, t1, t0, EVALS_PTR, COEFF_PTR, res1, res0, ...]
ext2div ext2mul
#=> [zeta1, zeta0, EVALS_PTR, COEFF_PTR, res1, res0, ...]
## b) Create accumulators
padw
#=> [q1, q0, p1, p0, zeta1, zeta0, EVALS_PTR, COEFF_PTR, res1, res0, ...] where p holds the numerator and q holds the denominator
## c) Load evaluations i.e. OOD values
padw dup.10 add.1 swap.11
#=> [v1, v0, u1, u0, q1, q0, p1, p0, zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
## d) Compute the 0-th term
### i) move up zeta to top
dup.9 dup.9
#=> [zeta1, zeta0, v1, v0, u1, u0, q1, q0, p1, p0, zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
### ii) Push the 0-th evaluation point xj
swap push.1 neg add swap
#=> [zeta1, zeta0 - xj0, v1, v0, u1, u0, q1, q0, p1, p0, zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
ext2inv
#=> [den1, den0, v1, v0, u1, u0, q1, q0, p1, p0, zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...] where (den0, den1) = (zeta0, zeta1) - (xj, 0)
### iii) Push the 0-th barycentric weight fj
push.8475095067679657783 push.0 ext2mul
#=> [fj1, fj0, v1, v0, u1, u0, q1, q0, p1, p0, zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...] where fj = wj/(zeta - xj)
### iv) add fj to p
movup.9 movup.9 dup.3 dup.3 ext2add movdn.9 movdn.9
#=> [fj1, fj0, v1, v0, u1, u0, q1, q0, p1', p0', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
### v) multiply fj by u and add to q
movup.5 movup.5 ext2mul movup.5 movup.5 ext2add movdn.3 movdn.3
#=> [v1, v0, q1', q0', p1', p0', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
## e) Compute the 1-st term
### i) Move up zeta to top
dup.7 dup.7
#=> [zeta1, zeta0, v1, v0, q1', q0', p1', p0', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
### ii) Push the 1-st evaluation point xj
swap push.1971462654193939361 neg add swap
#=> [zeta1, zeta0 - xj0, v1, v0, q1', q0', p1', p0', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
ext2inv
#=> [den1, den0, v1, v0, q1', q0', p1', p0', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...] where (den0, den1) = (zeta0, zeta1) - (xj, 0)
### iii) Push the 1-st barycentric weight wj
push.1813978751105035889 push.0 ext2mul
#=> [fj1, fj0, v1, v0, q1', q0', p1', p0', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...] where fj = wj/(zeta - xj)
### iv) Add fj to p
movup.7 movup.7 dup.3 dup.3 ext2add movdn.7 movdn.7
#=> [fj1, fj0, v1, v0, q1', q0', p1'', p0'', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
### v) multiply fj by v and add to q
ext2mul ext2add
#=> [q1'', q0'', p1'', p0'', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
### vi) update pointer
# Repeat c) d) and e) 10 times and finish with c) and d) each time with the appropriate set of constants
#=> [q1'', q0'', p1'', p0'', zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
# 4) Compute the barycentric evaluation i.e., p / q
movup.3 movup.3 ext2inv ext2mul
#=> [res_bar1, res_bar0, zeta1, zeta0, EVALS_PTR', COEFF_PTR, res1, res0, ...]
# 5) Assert equality
movup.6 assert_eq movup.5 assert_eq
# 6) Clean up the stack
dropw
#=> []
end
The above uses the fact that, for the barycentric evaluation formula, we can factor out the OOD point z and thus make the procedure, for a given trace length, parametric only over z. The following code illustrates this idea.
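As a rough illustration of the factorization (plain Python, not MASM; the nodes $x_j$ and weights $w_j$ stand in for the hardcoded constants, and all values are arbitrary): since $q(y) := p(z \cdot y)$ interpolates $(x_j, v_j)$ whenever $p$ interpolates $(z \cdot x_j, v_j)$, evaluating $p$ at $t$ reduces to a barycentric evaluation of $q$ at $\zeta = t/z$ over fixed nodes:

```python
# Evaluating the interpolant p of (z*x_j, v_j) at t via the second
# barycentric formula over the FIXED nodes x_j: q(y) := p(z*y)
# interpolates (x_j, v_j), so p(t) = q(t/z), and the nodes x_j and
# weights w_j = 1/prod_{m != j}(x_j - x_m) are independent of z.
P = 2**64 - 2**32 + 1  # Goldilocks prime
inv = lambda v: pow(v, P - 2, P)
import random
random.seed(2)

xs = [1, 2, 3, 4, 5]                  # fixed nodes x_j (hardcodable)
ws = []                               # barycentric weights (hardcodable)
for j, xj in enumerate(xs):
    d = 1
    for m, xm in enumerate(xs):
        if m != j:
            d = d * (xj - xm) % P
    ws.append(inv(d))

vs = [random.randrange(P) for _ in xs]   # values at the points z*x_j
z, t = 97, 12345
zeta = t * inv(z) % P                    # evaluate q at zeta = t/z
num = sum(w * inv(zeta - xj) % P * v for w, xj, v in zip(ws, xs, vs)) % P
den = sum(w * inv(zeta - xj) % P for w, xj in zip(ws, xs)) % P
p_t = num * inv(den) % P

def lagrange_eval(pts, at):
    """Direct Lagrange evaluation, as a cross-check."""
    total = 0
    for i, (xi, yi) in enumerate(pts):
        li = yi
        for j, (xj, _) in enumerate(pts):
            if j != i:
                li = li * (at - xj) % P * inv(xi - xj) % P
        total = (total + li) % P
    return total

assert p_t == lagrange_eval([(z * xj % P, v) for xj, v in zip(xs, vs)], t)
```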
One other point that was worth investigating relates to doing away with RCOMB2
. The motivation behind this is the potential simplifications to the auxiliary trace that might bring the number of auxiliary columns to just 2 with one of these columns having OOD evaluations at only $z$ and $g\cdot z$.
The following is an attempt at computing the coefficients of the polynomial corresponding to the auxiliary column that is being opened at $z$ and $g\cdot z$. This method is then used (see below) in computing the terms in the random linear combination associated with this column.
#! Compute the slope and offset of the line for a typical column in auxiliary trace
#!
#! we know:
#! 1. the pointer to OOD frame containing `t_i(z)` and `t_i(gz)` `ood_trace_ptr`
#! 2. the pointer to the current trace row `current_trace_row_ptr`
#! 3. the pointer to the OOD point `z` `z_ptr`
#! 4. the pointer to the trace domain generator `g` `trace_domain_gen_ptr`
#!
#! The equation of the line is: y = a.x + b with:
#!
#! 1. a = (a0, a1) := (t_i(z) - t_i(gz)) / (z - gz)
#! 2. b = (b0, b1) := (gt(z) - t(gz)) / (g - 1)
#!
#! Input: [offset, ...]
#! Output: [a1, a0, b1, b0, ...]
#!
#! Cycles: 64
export.compute_slope_bias
# 1) Load OOD evaluations
padw
movup.4 push.ood_trace_ptr add
mem_loadw
#=> [T11, T10, T01, T00, ...] where (T00, T01) = t_i(z) and (T10, T11) = t_i(gz)
# 2) Compute b
# Load g
push.trace_domain_gen_ptr
#=> [g, T11, T10, T01, T00, ...]
# Compute numerator of b
dup.4 dup.1 mul
#=> [g.T00, g, T11, T10, T01, T00, ...]
dup.4 dup.2 mul
#=> [g.T01, g.T00, g, T11, T10, T01, T00, ...]
dup.4 neg dup.4 neg
#=> [-T11, -T10, g.T01, g.T00, g, T11, T10, T01, T00, ...]
ext2add
#=> [num1, num0, g, T11, T10, T01, T00, ...] where num = (num0, num1) := gt(z) - t(gz)
movup.2 sub.1 inv mul
#=> [b1, b0, T11, T10, T01, T00, ...] where b = (b0, b1) := (gt(z) - t(gz)) / (g - 1)
# 3) Compute a
# Compute gz - z
padw push.Z_PTR
#=> [gz1, gz0, z1, z0, b1, b0, T11, T10, T01, T00, ...]
ext2sub
#=> [a_deno_1, a_deno_0, b1, b0, T11, T10, T01, T00, ...] where (a_deno_0, a_deno_1) := z - gz
# Compute t_i(z) - t_i(gz)
swapw ext2sub
#=> [a_numer_1, a_numer_0, a_deno_1, a_deno_0, b1, b0, ...] where (a_numer_0, a_numer_1) := t_i(z) - t_i(gz)
movup.3 movup.3 ext2inv ext2mul
#=> [a1, a0, b1, b0, ...] where (a0, a1) := (t_i(z) - t_i(gz)) / (z - gz)
end
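The two closed forms in the comment can be sanity-checked in plain Python (base field, arbitrary values; g stands in for the trace-domain generator):

```python
# Check the closed forms for the line y = a*x + b through (z, t(z))
# and (g*z, t(g*z)):
#   a = (t(z) - t(gz)) / (z - gz),  b = (g*t(z) - t(gz)) / (g - 1)
P = 2**64 - 2**32 + 1  # Goldilocks prime
inv = lambda v: pow(v, P - 2, P)

z, g = 1234567, 7        # g stands in for the trace-domain generator
tz, tgz = 111, 222       # t(z) and t(g*z), arbitrary
gz = g * z % P
a = (tz - tgz) * inv(z - gz) % P
b = (g * tz - tgz) % P * inv(g - 1) % P
assert (a * z + b) % P == tz     # line passes through (z, t(z))
assert (a * gz + b) % P == tgz   # line passes through (g*z, t(g*z))
```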
In addition to the terms related to the aforementioned 2 auxiliary trace columns, the terms related to the columns of the constraint composition polynomial are also currently computed using RCOMB2
. The following is a proposal for computing these terms without the helper instruction:
#! Computes the random linear combination of the terms related to the constraint composition polynomials.
#!
#! Input: [x, x_addr, c_addr, a_addr, ...]
#! Output: [x, final_result1, final_result0, ...]
#!
#! Cycles: ~340
export.rand_combine_constraint
movdn.3
#=> [x_addr, c_addr, a_addr, x, ...]
push.0 push.0
#=> [acc1, acc0, x_addr, c_addr, a_addr, x, ...]
repeat.4
padw dup.6 add.1 swap.7 mem_loadw
#=> [t11, t10, t01, t00, acc1, acc0, x_addr, c_addr, a_addr, x, ...] where (t00, t01) = C_i(x) and (t10, t11) = C_{i+1}(x)
padw dup.11 add.1 swap.12 mem_loadw
#=> [w11, w10, w01, w00, t11, t10, t01, t00, acc1, acc0, x_addr, c_addr, a_addr, x, ...] where (w00, w01) = C_i(z) and (w10, w11) = C_{i+1}(z)
movup.7 movup.7 movup.5 movup.5
#=> [w01, w00, t01, t00, w11, w10, t11, t10, acc1, acc0, x_addr, c_addr, a_addr, x, ...]
ext2sub
#=> [u1, u0, w11, w10, t11, t10, acc1, acc0, x_addr, c_addr, a_addr, x, ...] where (u0, u1) = C_i(x) - C_i(z)
movup.5 movup.5 movup.5 movup.5
#=> [w11, w10, t11, t10, u1, u0, acc1, acc0, x_addr, c_addr, a_addr, x, ...]
ext2sub
#=> [v1, v0, u1, u0, acc1, acc0, x_addr, c_addr, a_addr, x, ...] where (v0, v1) = C_{i+1}(x) - C_{i+1}(z)
padw dup.12 add.1 swap.13 mem_loadw
#=> [b1, b0, a1, a0, v1, v0, u1, u0, acc1, acc0, x_addr, c_addr, a_addr, x, ...] where (a0, a1) = alpha_i and (b0, b1) = alpha_{i+1}
movup.5 movup.5 ext2mul
#=> [(v*b)1, (v*b)0, a1, a0, u1, u0, acc1, acc0, x_addr, c_addr, a_addr, x, ...]
movdn.5 movdn.5
#=> [a1, a0, u1, u0, (v*b)1, (v*b)0, acc1, acc0, x_addr, c_addr, a_addr, x, ...]
ext2mul
#=> [(a*u)1, (a*u)0, (v*b)1, (v*b)0, acc1, acc0, x_addr, c_addr, a_addr, x, ...]
ext2add ext2add
#=> [acc1', acc0', x_addr, c_addr, a_addr, x, ...]
end
#=> [acc1', acc0', x_addr, c_addr, a_addr, x, ...]
movdn.4 movdn.4
#=> [x_addr, c_addr, a_addr, acc1', acc0', x, ...]
push.0 push.Z_PTR mem_loadw drop drop
#=> [z1, z0, acc1', acc0', x, ...]
neg swap neg dup.4 add
#=> [-z1, x-z0, acc1', acc0', x, ...]
ext2inv ext2mul movup.2
#=> [final_result1, final_result0, x, ...]
end
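For reference, the accumulation performed by the loop can be modeled in plain Python (base field instead of the quadratic extension, arbitrary values) as:

```python
# Model of the accumulation: acc = sum_i alpha_i * (C_i(x) - C_i(z)),
# processed two columns per iteration as in the repeat.4 loop, then
# divided by (x - z) at the end.
P = 2**64 - 2**32 + 1  # Goldilocks prime
inv = lambda v: pow(v, P - 2, P)
import random
random.seed(3)

k = 8                                          # 8 columns -> 4 loop iterations
Cx = [random.randrange(P) for _ in range(k)]   # C_i(x)
Cz = [random.randrange(P) for _ in range(k)]   # C_i(z)
al = [random.randrange(P) for _ in range(k)]   # alpha_i

x, z = 1001, 57
acc = 0
for i in range(0, k, 2):                   # two columns per iteration
    u = (Cx[i] - Cz[i]) % P                # C_i(x) - C_i(z)
    v = (Cx[i + 1] - Cz[i + 1]) % P        # C_{i+1}(x) - C_{i+1}(z)
    acc = (acc + al[i] * u + al[i + 1] * v) % P
result = acc * inv(x - z) % P

expected = sum(a * (tx - tz) for a, tx, tz in zip(al, Cx, Cz)) * inv(x - z) % P
assert result == expected
```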
In the case of the so-called Lagrange kernel, and assuming that we have already loaded and checked the coefficients of the polynomial $p_k$ for this column, the following is a proposal for how to combine the terms related to the first auxiliary column (opened at $z$ and $g\cdot z$) and the second auxiliary column (i.e., the Lagrange kernel column, opened at $z$, $g\cdot z$, $g^2\cdot z$, ..., $g^{\log_2(n) - 1}\cdot z$):
#! Input: [x, x_addr, c_addr, a_addr, ...]
#! Output: [res1, res0, ...]
#! where (res0, res1) = alpha_k * (t_k(x) - (a * x + b)) / ((x - z)*(x - gz))
#! + alpha_{k+1} * (t_{k+1}(x) - (sum_0^{nu + 1} c_i * x^i))/((x - z)*...*(x - zg^(2^(nu-1))))
#!
#! Cycles: 210 + log(trace_len) * 26
export.rand_combine_aux_multi_openings.2
# 1) Store x and the pointers (4 cycles)
loc_storew.0
# 2) Compute alpha_k *(t_k(x) - (a * x + b)) (64 cycles)
## a) setup the stack
### i) `ctr`: The number of coefficients is 4 since horner_eval works with batches of 4 coefficients
### and the next multiple of 4 from 2 is 4. `ctr` is not used until the second aux column as the number
### of evaluation points is dynamic and equal to log(trace_length)
push.TRACE_LENGTH_LOG_PTR mem_load push.4 add neg
#=> [ctr, x, x_addr, c_addr, a_addr, ...]
### ii) `c_addr`
movup.3
#=> [c_addr, ctr, x, x_addr, a_addr, ...]
### iii) `acc`
push.0.0
#=> [0, 0, c_addr, ctr, x, x_addr, a_addr, ...]
### iv) `x`
movup.4
#=> [x, 0, 0, c_addr, ctr, x_addr, a_addr, ...]
### v) prepare for `mstream`
movup.6 movup.6 push.0.0 padw
#=> [Y, Y, x, 0, 0, c_addr, ctr, ...] where Y are values that can be overwritten by mstream
## b) compute a * x + b
mstream horner_eval drop
#=> [Y, Y, x, acc1, acc0, c_addr + 2, ctr + 4, ...] where Y are values that can be overwritten by mstream
## c) compute t_k(x) - (a * x + b)
loc_loadw.0
#=> [x, x_addr, c_addr, a_addr, Y, x, acc1, acc0, c_addr + 2, ctr + 4, ...]
### i) load t_k(x)
push.0.0 movup.3 mem_loadw drop drop
#=> [t_k(x)1, t_k(x)0, a_addr, Y, x, acc1, acc0, c_addr + 2, ctr + 4, ...]
### ii) reset acc and move current acc value to top of stack
push.0 swap.7 push.0 swap.7
#=> [acc1, acc0, t_k(x)1, t_k(x)0, a_addr, Y, x, 0, 0, c_addr + 2, ctr + 4, ...]
### iii) perform subtraction
ext2sub
#=> [res1, res0, a_addr, Y, x, 0, 0, c_addr + 2, ctr + 4, ...] where (res0, res1) = t_k(x) - (a * x + b)
## d) multiply by alpha_k
### i) load alpha_k
padw movup.6 mem_loadw push.0.0 swapw
#=> [alpha_k1, alpha_k0, res1, res0, 0, 0, y, y, Y, x, 0, 0, c_addr + 2, ctr + 4, ...]
### ii) multiply and store result
ext2mul loc_storew.1
#=> [out_k1, out_k0, 0, 0, y, y, Y, x, 0, 0, c_addr + 2, ctr + 4, ...] where (out_k0, out_k1) = alpha_k *(t_k(x) - (a * x + b))
### iii) clean up stack for next term in random linear combination
drop drop
#=> [Y, Y, x, 0, 0, c_addr + 2, ctr + 4, ...]
# 3) Compute alpha_{k+1} * (t_{k+1}(x) - (sum_0^{nu - 1} c_i * x^i)) (45 + log(trace_len)/4)
## a) compute sum_0^{nu + 1} c_i * x^i
push.1
while.true
mstream horner_eval
end
#=> [Y, Y, x, acc1, acc0, c_addr + nu, 0, ...]
## b) compute t_{k+1}(x) - (sum_0^{nu + 1} c_i * x^i)
### i) get pointer to t_{k+1}(x) and store c_addr + nu
dropw loc_loadw.0
#=> [x, x_addr, c_addr, a_addr, x, acc1, acc0, c_addr + nu, 0, ...]
movup.7 swap.3 movdn.7 loc_storew.0
#=> [x, x_addr, c_addr + nu, a_addr, x, acc1, acc0, c_addr, 0, ...]
### ii) load t_k(x)
push.0.0 movup.3 mem_loadw
#=> [t_{k+1}(x)1, t_{k+1}(x)0, y, y, a_addr, x, acc1, acc0, c_addr, 0, ...]
### iii) subtract
movup.7 movup.7
#=> [acc1, acc0, t_{k+1}(x)1, t_{k+1}(x)0, y, y, a_addr, x, c_addr, 0, ...]
ext2sub
#=> [res1, res0, y, y, a_addr, x, c_addr, 0, ...] where (res0, res1) = t_{k+1}(x) - (sum_0^{nu + 1} c_i * x^i)
## c) compute alpha_{k+1} * res
### i) load alpha
swapw push.0 swap mem_loadw
#=> [alpha1, alpha0, y, y, res1, res0, y, y, ...] where (res0, res1) = t_{k+1}(x) - (sum_0^{nu + 1} c_i * x^i)
### ii) multiply
movup.5 movup.5 ext2mul
#=> [outkk1, outkk0, y, y, y, y, ...] where (outkk0, outkk1) = alpha_{k+1} *(t_{k+1}(x) - (sum_0^{nu + 1} c_i * x^i))
# 4) Compute prod_1^{nu - 1} (x - zg^(2^i)) (18 cycles + 26 * log(trace_len))
## a) load g (the trace domain generator) and compute g^2
push.trace_domain_gen_ptr mem_load dup mul
#=> [g^2, outkk1, outkk0, y, y, y, y, ...]
## b) load x and z
loc_load.0 swapw push.Z_PTR drop drop
#=> [z1, z0, x, g^2, outkk1, outkk0, ...]
## c) compute the counter for the while loop (TODO: make it dynamic)
## we have points g^(2^i) for i=1..(nu - 1) where nu = log(n). Taking nu=21, implies ctr is 20.
## TODO: make sure the nu - 1 is a multiple of 2
push.0 push.TRACE_LENGTH_LOG_PTR sub movdn.4
#=> [z1, z0, x, g^2, ctr, outkk1, outkk0, ...]
## d) create product accumulator
push.1 push.0
#=> [0, 1, z1, z0, x, g^2, ctr, outkk1, outkk0, ...]
## e) compute the product
push.1
while.true
#=> [0, 1, z1, z0, x, g^2, ctr]
dup.5 dup mul swap.6
#=> [g^2, 0, 1, z1, z0, x, g^4, ctr]
dup.4 dup.1 mul
#=> [z0g^2, g^2, 0, 1, z1, z0, x, g^4, ctr]
neg dup.6 add
#=> [x-z0g^2, g^2, 0, 1, z1, z0, x, g^4, ctr]
dup.4 movup.2 mul
#=> [z1g^2, x-z0g^2, 0, 1, z1, z0, x, g^4, ctr]
neg
#=> [-z1g^2, x-z0g^2, 0, 1, z1, z0, x, g^4, ctr]
ext2mul
#=> [acc1, acc0, z1, z0, x, g^4, ctr]
dup.5 add.1 swap.6 neq.0
end
#=> [acc1, acc0, z1, z0, x, g^(2^(nu-1)), 0, outkk1, outkk0, ...]
# 5) Compute (v0, v1) := alpha_{k+1} * (t_{k+1}(x) - (sum_0^{nu + 1} c_i * x^i))/((x - zg^2)*...*(x - zg^(2^(nu-1)))) (15 cycles)
ext2inv movup.8 movup.8 ext2mul
#=> [v1, v0, z1, z0, x, g^(2^(nu-1)), 0, ...]
# 6) Compute (x - z) * (x - gz) (26 cycles)
## a) compute x - z
dup.4 dup.4 sub
#=> [x - z0, v1, v0, z1, z0, x, g^(2^(nu-1)), 0, ...]
dup.3 neg
#=> [-z1, x - z0, v1, v0, z1, z0, x, g^(2^(nu-1)), 0, ...]
## b) compute x - gz
dup.6 dup.6 mem_load.trace_domain_gen_ptr mul sub
#=> [x - z0g, -z1, x - z0, v1, v0, z1, z0, x, g^(2^(nu-1)), 0, ...]
dup.5 mem_load.trace_domain_gen_ptr mul neg
#=> [-gz1, x - gz0, -z1, x - z0, v1, v0, z1, z0, x, g^(2^(nu-1)), 0, ...]
## c) compute product
ext2mul ext2inv
#=> [tmp1, tmp0, v1, v0, z1, z0, x, g^(2^(nu-1)), 0, ...] where (tmp0, tmp1) = ((x - z) * (x - gz))^(-1)
# 7) Compute the final result (34 cycles)
## a) load (out_k0, out_k1) = alpha_k *(t_k(x) - (a * x + b))
swapw loc_loadw.1
#=> [out_k1, out_k0, 0, 0, tmp1, tmp0, v1, v0, 0, ...]
## b) add terms before dividing by common divisor
movup.7 movup.7 ext2add
#=> [numr1, numr0, 0, 0, tmp1, tmp0, 0, ...]
## c) compute final result
movup.5 movup.5 ext2mul
#=> [final_res1, final_res0, 0, 0, 0, ...]
## d) load pointers, update and return
movdn.4 movdn.4 push.0 loc_loadw.0
#=> [x, x_addr, c_addr, a_addr, final_res1, final_res0, ...]
swap add.1 swap
#=> [x, x_addr + 1, c_addr, a_addr, final_res1, final_res0, ...]
swap.3 add.1 swap.3
#=> [x, x_addr + 1, c_addr, a_addr + 1, final_res1, final_res0, ...]
end
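The while loop in step 4 above computes $\prod_{i=1}^{\nu - 1}\left(x - z\cdot g^{2^i}\right)$ by repeatedly squaring the generator. As a sanity check, here is a hedged base-field Python sketch of the same recurrence (the toy modulus and all values are arbitrary illustrations; the MASM code works over the quadratic extension field):

```python
# Sketch of step 4: acc = prod_{i=1}^{nu-1} (x - z * g^(2^i)) mod p.
# A running square of g plays the role of the `dup.5 dup mul swap.6` step.
P = 2**31 - 1  # toy prime modulus (Miden itself uses 2^64 - 2^32 + 1)

def denominator_product(x, z, g, nu, p=P):
    acc = 1
    g_pow = g * g % p              # start at g^(2^1)
    for _ in range(1, nu):         # i = 1 .. nu - 1
        acc = acc * (x - z * g_pow) % p
        g_pow = g_pow * g_pow % p  # g^(2^i) -> g^(2^(i+1))
    return acc

# Cross-check against the direct formula.
x, z, g, nu = 12345, 6789, 7, 8
direct = 1
for i in range(1, nu):
    direct = direct * (x - z * pow(g, 2**i, P)) % P
assert denominator_product(x, z, g, nu) == direct
```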
Putting everything together, and assuming the non-deterministic computation of the coefficients of the polynomial associated to the Lagrange kernel happens before the call to `compute_deep_composition_polynomial_queries`, the procedure can be implemented as follows:
#! Compute the DEEP composition polynomial FRI queries.
#!
#! Input: [query_ptr, ...]
#! Output: [...]
#! Cycles: 6 + num_queries * (1000 + 26 * log(trace_len))
export.compute_deep_composition_polynomial_queries.1
exec.constants::fri_com_ptr
dup.1
#=>[query_ptr, query_end_ptr, ...]
push.1
while.true
# I)
#
# Load the (main, aux, constraint)-traces rows associated with the current query and get
# the index of the query.
#
# Cycles: 200
exec.load_query_row
#=>[index, query_ptr, query_end_ptr, ...]
# II)
#
# Compute x := offset * domain_gen^index and denominators (x - z) and (x - gz)
#
# Cycles: 71
exec.compute_denominators
loc_storew.0
#=> [Z, x, index, query_ptr, query_end_ptr, ...] where Z := [-gz1, x - gz0, -z1, x - z0]
# III)
#
# Prepare to compute the sum \sum_{i=0}^k{\left(\alpha_i \cdot \frac{T_i(x) - T_i(z)}{x - z}
# + \alpha_i \cdot \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g}
# We can factorize (x - z) and (x - gz) and divide the two sums only once and at the end.
# The two sums are stored in [Acc3, Acc2] and [Acc1, Acc0] respectively.
## a) Push pointers
##
## Cycles: 4
push.0
exec.constants::deep_rand_coef_ptr
exec.constants::ood_trace_ptr
exec.constants::current_trace_row_ptr
#=> [P, Z, x, index, query_ptr, query_end_ptr, ...]
# where P := [CURRENT_TRACE_ROW_PTR, OOD_TRACE_PTR, DEEP_RAND_CC_PTR, 0]
## b) Push the accumulators
##
## Cycles: 4
padw
#=> [Acc, P, Z, x, index, query_ptr, query_end_ptr, ...]
#=> where Acc =: [Acc3, Acc2, Acc1, Acc0]
## c) This will be used to mstream the elements T_i(x)
##
## Cycles: 11
padw movupw.3
#=> [Z, Y, Acc, P, x, index, query_ptr, query_end_ptr, ...]
## d) Compute the random linear combination for main trace columns
##
## Cycles: 81
exec.combine_main_trace_columns
#=> [Y, Y, Acc, P, x, index, query_ptr, query_end_ptr, ...]
## e) Compute the random linear combination for aux trace columns
##
## Cycles: 225 + log(trace_len) * 26
dropw dropw swapw swap drop push.COEFF_PTR swap movup.8
#=> [x, P, Acc, index, query_ptr, query_end_ptr, ...]
# where P := [CURRENT_TRACE_ROW_PTR, COEFF_PTR, DEEP_RAND_CC_PTR, 0]
exec.rand_combine_aux_multi_openings
#=> [x, P, aux_res1, aux_res0, Acc, index, query_ptr, query_end_ptr, ...]
## f) Compute the random linear combination for constraint composition columns and add to
## the term related to auxiliary trace columns.
##
## Cycles: 340
exec.rand_combine_constraint
#=> [x, const_comp_res1, const_comp_res0, aux_res1, aux_res0, Acc, index, query_ptr, query_end_ptr, ...]
movdn.8
ext2add
#=> [tmp_res1, tmp_res0, Acc, x, index, query_ptr, query_end_ptr, ...]
## g) Load the denominators (x - z) and (x - gz)
## Cycles: 10
movdn.5 movdn.5
padw loc_loadw swapw
#=> [Acc, Z, tmp_res1, tmp_res0, x, index, query_ptr, query_end_ptr, ...]
## h) Divide by denominators and sum to get final result
##
## Cycles: 38
exec.divide_by_denominators_and_sum
#=> [eval1, eval0, tmp_res1, tmp_res0, x, index, query_ptr, query_end_ptr, ...]
## i) Compute the final result
##
## Cycles: 5
ext2add
#=> [res1, res0, x, index, query_ptr, query_end_ptr, ...]
# IV)
#
# Store [poe, index, eval_1, eval_0] where poe := g^index = x / offset and prepare stack
# for next iteration.
## a) Compute poe
##
## Cycles: 4
movup.3 movup.3
exec.constants::domain_offset_inv mul
#=> [poe, index, eval1, eval0, query_ptr, query_end_ptr, ...]
## b) Store [eval0, eval1, index, poe]
##
## Cycles: 5
dup.4 add.1 swap.5
mem_storew
#=> [poe, index, eval1, eval0, query_ptr+1, query_end_ptr, ...]
## c) Prepare stack for next iteration
##
## Cycles: 8
dropw
dup.1 dup.1
neq
#=> [?, query_ptr+1, query_end_ptr, ...]
end
drop drop
end
To give a concrete example of the increase in costs observed using the above implementation, compare the cost of `compute_deep_composition_polynomial_queries` on its own with the cost of `compute_deep_composition_polynomial_queries` + `compute_slope_bias` + `check_transformation_21`.
Again, in the second bullet point, we do away with `RCOMB2` but introduce a 1-cycle instruction to do Horner evaluations.
Another potential thing that should be investigated is whether we can use `RCOMB2` (or a slight modification of it) instead of `RCOMB1` and do away with `RCOMB1` instead. This might incur some additional cost for main trace columns, but that might be justified by the savings we might get in relation to auxiliary and constraint composition columns (TBD).
The following is an attempt at improving the above and it relies on two ingredients:

- `RCOMB2` can be modified in order to be useful not only for terms coming from the auxiliary trace but also for the ones associated to the constraint composition polynomial.
- The denominator product appearing in the Lagrange kernel term can be computed more cheaply using Horner evaluation (described further below).
The `RCOMB2` instruction currently does the following:
$$(r, p) = (r, p) + \left(\alpha_i\cdot\left({T_i(x) - T_i(z)}\right), \alpha_i\cdot\left({T_i(x) - T_i(z \cdot g)} \right)\right)$$
Using `RCOMB2` for constraint composition columns is currently not straightforward and entails a non-negligible number of stack manipulations.
In order to make `RCOMB2` work for constraint composition columns without friction, we can update the instruction to compute instead
$$(r, p) = (r, p) + \left(\alpha_i\cdot\left({T_i(x) - T_i(z)}\right), (1 - b)\cdot\alpha_i\cdot\left({T_i(x) - T_i(z \cdot g)} \right)\right)$$
where $b$ is a binary flag that is set to $0$ for auxiliary trace columns and is set to $1$ for constraint composition columns.
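As a hedged illustration of the update rule above, here is a base-field Python sketch (the actual instruction operates on extension-field elements; the modulus and values below are arbitrary):

```python
# Sketch of the modified RCOMB2 update over a toy prime field.
# b = 0: auxiliary-trace column (both accumulators are updated).
# b = 1: constraint-composition column (second term is zeroed out).
P = 2**31 - 1  # toy modulus; the real instruction works over an extension field

def rcomb2(r, p, t_x, t_z, t_zg, alpha, b, modulus=P):
    r_new = (r + alpha * (t_x - t_z)) % modulus
    p_new = (p + (1 - b) * alpha * (t_x - t_zg)) % modulus
    return r_new, p_new

# With b = 1 the second accumulator is left untouched.
r1, p1 = rcomb2(0, 0, t_x=10, t_z=3, t_zg=4, alpha=5, b=1)
assert (r1, p1) == (35, 0)
```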
The price to pay for this adaptation is an increase in the degree of the constraint to $\geq 3$.

In the above code, we computed $$(x - z)\cdot\prod_{i=0}^{\log(n) - 1}\left(x - z\cdot g^{2^i} \right)$$
This required a `while` loop, and it was the main bottleneck when computing the random linear combination for a given $x$.
Here is how we can improve on that solution:
The above expression can be written as follows: $$z^{\log(n) + 1}(\frac{x}{z} - 1)\cdot\prod_{i=0}^{\log(n) - 1}\left(\frac{x}{z} - g^{2^i} \right)$$ which can be written more compactly as $$z^{\log(n) + 1}m\left(\frac{x}{z} \right)$$
where $$m(x) = (x - 1)\cdot\prod_{i=0}^{\log(n) - 1}\left(x - g^{2^i} \right)$$ Note that $m$ is a polynomial of degree $\log(n) + 1$ and thus can be written as $m(x) = \sum_{i=0}^{\log(n) + 1} c_i \cdot x^i$. Importantly, its coefficients are independent of $z$. This suggests the following procedure to compute our denominator:
- Load the coefficients $c_i$ of $m$ into a designated region of memory using `adv_pipe`. At the same time, compute the hash of these coefficients and compare this hash with the expected hash for the given $n$. This is done only once.

In conclusion, at the cost of a slight modification to `RCOMB1` and using a new, but versatile, instruction to do Horner evaluations, we can implement `compute_deep_composition_polynomial_queries` in approximately 610 cycles per FRI query, in addition to some modest computation that is query independent.
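The factorization underlying this procedure can be sanity-checked numerically. A hedged Python sketch over a toy prime field (values are arbitrary), verifying that $(x - z)\prod_{i=0}^{\nu - 1}\left(x - z g^{2^i}\right) = z^{\nu + 1}\, m(x/z)$:

```python
# Check: (x - z) * prod_{i=0}^{nu-1} (x - z*g^(2^i)) == z^(nu+1) * m(x/z),
# where m(t) = (t - 1) * prod_{i=0}^{nu-1} (t - g^(2^i)).
P = 2**31 - 1  # toy prime field; values below are arbitrary
x, z, g, nu = 11111, 2222, 3, 5

lhs = (x - z) % P
for i in range(nu):
    lhs = lhs * (x - z * pow(g, 2**i, P)) % P

t = x * pow(z, P - 2, P) % P  # x / z in the field
m = (t - 1) % P
for i in range(nu):
    m = m * (t - pow(g, 2**i, P)) % P
rhs = pow(z, nu + 1, P) * m % P

assert lhs == rhs  # the identity holds for any invertible z
```

The identity holds because $z^{\nu+1}$ distributes one factor of $z$ onto each of the $\nu + 1$ factors of $m(x/z)$.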
Here is a proposal for how to adapt `compute_deep_composition_polynomial_queries` in light of the most recent proposal:
#! Compute the DEEP composition polynomial FRI queries.
#!
#! Input: [query_ptr, ...]
#! Output: [...]
#! Cycles: 6 + num_queries * 610
export.compute_deep_composition_polynomial_queries
exec.constants::fri_com_ptr
dup.1
#=>[query_ptr, query_end_ptr, ...]
push.1
while.true
# I)
#
# Load the (main, aux, constraint)-traces rows associated with the current query and get
# the index of the query.
#
# Cycles: 200
exec.load_query_row
#=>[index, query_ptr, query_end_ptr, ...]
# II)
#
# Compute x := offset * domain_gen^index and denominators (x - z) and (x - gz)
#
# Cycles: 71
exec.compute_denominators
push.Z_MINUS_X_PTR mem_storew
#=> [Z, x, index, query_ptr, query_end_ptr, ...] where Z := [-gz1, x - gz0, -z1, x - z0]
# III)
#
# Prepare to compute the sum \sum_{i=0}^k{\left(\alpha_i \cdot \frac{T_i(x) - T_i(z)}{x - z}
# + \alpha_i \cdot \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g}
# We can factorize (x - z) and (x - gz) and divide the two sums only once and at the end.
# The two sums are stored in [Acc3, Acc2] and [Acc1, Acc0] respectively.
## a) Push pointers
##
## Cycles: 4
dup.4
exec.constants::deep_rand_coef_ptr
exec.constants::ood_trace_ptr
exec.constants::current_trace_row_ptr
#=> [P, Z, x, index, query_ptr, query_end_ptr, ...]
# where P := [CURRENT_TRACE_ROW_PTR, OOD_TRACE_PTR, DEEP_RAND_CC_PTR, x]
## b) Push the accumulators
##
## Cycles: 4
padw
#=> [Acc, P, Z, x, index, query_ptr, query_end_ptr, ...]
#=> where Acc =: [Acc3, Acc2, Acc1, Acc0]
## c) This will be used to mstream the elements T_i(x)
##
## Cycles: 11
padw movupw.3
#=> [Z, Y, Acc, P, x, index, query_ptr, query_end_ptr, ...]
## d) Compute the random linear combination for main trace columns
##
## Cycles: 81
exec.combine_main_trace_columns
#=> [Y, Y, Acc, P, x, index, query_ptr, query_end_ptr, ...]
## e) Compute the random linear combination for aux trace columns
##
## Cycles: 150
exec.combine_aux_trace_columns
#=> [Y, Y, Acc, P, x, index, query_ptr, query_end_ptr, ...]
## f) Constraint composition polys terms
##
## Cycles: 10
repeat.2
mstream
repeat.4
exec.combine_aux
end
end
#=> [Y, Y, Acc, P, x, index, query_ptr, query_end_ptr, ...]
## g) load k_th term
##
## Cycles: 3
push.K_TH_TERM_PTR mem_loadw swapw.3
#=> [P, Y, Acc, k_th_term1, k_th_term0, y, y, x, index, query_ptr, query_end_ptr, ...]
## h) load Z
##
## Cycles: 7
dropw
push.Z_MINUS_X_PTR mem_loadw
swapw
#=> [Acc, Z, k_th_term1, k_th_term0, y, y, x, index, query_ptr, query_end_ptr, ...]
## i) Divide by denominators and sum to get final result
##
## Cycles: 45
exec.divide_by_denominators_and_sum
#=> [eval_tmp1, eval_tmp0, k_th_term1, k_th_term0, y, y, x, index, query_ptr, query_end_ptr, ...]
ext2add
#=> [eval1, eval0, y, y, x, index, query_ptr, query_end_ptr, ...]
movup.3 drop
movup.3 drop
#=> [eval1, eval0, x, index, query_ptr, query_end_ptr, ...]
# IV)
#
# Store [poe, index, eval_1, eval_0] where poe := g^index = x / offset and prepare stack
# for next iteration.
## a) Compute poe
##
## Cycles: 4
movup.3 movup.3
exec.constants::domain_offset_inv mul
#=> [poe, index, eval1, eval0, query_ptr, query_end_ptr, ...]
## b) Store [eval0, eval1, index, poe]
##
## Cycles: 5
dup.4 add.1 swap.5
mem_storew
#=> [poe, index, eval1, eval0, query_ptr+1, query_end_ptr, ...]
## c) Prepare stack for next iteration
##
## Cycles: 8
dropw
dup.1 dup.1
neq
#=> [?, query_ptr+1, query_end_ptr, ...]
end
drop drop
end
# Input: [Y, Y, Acc, P, ...]
# Output: [Y, Y, Acc, P, ...]
# Cycles: 133 for 2^20 trace length and at most 150
export.combine_aux_trace_columns
mstream exec.random_combinate_2
#=> [Y, y, y, tk1, tk0, Acc, P, ...]
# where (tk0, tk1) is the value of the column which will be opened at >= 3 points
swapw drop drop
#=> [tk1, tk0, Y, Acc, P, ...]
### prepare stack for Horner evaluation of p_k at x
#### load the negative of the number of coefficients padded to the next multiple of 4
push.TRACE_LENGTH_PLUS_ONE_NEXT_MUL_4_LOG_PTR mem_load neg
push.P_K_COEF_PTR
#=> [c_addr, ctr, tk1, tk0, Y, Acc, P, ...]
#### push accumulator and set eval point
dup.15
push.0.0
swap.2
push.0
#=> [x1, x0, acc1, acc0, c_addr, ctr, tk1, tk0, Y, Acc, P, ...]
# where (x0, x1) = (x, 0)
#### pad for mem_stream
movupw.2 padw
#=> [Y, Y, x1, x0, acc1, acc0, c_addr, ctr, tk1, tk0, Acc, P, ...]
### evaluate
push.1
while.true
mem_stream exec.horner_eval
end
#=> [Y, Y, x1, x0, acc1, acc0, c_addr, ctr', tk1, tk0, Acc, P, ...]
### compute tk - p_k(x)
swapdw
#=> [x1, x0, acc1, acc0, c_addr, ctr', tk1, tk0, Y, Y, Acc, P, ...]
movup.7 neg movup.7 neg
#=> [-tk1, -tk0, x1, x0, acc1, acc0, c_addr, ctr', Y, Y, Acc, P, ...]
movup.5 movup.5
ext2add
#=> [numer1, numer0, x1, x0, c_addr, ctr', Y, Y, Acc, P, ...]
push.0.0
#=> [0, 0, numer1, numer0, x1, x0, c_addr, ctr', Y, Y, Acc, P, ...]
swapw
push.TRACE_LENGTH_PLUS_ONE_NEXT_MUL_4_LOG_PTR mem_load neg
swap.4 drop
#=> [x1, x0, c_addr, ctr, 0, 0, numer1, numer0, Y, Y, Acc, P, ...]
movup.5 movdn.2
movup.5 movdn.2
#=> [x1, x0, 0, 0, c_addr, ctr, numer1, numer0, Y, Y, Acc, P, ...]
### multiply x by z^(-1)
padw push.Z_INV_PTR mem_loadw drop drop
#=> [z_inv1, z_inv0, x1, x0, 0, 0, c_addr, ctr, numer1, numer0, Y, Y, Acc, P, ...]
ext2mul
#=> [x'1, x'0, 0, 0, c_addr, ctr, numer1, numer0, Y, Y, Acc, P, ...]
# where x' = x / z
### prepare stack for evaluating product
swapdw
#=> [Y, Y, x'1, x'0, 0, 0, c_addr, ctr, numer1, numer0, Acc, P, ...]
### evaluate
push.1
while.true
mem_stream exec.horner_eval
end
#=> [Y, Y, x'1, x'0, tmp_deno1, tmp_deno0, c_addr, ctr, numer1, numer0, Acc, P, ...]
swapdw
#=> [x'1, x'0, tmp_deno1, tmp_deno0, c_addr, ctr, numer1, numer0, Y, Y, Acc, P, ...]
movup.5 movup.5
#=> [c_addr, ctr, x'1, x'0, tmp_deno1, tmp_deno0, numer1, numer0, Y, Y, Acc, P, ...]
push.Z_POWER_LOG_TRACE_PLUS_1 mem_loadw drop drop
#=> [z_pow_1, z_pow_0, tmp_deno1, tmp_deno0, numer1, numer0, Y, Y, Acc, P, ...]
### compute denominator
ext2mul
#=> [deno1, deno0, numer1, numer0, Y, Y, Acc, P, ...]
### compute full term of k-th column
ext2div
#=> [k_th_term_tmp1, k_th_term_tmp0, Y, Y, Acc, P, ...]
### multiply with alpha_k
movup.3 drop movup.3 drop
#=> [k_th_term_tmp1, k_th_term_tmp0, y, y, Y, Acc, P, ...]
swapw
#=> [Y, k_th_term_tmp1, k_th_term_tmp0, y, y, Acc, P, ...]
swapw.3
#=> [P, k_th_term_tmp1, k_th_term_tmp0, y, y, Acc, Y, ...]
# where P := [CURRENT_TRACE_ROW_PTR, OOD_TRACE_PTR, DEEP_RAND_CC_PTR, 0]
### update pointers and set flag
add.1 movdn.3
push.TRACE_LENGTH_PLUS_ONE_NEXT_DIV_BY_TWO_LOG_PTR mem_load add movdn.3
add.1 movdn.3
drop push.1 movdn.3 # flag associated to constraint composition terms
#=> [P, k_th_term_tmp1, k_th_term_tmp0, y, y, Acc, Y, ...]
swapw.3
#=> [Y, k_th_term_tmp1, k_th_term_tmp0, y, y, Acc, P, ...]
### load alpha_k
dup.14 sub.1 mem_loadw
#=> [y, y, alpha_k1, alpha_k0, k_th_term_tmp1, k_th_term_tmp0, y, y, Acc, P, ...]
movdn.5 movdn.5
#=> [alpha_k1, alpha_k0, k_th_term_tmp1, k_th_term_tmp0, Y, Acc, P, ...]
ext2mul
#=> [k_th_term1, k_th_term0, Y, Acc, P, ...]
### store k_th term for later
push.K_TH_TERM_PTR mem_storew
#=> [k_th_term1, k_th_term0, Y, Acc, P, ...]
### setup stack to correct shape
push.0.0
#=> [Y, Y, Acc, P, ...]
end
Looks great! I didn't go through the procedures line-by-line - but the overall approach makes sense. To summarize:
The new expression for the DEEP composition polynomial looks something like this:
$$ Y(x) = \sum_{i=0}^k{\alpha_i \cdot \left(\frac{T_i(x) - T_i(z)}{x - z} + \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g}\right)} + \beta \cdot \sum_{i=0}^{\log(n)}{\frac{T_l(x) - T_l(z_i)}{x - z_i}} + \sum_{i=0}^m{\gamma_i \cdot \frac{H_i(x) - H_i(z)}{x - z}} $$
where $T_l(x)$ is the Lagrange kernel column (in the auxiliary trace), and the $z_i$'s come from the following sequence:
$$ z, z \cdot g, z \cdot g^2, z \cdot g^4, \dots, z \cdot g^{2^{\log(n) - 1}} $$
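For concreteness, this sequence can be generated by repeated squaring of $g$; a small Python sketch with arbitrary toy values:

```python
# Opening points for the Lagrange kernel column: z, z*g, z*g^2, z*g^4, ...
# i.e. z followed by z * g^(2^i) for i = 0 .. log(n) - 1 (log(n) + 1 points).
P = 2**31 - 1  # toy prime modulus

def lagrange_opening_points(z, g, log_n, p=P):
    points = [z % p]
    g_pow = g % p                   # g^(2^0)
    for _ in range(log_n):
        points.append(z * g_pow % p)
        g_pow = g_pow * g_pow % p   # square: g^(2^i) -> g^(2^(i+1))
    return points

assert lagrange_opening_points(z=7, g=3, log_n=3) == [7, 21, 63, 567]
```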
Adding the term for the Lagrange kernel column adds complexity (the fact that the total number of auxiliary columns goes down to 2 doesn't help us much here). The cost of this complexity is:
- A new 1-cycle Horner evaluation instruction (e.g., `heval` or maybe `horneval`). But we'd also need this instruction to help with sumcheck evaluation.

I think it may be possible to reduce the cost per FRI query to $\approx 500$ by introducing one more specialized instruction (specifically for computing the Lagrange kernel column term), but this is an optimization we can do later.
As things stand now, the overall cost of computing DEEP composition polynomial goes up by slightly more than 4K cycles (assuming 28 FRI queries), which I think is acceptable.
A few questions:

- What would be the semantics of the `horneval` instruction? In the above procedures, we use them in `while` loops, and I'm wondering if we can replace that with `repeat` loops.
- I like the modification to the `rcomb2` instruction - but as you noted, the degree goes up by 1. What would be the overall degree now?

What would be the semantics of the `horneval` instruction? In the above procedures, we use them in `while` loops, and I'm wondering if we can replace that with `repeat` loops.
Here is the proposal described here:

#! +-------+-------+-------+-------+-------+-------+-------+-------+------+------+------+------+------+------+------+---+
#! |  C01  |  C00  |  C11  |  C10  |  C21  |  C20  |  C31  |  C30  |  x1  |  x0  | acc1 | acc0 |c_addr| ctr  |  -   | - |
#! +-------+-------+-------+-------+-------+-------+-------+-------+------+------+------+------+------+------+------+---+
#!
#!                                                       ||
#!                                                       \/
#!
#! +---+-------+-------+-------+-------+-------+-------+-------+-------+------+------+-------+-------+------+-----+-----+
#! | b |  C01  |  C00  |  C11  |  C10  |  C21  |  C20  |  C31  |  C30  |  x1  |  x0  | acc1' | acc0' |c_addr| ctr'|  -  |
#! +---+-------+-------+-------+-------+-------+-------+-------+-------+------+------+-------+-------+------+-----+-----+
#!
#! where:
#!
#! 1. acc' = (acc1', acc0') := ((((acc * x + c3) * x + c2) * x + c1) * x) + c0
#! 2. ctr' = ctr + 4
#! 3. b = 0 if ctr' == 0 else 1
#!
#! Cycles: 1
export.horner_eval
end
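In base-field Python terms (the instruction itself works with extension-field limb pairs), one `horner_eval` step absorbs four coefficients, highest order first, into the running accumulator:

```python
# Sketch of one horner_eval step over a toy prime field: absorb four
# coefficients (c3 is highest order) into a running Horner accumulator.
P = 2**31 - 1  # toy modulus

def horner_step(acc, c3, c2, c1, c0, x, p=P):
    # acc' = ((((acc*x + c3)*x + c2)*x + c1)*x) + c0
    for c in (c3, c2, c1, c0):
        acc = (acc * x + c) % p
    return acc

# Absorbing all coefficients of 5x^3 + 4x^2 + 3x + 2 evaluates the polynomial.
x = 10
assert horner_step(0, 5, 4, 3, 2, x) == 5 * x**3 + 4 * x**2 + 3 * x + 2
```

Chaining these steps over successive memory words evaluates an arbitrary-degree polynomial four coefficients at a time.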
Regarding using a repeat instead of a while loop, it should be doable, as Horner's method can accommodate evaluating polynomials of degree less than an upper bound in a pretty straightforward manner. The only thing needed is to pad the high-order coefficients with zeros. This shouldn't complicate things too much as far as I can see. However, having an instruction that can support an arbitrary degree bound is preferable in my opinion. The Falcon signature, for example, would benefit from such an instruction.
I like the modification to the `rcomb2` instruction - but as you noted, the degree goes up by 1. What would be the overall degree now?
The modified instruction does the following $$(r, p) = (r, p) + \left(\alpha_i\cdot\left({T_i(x) - T_i(z)}\right), (1 - b)\cdot\alpha_i\cdot\left({T_i(x) - T_i(z \cdot g)} \right)\right)$$ I am counting degree-$3$ constraints here: $2$ (extension) field multiplications and $1$ (base) field multiplication by $1 - b$. The other thing that is missing is the memory consistency check between the values in the helper registers and the pointers sitting on the stack. As far as I can see, these shouldn't affect the count.
I think the last open question is how many extra stack cycles would the sumcheck (and other GKR-related) operations take. Do we have an estimate here?
We should be able to zoom in on a good estimate of that soon.
However, having an instruction that can support an arbitrary degree bound is preferable in my opinion. Falcon signature would benefit from such an instruction as an example.
I think we can accommodate arbitrary degree bounds with a `repeat` loop as well, as long as the bound is known at compile time (which I think is the case for Falcon signature?).
The main benefit is that as long as the actual values are not too far from the bound, using `repeat` should be quite a bit more efficient. For example:
repeat.8
mem_stream
horneval
end
Would be exactly 16 cycles, while:
while.true
mem_stream
horneval
end
assuming it terminates after 8 iterations, would be 32 cycles (as `while.true` and `end` instructions each require a cycle) and would also require 8 hashes to be computed in the hasher chiplet.
Btw, I noticed that as written the degree of the instruction would be 5, which is quite high. But we can reduce it by using some of the output slots for temp values.
However, having an instruction that can support an arbitrary degree bound is preferable in my opinion. Falcon signature would benefit from such an instruction as an example.

I think we can accommodate arbitrary degree bounds with a `repeat` loop as well, as long as the bound is known at compile time (which I think is the case for Falcon signature?).

Yes, indeed. It is actually much easier in the context of Falcon than in the context of random linear combinations. The degree is always 512.
The main benefit is that as long as the actual values are not too far from the bound, using `repeat` should be quite a bit more efficient. For example:

repeat.8
mem_stream
horneval
end
Would be exactly 16 cycles, while:
while.true
mem_stream
horneval
end

assuming it terminates after 8 iterations, would be 32 cycles (as `while.true` and `end` instructions each require a cycle) and would also require 8 hashes to be computed in the hasher chiplet.
Makes sense! Then for all practical trace lengths, using a repeat based on an upper bound should be more performant than the while-loop based solution.
Btw, I noticed that as written the degree of the instruction would be 5, which is quite high. But we can reduce it by using some of the output slots for temp values.
That's correct. I was thinking that we could do something like we did with the folding instruction where we use some of the stack values for degree reduction, as you just described.
Actually, I forgot to mention that for Falcon, we might gain some performance benefits from using Horner evaluation, as it will save around 10k cycles that are spent computing powers of $\tau$ and setting a region of memory to the all-zero polynomial.
One thing in relation to Falcon is that operations are in the base field. But I guess we can just pad words in the memory layout?
As far as I can see, the three polynomials are loaded from the advice tape and hence we can just load them as extension field elements. This will mean that some of the `repeat` blocks will double in length, but this is still more than compensated for by the gains described above.
Say we are working with a trace of length $2^{\nu}$ and we have $2^{\mu}$ fractions of the form $\frac{\mathsf{p}(i_0,\dots,i_{\mu - 1}, s_1(x_0,\dots,x_{\nu - 1}),\dots,s_n(x_0,\dots,x_{\nu - 1}))}{\mathsf{q}(i_0,\dots,i_{\mu - 1}, s_1(x_0,\dots,x_{\nu - 1}),\dots,s_n(x_0,\dots,x_{\nu - 1}))}$. Then we will have $$\rho := \mu + \nu$$ sum-check rounds in the final sum-check for the final GKR layer. We will also have $\rho$ rounds in total for all the sum-checks in the remaining GKR layers.
During each of the sum-check rounds in the last GKR layer, the (non-interactive) verifier receives the coefficients of a (univariate) polynomial of degree at most $$d := \max_{(i_0,\dots,i_{\mu - 1})\in\{0,1\}^{\mu}}\left(\mathsf{deg}_{x_i}(\mathsf{p}(i_0,\dots,i_{\mu - 1}, s_1(x_0,\dots,x_{\nu - 1}),\dots,s_n(x_0,\dots,x_{\nu - 1}))) + \mathsf{deg}_{x_i}(\mathsf{q}(i_0,\dots,i_{\mu - 1}, s_1(x_0,\dots,x_{\nu - 1}),\dots,s_n(x_0,\dots,x_{\nu - 1})))\right) + 1$$
The verifier then evaluates the received polynomial at the round challenge (using `mstream` and `horneval`).

For the remaining $\rho$ sum-check rounds, the verifier does exactly the same as above but works instead with polynomials of degree at most $3$.
Thus the amount of work in cycles that the verifier does in the rounds of sum-check leading to the final GKR check is $$\rho\cdot\left(10 + \frac{d}{2}\right) + \rho\cdot\left(10 + 1\right)$$
At the end of all the sum-check rounds, the verifier would have sampled $(r_0^{'},\dots,r_{\mu - 1}^{'}, r_0,\dots,r_{\nu - 1})$ during the last sum-check. The verifier now:
- checks that $\sum_{i=0}^{n - 1} \lambda^i s_i(x_0,\dots,x_{\nu - 1})$ opens to $v$ at $(r_0,\dots,r_{\nu - 1})$.

The cost of this step should be comparable to the current cost of constraint evaluation involving auxiliary columns.
During the sum-checks for the layers before the final layer, the verifier essentially receives $4$ values and reduces these to $1$ value using randomness.
The cost of these steps is basically one call to the hash function per GKR-layer (i.e., $\rho$ layers) and a few extension field operations (2 multiplications and 2 additions), say $20$ cycles.
The total estimated cost of the above, excluding the final GKR check, is approximately $$\rho\cdot\left(21 + \frac{d}{2}\right) + \rho\cdot\left(20 + 1\right)$$ i.e., $$\left(\mu + \nu \right)\cdot\left(42 + \frac{d}{2}\right)$$
For example, if $\mu = 6$, $\nu = 20$ and $d = 18$, we get a count of $1326$ cycles using our approximate estimate and without the cost of the final GKR check.
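The arithmetic behind this estimate can be checked directly:

```python
# Estimated sum-check verifier cost, excluding the final GKR check:
# (mu + nu) * (42 + d / 2), per the formula above.
mu, nu, d = 6, 20, 18
cycles = (mu + nu) * (42 + d // 2)
assert cycles == 1326
```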
Note that, for the final sum-check, some of the polynomials (the last $\mu$ ones) sent to the verifier are of degree at most $3$ and thus the core work performed by the verifier (as well as the prover) is concentrated at the last $\nu$ steps, in addition to the final evaluation check.
As a part of recursive proof verification, we need to compute random linear combinations of all trace columns for all queried values. Mathematically, this is described as:
$$ \sum_{i=0}^k{\left(\alpha_i \cdot \frac{T_i(x) - T_i(z)}{x - z} +\beta_i \cdot \frac{T_i(x) - T_i(z \cdot g)}{x - z \cdot g} \right)} $$
Where $T_i$ is a trace polynomial, $x$ is a query location, $z$ is the out-of-domain point, and $\alpha_i$ and $\beta_i$ are random coefficients.
Currently, we have 72 base field columns and 9 extension field columns. Thus, for each query we need to perform almost 200 subtractions, 200 multiplications, and 100 additions (I don't count divisions as they can be done just twice per query). If we do this naively, this will probably require a couple thousand cycles per query, and since we need about 27 queries, this computation alone could require 60K+ cycles.
However, we can introduce two new VM operations to speed this computation up dramatically. Let's say our goal is to do this in about 10K cycles. We'll need two operations: one to work with main trace columns (in the base field) and the other one to work with the auxiliary trace columns (in the extension field). We'll also need to modify the way in which we generate random values (more on this later). Below, I'm going to describe only the first operation (the one working with base field columns), as the second operation would work very similarly.
The basic idea is that we'd like to combine this new operation (let's call it `RCOMB1`) with the already existing `MSTREAM` operation. As a reminder, `MSTREAM` loads values $v_0, ..., v_7$, located in memory starting at the address $a$, onto the stack. Said another way, `MSTREAM` loads two words from memory starting at address $a$ and increments $a$ by two to point at the next word pair.
RCOMB1
could do is follow theMSTREAM
operation to compute the numerators of the above expression in a single cycle and add them into a running sum. Specifically, denoting running sum for the first term as $p$ and running sum of the second term as $r$, we want it to do the following:$$ p' = p + \alpha_i \cdot (T_i(x) - T_i(z)) $$
$$ r' = r + \beta_i \cdot (T_i(x) - T_i(z \cdot g)) $$
Once we do the above for all $i$, we can get the final result by computing:
$$ \frac{p}{x - z} + \frac{r}{x - z \cdot g} $$
And this can be computed using regular instructions.
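As a hedged base-field Python sketch of the whole scheme (the real computation is over the extension field; all values below are arbitrary), the running sums and the single final division per denominator can be checked against the direct sum:

```python
# Toy-field sketch of the RCOMB1 accumulation and the final division.
# T_x[i], T_z[i], T_zg[i] stand for T_i(x), T_i(z), T_i(z*g); alphas/betas
# are the random coefficients. All values are arbitrary illustrations.
P = 2**31 - 1
inv = lambda a: pow(a, P - 2, P)  # field inverse via Fermat's little theorem

def deep_query_value(T_x, T_z, T_zg, alphas, betas, x, z, zg, p=P):
    acc_p = acc_r = 0
    for tx, tz, tzg, a, b in zip(T_x, T_z, T_zg, alphas, betas):
        acc_p = (acc_p + a * (tx - tz)) % p   # p' = p + alpha_i*(T_i(x) - T_i(z))
        acc_r = (acc_r + b * (tx - tzg)) % p  # r' = r + beta_i*(T_i(x) - T_i(z*g))
    # Divide by each denominator only once, at the very end.
    return (acc_p * inv((x - z) % p) + acc_r * inv((x - zg) % p)) % p

T_x, T_z, T_zg = [5, 6], [1, 2], [3, 4]
alphas, betas = [7, 8], [9, 10]
x, z, zg = 100, 50, 60
direct = sum(
    (a * (tx - tz) * inv(x - z) + b * (tx - tzg) * inv(x - zg)) % P
    for tx, tz, tzg, a, b in zip(T_x, T_z, T_zg, alphas, betas)
) % P
assert deep_query_value(T_x, T_z, T_zg, alphas, betas, x, z, zg) == direct
```

Deferring the divisions is what makes the per-column update cheap enough to fit in a single cycle.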
Operation description
Let's assume that we use the `MSTREAM` operation to load 8 values of $T_i(x)$ onto the stack. We'd like to follow it up with 8 invocations of `RCOMB1` to "absorb" all 8 values into the running sums, and then we can repeat the process.

To make the above work, here is how we can define the `RCOMB1` operation. In its stack layout:
- `T0, ..., T7` are the $T_i(x)$ values loaded by calling `MSTREAM` prior to the `RCOMB1` invocations.
- `p0, p1` is the running sum $p$ (we need two elements since $p$ is in the extension field).
- `r0, r1` is the running sum $r$ (it is also in the extension field).
- `x_addr` is the memory address from which we are loading $T_i(x)$ values.
- `z_addr` is the memory address from which we are loading $T_i(z)$ and $T_i(z \cdot g)$ values.
- `Tz0, Tz1` and `Tzg0, Tzg1` are the $T_i(z)$ and $T_i(z \cdot g)$ values respectively.
- `a0, a1` is a random coefficient $\alpha$.
z_addr
into helper registers. This assumes that these values are stored in a single word. This is not how it works currently, but it shouldn't be too difficult to change this.z_addr
so that it points to the next word for the subsequent calls toRCOMB1
.T0
in the picture above). Notice that we use $\alpha_i^2$ here instead of $\beta_i$. Again, this is not how it works currently, but we should be able to change this as well.RCOMB1
it will be $3^8$. We should understand if this is an issue.T1
moves into the 8th stack slot to prepare for the next invocation ofRCOMB1
Given the above expressions, the max transition constraint degree for this operation should be 3.
Cycle estimate
Assuming the above works, we can expect the following costs per query:
Main trace: 72 columns can be processed in 9 batches of 8 columns each, where each batch takes 9 cycles (1 `MSTREAM` and 8 invocations of `RCOMB1`).

Overall, we get about 150 cycles per query, or about 4K cycles for 27 queries.
Auxiliary trace: Let's round up the number of auxiliary trace columns from 9 to 12 to be conservative. The costs would be 3 batches of 4 columns each, where each batch takes 5 cycles (1 `MSTREAM` and 4 invocations of `RCOMB2`).

Summing the above together for 27 queries, we get about 2.3K cycles.
So, for both main and auxiliary trace columns, we should be able to compute random linear combinations in under 7K cycles.