cfrg / draft-irtf-cfrg-vdaf

VDAF specification

Prio3: Soundness improvement for reductions #427

Open cjpatton opened 3 days ago

cjpatton commented 3 days ago

Prio3Sum, Prio3SumVec, Prio3Histogram, and Prio3MultihotCountVec all involve checking that each of $n$ "sub-checks" outputs $0$. This works by interpreting the outputs as coefficients of a polynomial $p$, then evaluating $p$ at a random point $r\in F$. If $p(r)=0$, then we deem the input valid. Some of us have been calling this the circuit's "reduction".
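For intuition, here is a plaintext sketch of this reduction (a hypothetical helper, not the spec's code). `P` stands in for the field modulus, the sub-check outputs are taken as the coefficients of $p$ (one common convention, with no constant term), and `r` is passed in so the example is deterministic:

```python
P = 2**61 - 1  # illustrative prime standing in for |F|; the real circuits use the spec's fields

def current_reduction(sub_checks: list[int], r: int) -> bool:
    """Interpret the sub-check outputs c_1..c_n as coefficients of
    p(x) = c_1*x + c_2*x^2 + ... + c_n*x^n and test whether p(r) == 0."""
    acc = 0
    for c in reversed(sub_checks):
        # Horner's rule, working from the highest coefficient down
        acc = (acc + c) * r % P
    return acc == 0  # deem the input valid iff p(r) == 0
```

If all sub-checks output $0$, then $p$ is the zero polynomial and the check always passes; otherwise it passes only when $r$ happens to be a root of $p$.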

Our current analysis of the circuit's soundness goes as follows. When the input is invalid, i.e., at least one sub-check output is non-zero, the probability that $p(r) = 0$ is the probability that $r$ is a root of $p$. Since $p$ has at most $\deg p$ roots, this probability is at most $\deg p / |F|$. The degree of $p$ is at most $n$, which gives a soundness error of $n / |F|$.

This bound also follows from the Schwartz-Zippel lemma. More generally, let $p$ be a polynomial in $n$ variables with at least one non-zero coefficient. When we evaluate $p$ at a random point $(r_1, \ldots, r_n) \in F^n$, the probability that $p(r_1, \ldots, r_n) = 0$ is at most $\deg p / |F|$, where $\deg p$ denotes the total degree of $p$.
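As a toy sanity check of the univariate case: a degree-$d$ polynomial has at most $d$ roots, so the failure probability over a random point is at most $d/|F|$. The sketch below counts the roots exhaustively over a hypothetical small prime field:

```python
P = 10_007  # small prime, for illustration only

def p(x):
    # p(x) = (x - 1)(x - 2)(x - 3) mod P: degree 3, exactly 3 roots in F
    return (x - 1) * (x - 2) * (x - 3) % P

# Count the points r where p(r) == 0; Schwartz-Zippel bounds this by deg(p) = 3.
roots = sum(1 for r in range(P) if p(r) == 0)
```

Here the bound $\deg p / |F| = 3/10007$ is met with equality, since the three roots are distinct.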

This lemma suggests an alternative reduction with better soundness. Let $c_1, \ldots, c_n$ denote the sub-check outputs, and let $p(x_1, \ldots, x_n) = c_1 \cdot x_1 + \cdots + c_n \cdot x_n$ be the polynomial in $n$ variables that has $c_1, \ldots, c_n$ as its coefficients. Choose a random vector $\vec{r} = (r_1, \ldots, r_n) \in F^n$ and evaluate $p$ at $\vec{r}$. If $p(\vec{r}) = 0$, then deem the input valid.

Soundness: since $\deg p = 1$, by the Schwartz-Zippel lemma the soundness error for this reduction is at most $1 / |F|$.
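The new reduction is just a random linear combination of the sub-check outputs; a plaintext sketch (hypothetical helper, with the random vector passed in for determinism):

```python
P = 2**61 - 1  # illustrative prime standing in for |F|

def new_reduction(sub_checks: list[int], rs: list[int]) -> bool:
    """Evaluate p(r_1, ..., r_n) = c_1*r_1 + ... + c_n*r_n over F and
    deem the input valid iff the result is zero."""
    assert len(rs) == len(sub_checks)
    return sum(c * r for c, r in zip(sub_checks, rs)) % P == 0
```

Note that each sub-check now consumes its own field element of joint randomness, rather than all sub-checks sharing powers of a single $r$.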

This is a pretty significant improvement, especially for larger inputs with lots of sub-checks. The base soundness of the current circuit is $\frac{n}{|F|} + \frac{2m}{|F|-m}$, where the second term is from {{BBCGGI19}}, Theorem 4.3; the soundness of the new circuit is $\frac{1}{|F|} + \frac{2m}{|F|-m}$. Concretely, for Prio3SumVec with `bits == 1`, `length = 100_000`, and `chunk_length = 316`, this drops from about $2^{-111}$ to $2^{-118}$:

```python
from math import log2

# Bits of soundness: -log2(n/|F| + 2m/(|F| - m)), where n is the number of
# sub-checks and m is the number of gadget calls.
def sound(n, m, f):
    return -int(log2(n / f + (2 * m) / (f - m)))
```

There is a catch: we end up with a much longer joint randomness, which means more time spent on XOF evaluation. On the other hand, we spend less time on finite field arithmetic, since we no longer need to compute powers of $r$. Let's benchmark this; if we don't see a performance degradation, then I think we should adopt this new reduction in each of the circuits, as well as for the built-in reduction for circuits with multiple outputs.

Credit to @rozbb for pointing out this use of Schwartz-Zippel and thanks to @bwesterb and @bhalleycf for confirming the math.

cjpatton commented 3 days ago

This does indeed cost us in terms of runtime: up to 10% for sharding and up to 5% for preparation:

```
[exp/sum_vec_alternative_reduction][~/github.com/cjpatton/libprio-rs]$ cargo bench -- sumvec
   Compiling prio v0.16.7 (/Users/christopherpatton/github.com/cjpatton/libprio-rs)
    Finished `bench` profile [optimized] target(s) in 2.64s
     Running benches/cycle_counts.rs (target/release/deps/cycle_counts-79a8c47b35862616)
Unexpected error while launching valgrind. Error: No such file or directory (os error 2)
     Running benches/speed_tests.rs (target/release/deps/speed_tests-54fc5ed5311a9071)
Gnuplot not found, using plotters backend
prio3sumvec_shard/serial/10
                        time:   [13.449 µs 13.458 µs 13.467 µs]
                        change: [+0.1918% +0.9733% +1.5465%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
prio3sumvec_shard/serial/100
                        time:   [45.367 µs 45.411 µs 45.457 µs]
                        change: [+8.7190% +8.9538% +9.1834%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
prio3sumvec_shard/serial/1000
                        time:   [656.54 µs 657.42 µs 658.73 µs]
                        change: [+9.8910% +10.270% +10.584%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe
prio3sumvec_shard/serial/100000
                        time:   [49.211 ms 49.280 ms 49.366 ms]
                        change: [+9.6781% +9.9782% +10.291%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  6 (6.00%) high mild
  2 (2.00%) high severe

prio3sumvec_prepare_init/serial/10
                        time:   [9.8698 µs 9.8849 µs 9.9031 µs]
                        change: [+0.4939% +0.7480% +0.9854%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 18 outliers among 100 measurements (18.00%)
  4 (4.00%) high mild
  14 (14.00%) high severe
prio3sumvec_prepare_init/serial/100
                        time:   [23.629 µs 23.643 µs 23.662 µs]
                        change: [+5.1860% +5.5778% +5.8787%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 12 outliers among 100 measurements (12.00%)
  5 (5.00%) high mild
  7 (7.00%) high severe
prio3sumvec_prepare_init/serial/1000
                        time:   [221.06 µs 221.24 µs 221.48 µs]
                        change: [+4.3432% +4.7894% +5.0854%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  4 (4.00%) high mild
  4 (4.00%) high severe
```

Branch with the modified SumVec circuit: https://github.com/cjpatton/libprio-rs/commit/789d5fea6b4e61f5124021725934c5a8c11d97ad

cjpatton commented 2 days ago

I think the improvement to soundness is worth the mild hit to performance.

cjpatton commented 1 day ago

Actually, there's another way to tune this.

The idea is that each call to the ParallelSum gadget gets its own fresh $r$, and within the gadget, we compute powers of $r$. That way the joint randomness length equals the number of gadget calls, and the total degree of the polynomial is at most the chunk length. When we choose the chunk length to be $\sqrt{n}$, we end up with a soundness bound of

$\frac{\sqrt{n}}{|F|} + \frac{2\sqrt{n}}{|F|-\sqrt{n}}$

(Assuming $n$ is a perfect square; otherwise the bound is numerically close.) The second term is still dominant, so this doesn't actually lose much in terms of bits of security. For bits == 1, length = 100_000, and chunk_length = 316, we still get $2^{-118}$.
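A plaintext sketch of this hybrid reduction (hypothetical helper; in the real circuit the per-chunk work happens inside the ParallelSum gadget, on secret shares). Each chunk gets its own fresh randomness `r`, so `rs` has one entry per gadget call:

```python
P = 2**61 - 1  # illustrative prime standing in for |F|

def chunked_reduction(sub_checks: list[int], rs: list[int], chunk_length: int) -> bool:
    """One fresh r per chunk (i.e., per gadget call); within a chunk,
    combine the sub-checks with powers of that chunk's r, then sum the
    per-chunk results and test for zero."""
    chunks = [sub_checks[i:i + chunk_length]
              for i in range(0, len(sub_checks), chunk_length)]
    assert len(rs) == len(chunks)
    total = 0
    for chunk, r in zip(chunks, rs):
        acc = 0
        for c in reversed(chunk):
            acc = (acc + c) * r % P  # Horner's rule within the chunk
        total = (total + acc) % P
    return total == 0
```

With chunk length $\sqrt{n}$ this needs only $\sqrt{n}$ joint randomness elements, and the polynomial being tested has total degree at most $\sqrt{n}$, matching the bound above.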

This doesn't result in a noticeable performance difference for our benchmarks: https://github.com/cjpatton/libprio-rs/commit/0540c24e522bb7cb00c5dff8c28b20a45f2375