Initial constant-time stack-allocated Bernstein-Yang

tarcieri commented 1 month ago

The previous implementation runs in variable-time with respect to g. However in the event both inputs are secret a fully constant-time implementation is required.

This implements the method described in section 11 of https://eprint.iacr.org/2019/266.pdf and more specifically this Python code from Figure 11.1:

from divsteps2 import divsteps2

def iterations(d):
      return (49*d+80)//17 if d<46 else (49*d+57)//17

def gcd2(f,g):
      assert f & 1
      d = max(f.nbits(),g.nbits())
      m = iterations(d)
      delta, fm, gm, P = divsteps2 (m,m+d,1,f,g)
      return abs(fm)

def recip2(f,g):
      assert f & 1
      d = max(f.nbits(),g.nbits())
      m = iterations(d)
      precomp = Integers(f)((f+1)/2)^(m−1)
      delta, fm, gm, P = divsteps2(m,m+1,1,f,g)
      V = sign(fm)*ZZ(P[0][1]*2^(m-1))
      return ZZ(V*precomp)

Instead of bounding the loop on g reaching zero, this instead computes a fixed number of iterations relative to the highest bit of either f or g after which the algorithm will converge, then runs for that number of iterations instead.

This results in about a 22X performance impact:

Bernstein-Yang invert, U256
        time:   [36.934 µs 37.151 µs 37.391 µs]
        change: [+2256.7% +2267.7% +2278.4%] (p = 0.00 < 0.05)
        Performance has regressed.

The previous implementation which is variable-time with respect to g is preserved as well, for now as gcd_vartime, but it would also be nice to add an inv_mod_vartime as well.

tarcieri commented 1 month ago

Aside from a similar change to the Boxed* implementation, I think this is the last bit of timing variability needed to address #627

tarcieri commented 1 month ago

Note: there's an alternative strategy for computing the upper bounds here which we should investigate https://github.com/sipa/safegcd-bounds

fjarri commented 1 month ago

I wonder if an implementation of gcd_vartime that just uses rem_vartime would be faster or not.

tarcieri commented 1 month ago

I should probably add benchmarks for gcd_vartime as well

tarcieri commented 1 month ago

Added some benchmarks in f2a5aed.

For reference:

greatest common divisor/gcd, U256
                        time:   [40.901 µs 41.114 µs 41.391 µs]

greatest common divisor/gcd_vartime, U256
                        time:   [1.5651 µs 1.5714 µs 1.5796 µs]

wrapping ops/div/rem_vartime, U256/U128, full size
                        time:   [87.533 ns 87.881 ns 88.262 ns]

wrapping ops/rem_vartime, U256/U128, full size
                        time:   [87.768 ns 88.654 ns 90.059 ns]

@fjarri I guess the idea would be to use rem_vartime to implement Euclid's method? But what's interesting about a gcd_vartime or inv_mod_vartime using Bernstein-Yang is it's constant time with respect to f (but not to g). I'm not sure you can implement Euclid's algorithm in such a manner.

fjarri commented 1 month ago

That's true, but an implementation vartime in both arguments is useful too, so I wonder if removing the restriction on the first argument leads to a noticeable performance gain.

tarcieri commented 1 month ago

@fjarri I guess we could potentially have all three options, but I'm not sure about naming.

One question I'd have is what is the use case for a fully variable time GCD in cryptographic algorithms.

fjarri commented 1 month ago

I tried out a Euclidean algorithm implementation with rem_vartime(), and it seems to be significantly slower than the current gcd_vartime() (about 10x).

RustCrypto / crypto-bigint

Initial constant-time stack-allocated Bernstein-Yang #632