drahnr / rs-ec-perf


Arithmetic in GF(2^16) #7

Open burdges opened 3 years ago

burdges commented 3 years ago

We describe some alternatives for arithmetic in GF(2^16) for use at different stages in FFTs and in the formal derivative. We currently construct GF(2^16) directly using 128k LOG and EXP tables, which we inherited from the GF(2^8) case.

Aside from the field implementation, we've three more large tables: the skew factors used in the FFT, the B table used in preparing and finishing the formal derivative, and the Walsh transform table used in preparing the error locator polynomial. We inherited all three additional tables stored in log form for faster arithmetic, so all three change if we change the field representation.

We only compute the error locator polynomial once per reconstruction, which makes it much less important. We therefore explore how representing the other two tables interacts with the multiplication.

As an aside, there exist PCLMUL designs ala #1 that do GF(2^2)[X] multiplications and then reduce. We've anecdotal reports these fare poorly compared with other approaches, and our one experiment fared poorly, but perhaps we never tried squeezing two field elements into the 64 bit width. We expect these still prove suboptimal compared with other approaches, since the three additional tables can be represented in log form.

Instead we'll construct GF(2^8)[X]/p(X) where p(X) = X^2 + omega X + 1 presumably with omega chosen like in section 3.1.3 page 34 of Ming-Shing Chen's thesis. Any such p(X) has all terms non-zero because the Frobenius is an automorphism. We're left with three choices for GF(2^8) multiplication:

  1. two 256 byte LOG and EXP tables for GF(2^8),
  2. one multiplication table for GF(2^4), and
  3. one 64k 256-by-256 multiplication table for GF(2^8).

We achieve choice 3 in three 64k table lookups using Karatsuba. We cannot use Karatsuba if using LOG tables, which rules out choices like one 64k 256-by-256 multiplication table for GF(2^8) with one side in LOG form.
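A minimal Rust sketch of choice 3. The GF(2^8) modulus 0x11d and the extension constant B = 2 are placeholders not fixed by the discussion, and irreducibility of T^2 + B T + 1 is not checked here; the point is only the three 64k lookups plus the constant-B reduction.

```rust
const POLY: u16 = 0x11d; // X^8 + X^4 + X^3 + X^2 + 1, a common GF(2^8) modulus (placeholder)
const B: u8 = 2;         // placeholder extension constant for T^2 = B T + 1

// Bitwise ("Russian peasant") GF(2^8) multiply, used only to fill the table.
fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
    let mut r = 0u8;
    while b != 0 {
        if b & 1 != 0 { r ^= a; }
        let carry = a & 0x80 != 0;
        a <<= 1;
        if carry { a ^= (POLY & 0xff) as u8; }
        b >>= 1;
    }
    r
}

// The 64k 256-by-256 multiplication table.
fn mul_table() -> Vec<[u8; 256]> {
    let mut t = vec![[0u8; 256]; 256];
    for a in 0..256usize {
        for b in 0..256usize {
            t[a][b] = gf256_mul(a as u8, b as u8);
        }
    }
    t
}

// (x T + y)(u T + v) in GF(2^8)[T]/(T^2 + B T + 1) via Karatsuba:
// three 64k-table lookups for the products, plus one multiplication by the
// constant B for the T^2 = B T + 1 reduction.
fn mul_karatsuba(t: &[[u8; 256]], x: u8, y: u8, u: u8, v: u8) -> (u8, u8) {
    let m2 = t[x as usize][u as usize];                 // x u
    let m0 = t[y as usize][v as usize];                 // y v
    let m1 = t[(x ^ y) as usize][(u ^ v) as usize];     // (x + y)(u + v)
    let hi = m1 ^ m0 ^ m2 ^ t[B as usize][m2 as usize]; // x v + y u + B x u
    let lo = m0 ^ m2;                                   // x u + y v
    (hi, lo)
}
```

The reduction multiply by the constant B could instead come from a dedicated 256-entry times-B table, keeping the per-multiplication cost at three 64k lookups plus one small lookup.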

We write out a Karatsuba-free multiplication for 1 like

(x T + y) (u T + v)
 =  x u T^2 + (x v + y u) T + y v
 = (x (b u + v) + y u) T + (x u + y v)   where T^2 = b T + 1 ala section 3.1.3 page 34 of Ming-Shing Chen's thesis

We then decompose this multiplication into two functions

prepare(u T + v) -> (v_log, u_log, w_plus_v_log)
where
   v_log = LOG[v]
   u_log = LOG[u]
   w = b u    // w = EXP[b_log + u_log]
   w_plus_v_log = LOG[w ^ v]

mul_prepared( (x T + y), (v_log, u_log, w_plus_v_log) )  -> z_hi T + z_lo
where
   x_log = LOG[x]
   y_log = LOG[y]
   z_hi = EXP[x_log + w_plus_v_log] ^ EXP[y_log + u_log]   // x (b u + v) + y u
   z_lo = EXP[x_log + u_log] ^ EXP[y_log + v_log]          // x u + y v

We're therefore expecting six lookups from two 256 byte tables plus two additions, although this becomes nine without preparation.
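A minimal Rust sketch of this prepared multiplication, again with a placeholder modulus 0x11d (whose multiplicative group is generated by 2) and placeholder b = 2. Zero operands are not handled: LOG[0] is undefined, so a real implementation needs a zero check or a sentinel log value.

```rust
const POLY: u16 = 0x11d; // placeholder GF(2^8) modulus; 2 generates its multiplicative group
const B: u8 = 2;         // placeholder extension constant for T^2 = B T + 1

// 256 byte LOG table and a doubled EXP table, so sums of logs need no mod 255.
fn log_exp_tables() -> ([u8; 256], [u8; 512]) {
    let (mut log, mut exp) = ([0u8; 256], [0u8; 512]);
    let mut a: u16 = 1;
    for i in 0..255usize {
        exp[i] = a as u8;
        exp[i + 255] = a as u8;
        log[a as usize] = i as u8;
        a <<= 1;
        if a & 0x100 != 0 { a ^= POLY; }
    }
    (log, exp)
}

// prepare(u T + v) -> (v_log, u_log, w_plus_v_log): three extra lookups.
// Caution: u, v, and B*u ^ v must all be nonzero here.
fn prepare(u: u8, v: u8, log: &[u8; 256], exp: &[u8; 512]) -> (u8, u8, u8) {
    let u_log = log[u as usize];
    let v_log = log[v as usize];
    let w = exp[log[B as usize] as usize + u_log as usize]; // w = B u
    (v_log, u_log, log[(w ^ v) as usize])
}

// mul_prepared((x T + y), prepared) -> (hi, lo) with
//   hi = x (B u + v) + y u   and   lo = x u + y v
fn mul_prepared(x: u8, y: u8, p: (u8, u8, u8),
                log: &[u8; 256], exp: &[u8; 512]) -> (u8, u8) {
    let (v_log, u_log, w_plus_v_log) = p;
    let (x_log, y_log) = (log[x as usize] as usize, log[y as usize] as usize);
    let hi = exp[x_log + w_plus_v_log as usize] ^ exp[y_log + u_log as usize];
    let lo = exp[x_log + u_log as usize] ^ exp[y_log + v_log as usize];
    (hi, lo)
}
```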

We learned about Intel's GFNI instruction from Ming-Shing Chen, so we'll need to select a compatible field representation, which influences the b usable above.

We'll compare with Karatsuba for GF(((2^4)^2)^2) = GF((2^4)^2)[T] / p(T), where again p(T) = T^2 + b T + 1 ala section 3.1.3 page 34 of Ming-Shing Chen's thesis:

(x T + y) (u T + v)
 =  x u T^2 + (x v + y u) T + y v
 =  x u T^2 + ((x+y)(u+v) + x u + y v) T + y v

This involves three Karatsuba multiplications, each of which requires three lookups from one 256 byte table, so nine lookups total, but all from a single 256 byte table and with no log additions. As a rule, one should recurse on Karatsuba, but this already gives the asymptotic n^{log_2 3}, so we're not sure we gain anything.
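A Rust sketch of this nested construction, with placeholder constants for both quadratic extensions; irreducibility of the placeholder moduli is not checked, and is not needed for verifying that Karatsuba agrees with the schoolbook product in the quotient ring.

```rust
// GF(2^4) = GF(2)[X]/(X^4 + X + 1); gf16_mul stands in for the 256-entry table.
fn gf16_mul(mut a: u8, mut b: u8) -> u8 {
    let mut r = 0u8;
    while b != 0 {
        if b & 1 != 0 { r ^= a; }
        a <<= 1;
        if a & 0x10 != 0 { a ^= 0x13; }
        b >>= 1;
    }
    r
}

const B: u8 = 2;      // placeholder: S^2 = B S + 1 for GF((2^4)^2); irreducibility unchecked
type F256 = (u8, u8); // x1 * S + x0 with x1, x0 in GF(2^4)

// One GF((2^4)^2) multiply via Karatsuba: three table lookups plus the constant-B reduction.
fn f256_mul((x1, x0): F256, (y1, y0): F256) -> F256 {
    let m2 = gf16_mul(x1, y1);
    let m0 = gf16_mul(x0, y0);
    let m1 = gf16_mul(x1 ^ x0, y1 ^ y0);
    (m1 ^ m0 ^ m2 ^ gf16_mul(B, m2), m0 ^ m2)
}

fn f256_add(a: F256, b: F256) -> F256 { (a.0 ^ b.0, a.1 ^ b.1) }

const C: F256 = (0, 2); // placeholder: T^2 = C T + 1 for GF(((2^4)^2)^2); irreducibility unchecked

// GF(2^16) multiply: three Karatsuba multiplications in GF((2^4)^2),
// i.e. nine GF(2^4) product lookups, plus the constant-C reduction.
fn f65536_mul((a1, a0): (F256, F256), (b1, b0): (F256, F256)) -> (F256, F256) {
    let m2 = f256_mul(a1, b1);
    let m0 = f256_mul(a0, b0);
    let m1 = f256_mul(f256_add(a1, a0), f256_add(b1, b0));
    let hi = f256_add(f256_add(m1, f256_add(m0, m2)), f256_mul(C, m2));
    let lo = f256_add(m0, m2);
    (hi, lo)
}
```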


Unfinished below this line:

(x_3 T^3 + x_2 T^2 + x_1 T + x_0) (y_3 T^3 + y_2 T^2 + y_1 T + y_0)
 = x_3 y_3 T^6
 + (x_3 y_2 + x_2 y_3) T^5
 + (x_3 y_1 + x_2 y_2 + x_1 y_3) T^4
 + (x_3 y_0 + x_2 y_1 + x_1 y_2 + x_0 y_3) T^3
 + (x_2 y_0 + x_1 y_1 + x_0 y_2) T^2
 + (x_1 y_0 + x_0 y_1) T
 + x_0 y_0

Assuming we compute x_1 y_1 and x_2 y_2, if we also compute (x_1 + x_0)(y_1 + y_0) and (x_3 + x_2)(y_3 + y_2) then we can compute x_1 y_0 + x_0 y_1 = (x_1 + x_0)(y_1 + y_0) + x_0 y_0 + x_1 y_1 and x_3 y_2 + x_2 y_3 = (x_3 + x_2)(y_3 + y_2) + x_2 y_2 + x_3 y_3 too.

devillegna commented 3 years ago

For the last unfinished computation, please refer to section "Two-level seven-way recursion" in "Batch binary Edwards" by Daniel J. Bernstein.

burdges commented 3 years ago

We should explore doing two degree two extensions of GF(2^4) via log prepared multiplications because afaik we cannot benefit from Karatsuba after doing the lower layer using 16 byte log tables.

We start with two degree two extensions like

[(a_3 x + a_2) y + (a_1 x + a_0)] [(b_3 x + b_2) y + (b_1 x + b_0)]
= (a_3 x + a_2) (b_3 x + b_2) y^2
 + [(a_3 x + a_2) (b_1 x + b_0) + (b_3 x + b_2) (a_1 x + a_0)] y
 + (a_1 x + a_0) (b_1 x + b_0)

We then assume y^2 = (alpha_1 x + alpha_0) y + 1 so that

= [(a_3 x + a_2) (b_1 x + b_0) + (a_1 x + a_0) (b_3 x + b_2) + (a_3 x + a_2) (alpha_1 x + alpha_0) (b_3 x + b_2)] y
 + [(a_1 x + a_0) (b_1 x + b_0) + (a_3 x + a_2) (b_3 x + b_2) ] 

so this amounts to four direct prepared multiplications using 16 byte GF(2^4) log tables and one messy term. Any one of these looks like above, so

(u_1 x + u_0) (v_1 x + v_0)
 = u_1 v_1 x^2 + (u_1 v_0 + u_0 v_1) x + u_0 v_0
 = (u_1 v_0 + u_0 v_1 + u_1 beta v_1) x + (u_0 v_0 + u_1 v_1)
 = (u_1 (beta v_1 + v_0) + u_0 v_1) x + (u_0 v_0 + u_1 v_1)

after applying x^2 = beta x + 1.
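This grouped form can be checked exhaustively over GF(2^4), for every beta and every operand pair, with a bitwise multiply standing in for the 16 byte log tables; the identity holds in the quotient ring whether or not x^2 + beta x + 1 is irreducible.

```rust
// GF(2^4) = GF(2)[X]/(X^4 + X + 1)
fn gf16_mul(mut a: u8, mut b: u8) -> u8 {
    let mut r = 0u8;
    while b != 0 {
        if b & 1 != 0 { r ^= a; }
        a <<= 1;
        if a & 0x10 != 0 { a ^= 0x13; }
        b >>= 1;
    }
    r
}

// Check that, modulo x^2 = beta x + 1,
//   (u1 x + u0)(v1 x + v0) = (u1 (beta v1 + v0) + u0 v1) x + (u0 v0 + u1 v1)
// for every beta and every pair of operands.
fn check_identity() -> bool {
    for beta in 0..16u8 {
        for u1 in 0..16u8 {
            for u0 in 0..16u8 {
                for v1 in 0..16u8 {
                    for v0 in 0..16u8 {
                        // schoolbook product, then reduce x^2 = beta x + 1
                        let c2 = gf16_mul(u1, v1);
                        let c1 = gf16_mul(u1, v0) ^ gf16_mul(u0, v1) ^ gf16_mul(beta, c2);
                        let c0 = gf16_mul(u0, v0) ^ c2;
                        // grouped form from the text
                        let g1 = gf16_mul(u1, gf16_mul(beta, v1) ^ v0) ^ gf16_mul(u0, v1);
                        let g0 = gf16_mul(u0, v0) ^ gf16_mul(u1, v1);
                        if (c1, c0) != (g1, g0) { return false; }
                    }
                }
            }
        }
    }
    true
}
```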

We now expand this messy term like

(a_3 x + a_2) (alpha_1 x + alpha_0) (b_3 x + b_2)
= a_3 alpha_1 b_3 x^3 
 + [a_2 alpha_1 b_3 + a_3 (alpha_0 b_3 + alpha_1 b_2)] x^2
 + [a_3 alpha_0 b_2 + a_2 (alpha_1 b_2 + alpha_0 b_3)] x
 + a_2 alpha_0 b_2

We also assume x^2 = beta x + 1 so that

= [a_2 alpha_1 b_3 + a_3 (alpha_0 b_3 + alpha_1 b_2) + a_3 beta alpha_1 b_3] x^2
 + [a_3 alpha_0 b_2 + a_2 (alpha_1 b_2 + alpha_0 b_3) + a_3 alpha_1 b_3] x
 + a_2 alpha_0 b_2
= [a_2 alpha_1 b_3 + a_3 ((alpha_0 + beta alpha_1) b_3 + alpha_1 b_2)] x^2
 + [a_3 (alpha_0 b_2 + alpha_1 b_3) + a_2 (alpha_1 b_2 + alpha_0 b_3)] x
 + a_2 alpha_0 b_2
= [a_2 beta alpha_1 b_3 + a_3 beta ((alpha_0 + beta alpha_1) b_3 + alpha_1 b_2)] x
 + [a_3 (alpha_0 b_2 + alpha_1 b_3) + a_2 (alpha_1 b_2 + alpha_0 b_3)] x
 + [a_2 alpha_0 b_2 + a_2 alpha_1 b_3 + a_3 ((alpha_0 + beta alpha_1) b_3 + alpha_1 b_2)]
= [ a_3 ((beta alpha_1 + alpha_0) b_2 + (beta alpha_0 + beta^2 alpha_1 + alpha_1) b_3)
  + a_2 (alpha_1 b_2 + (alpha_0 + beta alpha_1) b_3) ] x
 + [ a_2 (alpha_0 b_2 + alpha_1 b_3)
   + a_3 ((alpha_0 + beta alpha_1) b_3 + alpha_1 b_2) ]

We know

(a_3 x + a_2) (b_1 x + b_0)
 = (a_3 (b_0 + beta b_1) + a_2 b_1) x + (a_2 b_0 + a_3 b_1)

so we conclude the y component of [(a_3 x + a_2) y + (a_1 x + a_0)] [(b_3 x + b_2) y + (b_1 x + b_0)] turns out to be (a_1 x + a_0) (b_3 x + b_2) plus

[ a_3 ((beta alpha_1 + alpha_0) b_2 + (beta alpha_0 + beta^2 alpha_1 + alpha_1) b_3 + beta b_1 + b_0)
  + a_2 (alpha_1 b_2 + (alpha_0 + beta alpha_1) b_3 + b_1) ] x
 + [ a_2 (alpha_0 b_2 + alpha_1 b_3 + b_0)
   + a_3 ((alpha_0 + beta alpha_1) b_3 + alpha_1 b_2 + b_1) ]

and non-y terms are given by (a_1 x + a_0) (b_1 x + b_0) + (a_3 x + a_2) (b_3 x + b_2).

After multiplier preparation, we therefore have four log table lookups for the a_i and then three additions followed by exp table lookups for each of the three basic products, plus another four additions followed by exp table lookups, so 13 pairs of additions followed by exp table lookups.

burdges commented 3 years ago

We save slightly on multiplier preparation if alpha_0 = 0, but not sure that's possible. I'd assume alpha_1 = 0 is impossible.

burdges commented 3 years ago

We should transform this field extension ( GF(16)[x] / (x^2 + beta x + 1) )[y] / ( y^2 + (alpha_1 x + alpha_0) y + 1 ) into a one variable representation GF(16)[z] / (z^4 + ... + 1) for use with the current code and clmul. How do we select z?

Intuitively, any element not inside any proper subfield should work, so most z = x y + u x + v y + w work, but working out the details gets messy. As an example, z = x y yields

z^2 = x^2 y^2
 = (beta x + 1) ((alpha_1 x + alpha_0) y + 1)
 = alpha_1 beta x^2 y + (alpha_0 beta + alpha_1) x y + alpha_0 y + beta x + 1
 = (alpha_1 beta^2 + alpha_0 beta + alpha_1) x y + (alpha_0 + alpha_1 beta) y + beta x + 1
 = gamma z + delta y + beta x + 1 
where   gamma = alpha_1 (beta^2 + 1) + alpha_0 beta   and   delta = alpha_0 + alpha_1 beta

z^4 = (gamma z + delta y + beta x + 1)^2
 = gamma^2 z^2 + delta^2 y^2 + beta^2 x^2 + 1
 = gamma^2 (gamma z + delta y + beta x + 1)
    + delta^2 ((alpha_1 x + alpha_0) y + 1) 
    + beta^2 (beta x + 1) + 1
 = (gamma^3 + alpha_1 delta^2) z
   + (gamma^2 + alpha_0 delta) delta y 
   + (gamma^2 + beta^2) beta x 
   + gamma^2 + delta^2 + beta^2 + 1
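These reductions can be brute forced by multiplying directly in the tower. The Rust sketch below represents an element as its coefficient vector [c_1, c_x, c_y, c_xy] over GF(2^4) and checks the z^2 reduction, with gamma = alpha_1 (beta^2 + 1) + alpha_0 beta and delta = alpha_0 + alpha_1 beta, for every choice of beta, alpha_1, alpha_0; irreducibility is irrelevant to the reduction identity itself.

```rust
// GF(2^4) = GF(2)[X]/(X^4 + X + 1)
fn gf16_mul(mut a: u8, mut b: u8) -> u8 {
    let mut r = 0u8;
    while b != 0 {
        if b & 1 != 0 { r ^= a; }
        a <<= 1;
        if a & 0x10 != 0 { a ^= 0x13; }
        b >>= 1;
    }
    r
}

// Multiply p, q in ( GF(16)[x]/(x^2 + beta x + 1) )[y]/( y^2 + (a1 x + a0) y + 1 ),
// with elements as coefficient vectors [c_1, c_x, c_y, c_xy] over GF(16).
fn tower_mul(p: [u8; 4], q: [u8; 4], beta: u8, a1: u8, a0: u8) -> [u8; 4] {
    let mut c = [[0u8; 3]; 4]; // c[i][j] = coefficient of x^i y^j
    for k in 0..4 {
        for l in 0..4 {
            let (xi, yi) = (k & 1, k >> 1); // monomial index -> (x-degree, y-degree)
            let (xj, yj) = (l & 1, l >> 1);
            c[xi + xj][yi + yj] ^= gf16_mul(p[k], q[l]);
        }
    }
    // reduce y^2 = (a1 x + a0) y + 1
    for i in 0..3 {
        let t = c[i][2];
        c[i][2] = 0;
        c[i + 1][1] ^= gf16_mul(a1, t);
        c[i][1] ^= gf16_mul(a0, t);
        c[i][0] ^= t;
    }
    // reduce x^3 = beta x^2 + x, then x^2 = beta x + 1
    for j in 0..2 {
        let t3 = c[3][j];
        c[3][j] = 0;
        c[2][j] ^= gf16_mul(beta, t3);
        c[1][j] ^= t3;
        let t2 = c[2][j];
        c[2][j] = 0;
        c[1][j] ^= gf16_mul(beta, t2);
        c[0][j] ^= t2;
    }
    [c[0][0], c[1][0], c[0][1], c[1][1]]
}

// z = x y; check z^2 = gamma z + delta y + beta x + 1 for all parameter choices.
fn check_z2() -> bool {
    for beta in 0..16u8 {
        for a1 in 0..16u8 {
            for a0 in 0..16u8 {
                let z2 = tower_mul([0, 0, 0, 1], [0, 0, 0, 1], beta, a1, a0);
                let gamma = gf16_mul(a1, gf16_mul(beta, beta) ^ 1) ^ gf16_mul(a0, beta);
                let delta = a0 ^ gf16_mul(a1, beta);
                if z2 != [1, beta, delta, gamma] { return false; }
            }
        }
    }
    true
}
```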

These x and y terms look related to the Cantor basis, so maybe we've some fancier trick for solving this?