Field/Modular arithmetic design to support efficient arithmetic on any moduli (including even)

mratsim commented 2 years ago

edited: Field == Modular arithmetic for prime modulus only

Some ideas of modular arithmetic design.

Many algorithms are only applicable to prime moduli (Legendre symbol, Fermat's Little Theorem, Square roots ...) or odd moduli (Jacobi symbol, GCD and Modular inversion via extended Euclid, Montgomery arithmetic). There are also specialized routines for binary extension fields (mod 2ᵏ) and ternary extension fields (mod 3ᵏ) but there is nothing for generic even moduli.

What we can do is represent generic modulus on a dual base power-of-two 2ᵏ and odd similar to the Residue Number System representation. The results of all operations can be recombined using the Chinese Remainder Theorem.

Concretely we would have

when sizeof(int) == 8 and not defined(Megalo32):
  type Word* = uint64
else:
  type Word* = uint32

type
  BigInt* = object
    ## Public BigInt repr
    raw: RawInt
    isNeg: bool
  RawInt = object
    ## Backend BigInt repr, unsigned
    limbs: seq[Word]

  ModKind = enum
    kPrime
    kOdd
    kBinary
    kTernary
    kGeneric

  PrimeModular* = object
    ## element (mod p) with p prime
    mres: RawInt ## Montgomery Residue repr
  OddModular* = object
    ## element (mod m) with m odd
    mres: RawInt ## Montgomery Residue repr
  BinaryModular* = object
    ## element (mod 2ᵏ)
    raw: RawInt
  TernaryModular* = object
    ## element (mod 3ᵏ)
    raw: RawInt
  Modular* = object
    ## element (mod m)
    oddBase: OddModular
    binBase: BinModular

  PrimeModulus* = object
    # Context object for prime fields
    raw: RawInt
    m0ninv: Word ## 1/p₀ (mod 2⁶⁴), assuming 64-bit word
    # other metadata ...

  OddModulus* = object
    # Context object for odd modular arithmetic
    raw: RawInt
    m0ninv: Word ## 1/p₀ (mod 2⁶⁴), assuming 64-bit word
    # other metadata ...

  BinaryModulus* = object
    # Context object for binary extension fields
    # m = 2ᵏ
    k: int
    # other metadata ...

  TernaryModulus* = object
    # Context object for ternary extension fields
    # m = 3ᵏ
    k: int
    # other metadata ...

  Modulus* = object
    # Context object for arithmetic with any modulus
    # Alternatively, use an object variant
    oddBase: OddModulus
    binBase: BinaryModulus

Operations would then have the form:

func sum*(r: var Modular, a, b, Modular, m: Modulus)
func add*(a, b: Modular, m: Modulus): Modular

With the convention for sum for operation that modify a buffer and add for out-of-place functions.

Alternatively we can have the Modulus object be carried by Modular elements to allow convenient operators:

type
  PrimeModular* = object
    ## element (mod p) with p prime
    mres: RawInt ## Montgomery Residue repr
    modulus: PrimeModulus

  PrimeModulus = ref object
    # Context object for prime fields
    raw: RawInt
    m0ninv: Word ## 1/p₀ (mod 2⁶⁴), assuming 64-bit word
    # other metadata ...

func sum*(r: var Modular, a, b, Modular)
func `+`*(a, b: Modular): Modular
func `+=`*(a: var Modular, b: Modular)

This would require making it a ref object.

pietroppeter commented 2 years ago

sounds all very nice (not sure what the advantage of this approach - ergonomics? performance? correctness? - but I guess it is worth exploring), I found it odd (as a trained mathematician) to hear about OddField and in fact I think you mean modular arithmetic instead of field arithmetic, and odd field should be ring of integers modulo an odd integer and so on (Binary and Ternary fields are not fields...).

mratsim commented 2 years ago

Oops good point, I was on the implementation details and you're right. I'll change field to Modular. Binary fields do exist since 2 is prime though GF(2ᵏ) is technically a "binary extension field".

Ergonomics-wise, we can expose just PrimeField and the Modular type since they are likely to be the most used for starter.

Performance-wise it should be a win since RNS bases are very often used to accelerate cryptography in hardware ASIC/FPGA. In particular, we can use Montgomery representation for odd moduli but we can't for even, this avoids expensive modulo operation. And modulo a power-of-two is also trivial.

Correctness-wise, what do you mean? I'm not out to write an incorrect library :sweat_smile:

pietroppeter commented 2 years ago

Oops good point, I was on the implementation details and you're right. I'll change field to Modular.

👍 asbolutely, defintely sounded like a unintended slip while focusing on other stuff 😄

Binary fields do exist since 2 is prime though GF(2ᵏ) is technically a "binary extension field".

yes, you are right that field of power of a prime do exist but they are not the ring of integers module a prime power

Ergonomics-wise, we can expose just PrimeField and the Modular type since they are likely to be the most used for starter.

seems appropriate

Performance-wise it should be a win since RNS bases are very often used to accelerate cryptography in hardware ASIC/FPGA.

absolutely trust you on that :)

Correctness-wise, what do you mean? I'm not out to write an incorrect library 😅

mmh, probably I meant something more like safe-ness? the concept I was trying to express was helping the user of the library to avoid using the wrong algorithm for a given context (similar but not exactly like we protect the user to sum a float and an int)

anyway sounds this approach could give an improvement on all fronts!

mratsim commented 2 years ago

mmh, probably I meant something more like safe-ness? the concept I was trying to express was helping the user of the library to avoid using the wrong algorithm for a given context (similar but not exactly like we protect the user to sum a float and an int)

Oh that would be just internal details, external API would not expose all that, especially initially when we figure stuff out.

mratsim commented 2 years ago

Some benchmarks with a 381-bit moduli.

Note: My modular reduction is constant-time for the algorithm (i.e. always does iterations to cover the worst cases, even if a < m or if a and m have close number of bits), but it does use hardware division at the moment as I didn't have the time to reimplement hardware division. Concretely it simulates the worst case modulus of a generic bigint library

# To be placed in the "./build/" folder
# of unmerged branch
# https://github.com/mratsim/constantine/tree/mont_repr_refactor

import
    ../benchmarks/[bench_blueprint],
    ../constantine/config/[curves, type_ff],
    ../constantine/arithmetic/[bigints, finite_fields],
    ../constantine/io/io_bigints,
    ../helpers/prng_unsafe

proc report(op, desc: string, start, stop: MonoTime, startClk, stopClk: int64, iters: int) =
  let ns = inNanoseconds((stop-start) div iters)
  let throughput = 1e9 / float64(ns)
  when SupportsGetTicks:
    echo &"{op:<70} {desc:<18} {throughput:>15.3f} ops/s     {ns:>9} ns/op     {(stopClk - startClk) div iters:>9} CPU cycles (approx)"
  else:
    echo &"{op:<70} {desc:<18} {throughput:>15.3f} ops/s     {ns:>9} ns/op"

template bench(op: string, desc: string, iters: int, body: untyped): untyped =
  block:
    measure(iters, startTime, stopTime, startClk, stopClk, body)
    report(op, desc, startTime, stopTime, startClk, stopClk, iters)

# warmup
proc warmup*() =
  # Warmup - make sure cpu is on max perf
  let start = cpuTime()
  var foo = 123
  for i in 0 ..< 300_000_000:
    foo += i*i mod 456
    foo = foo mod 789

  # Compiler shouldn't optimize away the results as cpuTime rely on sideeffects
  let stop = cpuTime()
  echo &"Warmup: {stop - start:>4.4f} s, result {foo} (displayed to avoid compiler optimizing warmup away)\n"

warmup()

proc mulModBench*(iters: int) =
  # BLS12-381 prime
  const m = BigInt[381].fromHex"0x1a0111ea397fe69a4b1ba7b6434bacd764774b84f38512bf6730d2a0f6b0f6241eabfffeb153ffffb9feffffffffaaab"

  var a = rng.random_unsafe(BigInt[381])
  a.reduce(a, m)
  var b = rng.random_unsafe(BigInt[381])
  b.reduce(b, m)

  var r: BigInt[381]
  var t: BigInt[762]

  # mod has no branches but use hardware DIV
  bench("Mul + constant-time+hardware mod (a.k.a always worst-case)", "BLS12-381", iters):
    t.prod(a, b)
    r.reduce(t, m)

  let x = rng.random_unsafe(Fp[BLS12_381])
  let y = rng.random_unsafe(Fp[BLS12_381])

  var z: Fp[BLS12_381]

  bench("Montgomery multiplication", "BLS12-381", iters):
    z.prod(x, y)

  bench("BigInt <-> Montgomery roundtrip", "BLS12-381", iters):
    r.fromField(z)
    z.fromBig(r)

mulModBench(10000)

mratsim commented 10 months ago

See Implementing fast modular exponentiation - a guide - https://github.com/status-im/nim-stint/issues/126

And implementation: https://github.com/mratsim/constantine/pull/242#issuecomment-1568223941

Speed is at least 85% of GMP of modular exponentiation, the most CPU intensive modular arithmetic primitive.

It confirms that splitting between odd modulus and power-of-two modulus and recombining when needed via the Chinese Remainder Theorem is the way to go.

SciNim / theo

Field/Modular arithmetic design to support efficient arithmetic on any moduli (including even) #2