crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.4k stars 1.62k forks source link

Add operator *%, similar to &*, but for use in modular-multiplication #13308

Open jzakiya opened 1 year ago

jzakiya commented 1 year ago

Proposal

I'm proposing adding the operation *% to be used as (x *% y) % m.

Purpose

Here, (x *% y) will be internally performed as a double-precision multiplication to temporarily use twice the number of bytes for a given Int size to hold the multiplication result before being reduced modulo m for its size.

This will allow all Init::Primitive type modulo math to be done accurately, which now doesn't occur if the (x * y) multiplication overflows (i.e. doesn't fit within the size of m).

Rationale

Crystal added the & operand to be used as &* and &** to prevent certain overflows. However when doing modular-muliplications (x &* y) % m the results will still silently overflow if the resultant multiplication value is greater than the x|y types, causing an incorrect result, while producing no runtime error or warning.

To prevent this, the (x * y) operation must be done as a double-precision multiplication, whose result is temporarily stored as twice the size of the largest x|y int type. Then the modulo reduction is done to m's type size the resultant value will be accurate.

Currently if one performs the following code, (b &* b) will run and silently overflow if the result is greater than bs int size.

def powmod(b, e, m)
  r, b = 1, b % m
  while e > 0
    r = (r &* b) % m if e.odd?
    e >>= 1
    b = (b &* b) % m
  end
  r
end

To prevent this (b *% b) % m would first perform a double-precision multiply, then the modulo reduction, to preserve the full accuracy of the lost precision.

Use Cases

Using math for a given int type is faster than using BigInts. Thus, it's always better to prevent using BigInts as much as possible if highest performance is desired.

The following example shows where this proposal would make a great difference.

This code uses the GMP lib function to perform a powmod.

require "big"

@[Link("gmp")]
lib LibGMP
  fun mpz_powm = __gmpz_powm(rop : MPZ*, base : MPZ*, exp : MPZ*, mod : MPZ*)
end

def powmodgmp(b, e, m)
  y = BigInt.new
  LibGMP.mpz_powm(y, b.to_big_i, e.to_big_i, m.to_big_i)
  y
end

This works with any size int type, by first converting them all to BigInts to use in GMP function.

However when the inputs are not BigInts the Crystal powmod code is faster, but because of the cited issues, will not produce accurate results when the values for b cause overflows. To get the best of both worlds we can create a hybrid powmod function.

require "big"

@[Link("gmp")]
lib LibGMP
  fun mpz_powm = __gmpz_powm(rop : MPZ*, base : MPZ*, exp : MPZ*, mod : MPZ*)
end

def powmodgmp(b, e, m)
  y = BigInt.new
  LibGMP.mpz_powm(y, b.to_big_i, e.to_big_i, m.to_big_i)
  y
end

def powmodint(b, e, m)
  r = typeof(m).new(1)
  b = b % m;  b = typeof(m).new(b)
  while e > 0
    r = (r &* b) % m if e.odd?
    e >>= 1
    b = (b &* b) % m
  end
  r
end

def powmodopt(b, e, m)
  if m > UInt64::MAX
    r = powmodgmp(b, e, m)
  else
    m = m.to_u64
    r = powmodint(b, e, m)
  end
end

In the desired code the &* operations would be replace by *% (or equivalent).

Here powmodint works when the value of m is not (less than) a BigInt, no matter the type of b.

In powmodopt, if the value of m is greater than the largest int type it's a BigInt, so we use the GMP version.

Otherwise we use powmodint, which first reduces the value of b to fit in m and then assigns its type as ms. Now b and m values are of the same type. Thus doing the subsequent math is the fastest possible.

(Because currently, even though a UInt128 is the largest int type, you can't do BigInt.to_(i|u)128, so the code uses if m > UInt64::MAX to check the value of m. See https://github.com/crystal-lang/crystal/issues/12771)

Below are some benchmarks showing the performance for different value cases. Using the GMP version always produces accurate results. The accuracy using powmodopt varies based on the overflow issue but performance still stands.

require "benchmark"

def powmodtests(b, e, m)
  Benchmark.ips do |x|
    x.report("powmodopt") { powmodopt(b, e, m) }
    x.report("powmodgmp") { powmodgmp(b, e, m) }
    puts
  end
end

def tests(b, e, m)
  puts "r = #{r = powmodopt(b, e, m)} of class #{r.class} using powmodopt"
  puts "r = #{r = powmodgmp(b, e, m)} of class #{r.class} using powmodgmp"
  powmodtests(b, e, m)
  puts
end

puts
b = "329832983".to_big_i
e = 4843
m = "498422".to_big_i
tests b, e, m

b = "329832983".to_big_i
e = 4843
m = 498422.to_i64
tests b, e, m

b = "329832983".to_big_i
e = 4843
m = 498422.to_u64
tests b, e, m

b = "329832983".to_big_i
e = "4433923223".to_big_i
m = "498422".to_big_i
tests b, e, m

b = "329832983".to_big_i
e = 4433923223
m = 498422.to_i64
tests b, e, m

b = "329832983".to_big_i
e = 4433923223u64
m = 498422.to_u64
tests b, e, m

b = "3298329804304383423".to_big_i
e = 4433923224384234
m = "2372095723823498422".to_big_i
tests b, e, m

b = "3298329804304383423".to_big_i
e = 4433923224384234
m = 2372095723823498422.to_i64
tests b, e, m

b = "3298329804304383423".to_big_i
e = 4433923224384234
m = 2372095723823498422.to_u64
tests b, e, m

And here are the performance and accuracy results. Except in one case (for some reason) powmodopt is generally 3x faster.

Because these types of operations are generally used inside loops for many algorithms you can see the potential overall algorithmic performance increase with this feature.

r = 107323 of class UInt64 using powmodopt
r = 107323 of class BigInt using powmodgmp

powmodopt  12.13M ( 82.43ns) (± 0.30%)  16.0B/op        fastest
powmodgmp   3.88M (257.88ns) (± 0.20%)  32.0B/op   3.13× slower

r = 107323 of class UInt64 using powmodopt
r = 107323 of class BigInt using powmodgmp

powmodopt  14.60M ( 68.47ns) (± 0.21%)  16.0B/op        fastest
powmodgmp   3.73M (268.13ns) (± 0.46%)  48.0B/op   3.92× slower

r = 107323 of class UInt64 using powmodopt
r = 107323 of class BigInt using powmodgmp

powmodopt  14.59M ( 68.53ns) (± 0.28%)  16.0B/op        fastest
powmodgmp   3.71M (269.64ns) (± 0.41%)  48.0B/op   3.93× slower

r = 31565 of class UInt64 using powmodopt
r = 31565 of class BigInt using powmodgmp

powmodopt   1.23M (811.82ns) (± 0.26%)  1.05kB/op   1.86× slower
powmodgmp   2.29M (435.90ns) (± 0.25%)   16.0B/op        fastest

r = 31565 of class UInt64 using powmodopt
r = 31565 of class BigInt using powmodgmp

powmodopt   7.22M (138.57ns) (± 0.31%)  16.0B/op        fastest
powmodgmp   2.20M (454.80ns) (± 0.77%)  48.0B/op   3.28× slower

r = 31565 of class UInt64 using powmodopt
r = 31565 of class BigInt using powmodgmp

powmodopt   7.31M (136.86ns) (± 0.13%)  16.0B/op        fastest
powmodgmp   2.19M (456.24ns) (± 0.19%)  48.0B/op   3.33× slower

r = 288018135440070645 of class UInt64 using powmodopt
r = 1898779714053406193 of class BigInt using powmodgmp

powmodopt   4.73M (211.52ns) (± 0.08%)  16.0B/op        fastest
powmodgmp   1.71M (585.48ns) (± 0.17%)  32.0B/op   2.77× slower

r = 288018135440070645 of class UInt64 using powmodopt
r = 1898779714053406193 of class BigInt using powmodgmp

powmodopt   5.03M (198.83ns) (± 0.08%)  16.0B/op        fastest
powmodgmp   1.68M (596.07ns) (± 0.12%)  48.0B/op   3.00× slower

r = 288018135440070645 of class UInt64 using powmodopt
r = 1898779714053406193 of class BigInt using powmodgmp

powmodopt   5.07M (197.16ns) (± 0.09%)  16.0B/op        fastest
powmodgmp   1.67M (599.49ns) (± 0.16%)  48.0B/op   3.04× slower
HertzDevil commented 1 year ago

Does it have to be a new operator? Can't it be a regular method?

Also a naive implementation would require one to also implement Int256#%(Int128). Int256 doesn't even exist.

jzakiya commented 1 year ago

When I say operator I was referring to its use not its implementation details. So I would expect it to be a method, just as +, /, *, and - are.

I don't understand the second question about Int256#(Int128). There are currently no Int256, as you say, so that could never come into the equation currently. Unless I'm missing something in your question. If I am can you give me an example.

HertzDevil commented 1 year ago

When I say operator I was referring to its use not its implementation details.

But does it have to have an operator name? Alphanumeric names are preferable because they don't require modifications to Crystal's syntax.

To prevent this (b *% b) % m would first perform a double-precision multiply, then the modulo reduction, to preserve the full accuracy of the lost precision.

That means if b and m are Int128s, then Int128#*%(Int128) should return an Int256, and the result of that should call Int256#%(Int128) to return an Int128. Isn't that what you mean?

jzakiya commented 1 year ago

If you want to use something like mult_mod(x, y, m) that would be alright with me. In fact, I think something like that would be the best, and cause the least confusion and language change.

In the example for (b * b) % m, b and m are the same type (which generically can be any Int Type), but per your questions let's say they are UInt128s. Then a double-precision multiplication would fit into the equivalent of a UInt256, and then accurately reduced into a UInt128. (In most of my numeric algorithms all the values are inherently UInts and not signed values.)

For the needs of this algorithm, I always set b the same type as m. In a more generic version, say mult_mod(x, y, m) = (x * y) % m, (x * y) would use the largest int type to do the multiplication in, whose result is then reduced into the int type for m. As long as you get the full accurate value of the multiplication it can be reduced accurately modulo any int type.

I hope that answers the questions.

jzakiya commented 1 year ago

So using the suggested semantics I could write powmodopt like this, if I want to differentiate between 3 int types.

def powmodint(b, e, m)
  r = typeof(m).new(1)
  b = b % m;  b = typeof(m).new(b)
  while e > 0
    r = multmod(r,b,m) if e.odd?
    e >>= 1
    b = multmod(b,b,m)
  end
  r
end

def powmodopt(b, e, m)
  if m <= UInt32::MAX
    m = m.to_u32
    r = powmodint(b, e, m)
  elsif m <= UInt64::MAX
    m = m.to_u64
    r = powmodint(b, e, m)
  elsif m <= UInt128::MAX
    m = m.to_u128
    r = powmodint(b, e, m)
  else
    r = powmodgmp(b, e, m)
  end
end
straight-shoota commented 1 year ago

I think an integrated method would be preferrable for this. The semantics of Int#% would result in the type of the entire expression to be the same as that of the *% call: typeof((x *% y) % m) == typeof(x *% y). The latter is double-precision as an intermediary step and leaking that type out would be unintended. A single method can entirely encapsulate the internal implementation.

jzakiya commented 1 year ago

At the bottom of this wikipedia article on modular arithmetic it gives examples in C for doing fast modular multiplication. It shows algorithms for using UInt64s but can probably be also used for UInt128. Maybe it will helpful in conceptualizing an approach conducive to using LLVM.

https://en.wikipedia.org/wiki/Modular_arithmetic

jzakiya commented 1 year ago

I just did a quick hack to see if conceptually using an effective dbl-prec multiply gives right answers, it does. But here, u128 mults are much slower, :-( -- but the answers are all accurate :-). So effectively doing the equivalent of the mults fast and accurate is the essential piece here.

Here's the hack code.

def mulmod(x, y, m)
  t = x.to_u128 * y
  r = typeof(m).new(t % m)
end

def powmodint(b, e, m)
  r = typeof(m).new(1)
  b = b % m;  b = typeof(m).new(b)
  while e > 0
    r = mulmod(r, b, m) if e.odd?
    e >>= 1
    b = mulmod(b, b, m)
  end
  r
end

Here are the tests results.

r = 107323 of class UInt64 using powmodopt
r = 107323 of class BigInt using powmodgmp

powmodopt   5.81M (172.25ns) (± 0.80%)  16.0B/op        fastest
powmodgmp   3.72M (268.56ns) (± 1.19%)  32.0B/op   1.56× slower

r = 107323 of class UInt64 using powmodopt
r = 107323 of class BigInt using powmodgmp

powmodopt   6.09M (164.20ns) (± 0.99%)  16.0B/op        fastest
powmodgmp   3.61M (277.28ns) (± 0.74%)  48.0B/op   1.69× slower

r = 107323 of class UInt64 using powmodopt
r = 107323 of class BigInt using powmodgmp

powmodopt   6.10M (163.96ns) (± 0.82%)  16.0B/op        fastest
powmodgmp   3.61M (277.02ns) (± 0.69%)  48.0B/op   1.69× slower

r = 31565 of class UInt64 using powmodopt
r = 31565 of class BigInt using powmodgmp

powmodopt   1.04M (961.61ns) (± 0.31%)  1.05kB/op   2.15× slower
powmodgmp   2.24M (446.30ns) (± 0.36%)   16.0B/op        fastest

r = 31565 of class UInt64 using powmodopt
r = 31565 of class BigInt using powmodgmp

powmodopt   3.08M (324.58ns) (± 0.32%)  16.0B/op        fastest
powmodgmp   2.17M (461.70ns) (± 0.31%)  48.0B/op   1.42× slower

r = 31565 of class UInt64 using powmodopt
r = 31565 of class BigInt using powmodgmp

powmodopt   3.11M (322.05ns) (± 0.31%)  16.0B/op        fastest
powmodgmp   2.15M (464.39ns) (± 0.36%)  48.0B/op   1.44× slower

r = 1898779714053406193 of class UInt64 using powmodopt
r = 1898779714053406193 of class BigInt using powmodgmp

powmodopt 802.93k (  1.25µs) (± 0.22%)  16.0B/op   2.09× slower
powmodgmp   1.67M (597.10ns) (± 0.41%)  32.0B/op        fastest

r = 1898779714053406193 of class UInt64 using powmodopt
r = 1898779714053406193 of class BigInt using powmodgmp

powmodopt 806.78k (  1.24µs) (± 0.25%)  16.0B/op   2.04× slower
powmodgmp   1.64M (608.75ns) (± 0.28%)  48.0B/op        fastest

r = 1898779714053406193 of class UInt64 using powmodopt
r = 1898779714053406193 of class BigInt using powmodgmp

powmodopt 806.30k (  1.24µs) (± 0.20%)  16.0B/op   2.04× slower
powmodgmp   1.65M (607.63ns) (± 0.39%)  48.0B/op        fastest
jzakiya commented 1 year ago

This optimizes for when both x|y fit within 32-bits.

def mulmod(x, y, m)
  return (x * y) % m if ((x | y) >> 32).zero?
  t = x.to_u128 * y
  r = typeof(m).new(t % m)
end
jzakiya commented 1 year ago

This code won't compile because get no to_u128 method for BigInts error.

def powmod(b, e, m)
  if m > UInt64::MAX
    r = powmodgmp(b, e, m)
  else
    n = m.to_u128
    r = powmodint(b, e, n)
  end
end

But doing some experimenting, I found this code compiles and runs correctly.

def powmod(b, e, m)
  if m > UInt64::MAX
    r = powmodgmp(b, e, m)
  else
    n = 1u128 * m
    r = powmodint(b, e, n)
  end
end

When I switch multiplication order I was expecting an error, but this code compiles but hangs on certain values for m.

def powmod(b, e, m)
  if m > UInt64::MAX
    r = powmodgmp(b, e, m)
  else
    n = m * 1.to_u128 
    r = powmodint(b, e, n)
  end
end

Doing the math for 32-bit values converted to 128-bits is slower than using 64-bits, but faster for values > 64-bits and < BigInts. So (currently) something like this might be the best (I haven't tried to see if it compiles and runs yet though).

def powmod(b, e, m)
  if m <= UInt32::MAX
    n = m.to_u64
    r = powmodint(b, e, n)
  elsif m <= UInt64::MAX
    n = 1u128 * m
    r = powmodint(b, e, n)
  else
    r = powmodgmp(b, e, m)
  end
end

ADDED: The code compiles and runs but it's overall a bit slower doing the u128 multiplies vs using u64s, at least on my 64-bit AMD cpu. Now if I had a 128-bit cpu.... :-)