jzakiya opened this issue 1 year ago
Does it have to be a new operator? Can't it be a regular method?
Also a naive implementation would require one to also implement `Int256#%(Int128)`. `Int256` doesn't even exist.
When I say operator, I was referring to its use, not its implementation details. So I would expect it to be a method, just as `+`, `/`, `*`, and `-` are.

I don't understand the second question about `Int256#%(Int128)`. There is currently no `Int256`, as you say, so that could never come into the equation currently. Unless I'm missing something in your question. If I am, can you give me an example?
> When I say operator I was referring to its use not its implementation details.
But does it have to have an operator name? Alphanumeric names are preferable because they don't require modifications to Crystal's syntax.
> To prevent this `(b *% b) % m` would first perform a double-precision multiply, then the modulo reduction, to preserve the full accuracy of the lost precision.
That means if `b` and `m` are `Int128`s, then `Int128#*%(Int128)` should return an `Int256`, and the result of that should call `Int256#%(Int128)` to return an `Int128`. Isn't that what you mean?
If you want to use something like `mult_mod(x, y, m)` that would be alright with me. In fact, I think something like that would be the best, and cause the least confusion and language change.
In the example for `(b * b) % m`, `b` and `m` are the same type (which generically can be any `Int` type), but per your questions let's say they are `UInt128`s. Then a double-precision multiplication would fit into the equivalent of a `UInt256`, and then be accurately reduced into a `UInt128`. (In most of my numeric algorithms all the values are inherently `UInt`s and not signed values.)

For the needs of this algorithm, I always set `b` to the same type as `m`.
In a more generic version, say `mult_mod(x, y, m) = (x * y) % m`, `(x * y)` would use the largest int type to do the multiplication in, whose result is then reduced into the int type of `m`. As long as you get the full accurate value of the multiplication, it can be reduced accurately modulo any int type.

I hope that answers the questions.
So using the suggested semantics I could write `powmodopt` like this, if I want to differentiate between 3 int types.
def powmodint(b, e, m)
  r = typeof(m).new(1)
  b = b % m; b = typeof(m).new(b)
  while e > 0
    r = multmod(r, b, m) if e.odd?
    e >>= 1
    b = multmod(b, b, m)
  end
  r
end
def powmodopt(b, e, m)
  if m <= UInt32::MAX
    m = m.to_u32
    r = powmodint(b, e, m)
  elsif m <= UInt64::MAX
    m = m.to_u64
    r = powmodint(b, e, m)
  elsif m <= UInt128::MAX
    m = m.to_u128
    r = powmodint(b, e, m)
  else
    r = powmodgmp(b, e, m)
  end
end
I think an integrated method would be preferable for this. The semantics of `Int#%` would result in the type of the entire expression being the same as that of the `*%` call: `typeof((x *% y) % m) == typeof(x *% y)`. The latter is double-precision as an intermediary step, and leaking that type out would be unintended. A single method can entirely encapsulate the internal implementation.
At the bottom of this Wikipedia article on modular arithmetic it gives examples in C for doing fast modular multiplication. It shows algorithms using `UInt64`s, but they can probably also be used for `UInt128`. Maybe it will be helpful in conceptualizing an approach conducive to using LLVM.
I just did a quick hack to see if conceptually using an effective double-precision multiply gives right answers; it does. But here, `UInt128` mults are much slower :-( -- though the answers are all accurate :-). So doing the equivalent of the mults fast and accurately is the essential piece here. Here's the hack code.
def mulmod(x, y, m)
  t = x.to_u128 * y    # double-precision product, can't overflow
  typeof(m).new(t % m) # reduce back into m's type
end
def powmodint(b, e, m)
  r = typeof(m).new(1)
  b = b % m; b = typeof(m).new(b)
  while e > 0
    r = mulmod(r, b, m) if e.odd?
    e >>= 1
    b = mulmod(b, b, m)
  end
  r
end
Here are the test results.
r = 107323 of class UInt64 using powmodopt
r = 107323 of class BigInt using powmodgmp
powmodopt 5.81M (172.25ns) (± 0.80%) 16.0B/op fastest
powmodgmp 3.72M (268.56ns) (± 1.19%) 32.0B/op 1.56× slower
r = 107323 of class UInt64 using powmodopt
r = 107323 of class BigInt using powmodgmp
powmodopt 6.09M (164.20ns) (± 0.99%) 16.0B/op fastest
powmodgmp 3.61M (277.28ns) (± 0.74%) 48.0B/op 1.69× slower
r = 107323 of class UInt64 using powmodopt
r = 107323 of class BigInt using powmodgmp
powmodopt 6.10M (163.96ns) (± 0.82%) 16.0B/op fastest
powmodgmp 3.61M (277.02ns) (± 0.69%) 48.0B/op 1.69× slower
r = 31565 of class UInt64 using powmodopt
r = 31565 of class BigInt using powmodgmp
powmodopt 1.04M (961.61ns) (± 0.31%) 1.05kB/op 2.15× slower
powmodgmp 2.24M (446.30ns) (± 0.36%) 16.0B/op fastest
r = 31565 of class UInt64 using powmodopt
r = 31565 of class BigInt using powmodgmp
powmodopt 3.08M (324.58ns) (± 0.32%) 16.0B/op fastest
powmodgmp 2.17M (461.70ns) (± 0.31%) 48.0B/op 1.42× slower
r = 31565 of class UInt64 using powmodopt
r = 31565 of class BigInt using powmodgmp
powmodopt 3.11M (322.05ns) (± 0.31%) 16.0B/op fastest
powmodgmp 2.15M (464.39ns) (± 0.36%) 48.0B/op 1.44× slower
r = 1898779714053406193 of class UInt64 using powmodopt
r = 1898779714053406193 of class BigInt using powmodgmp
powmodopt 802.93k ( 1.25µs) (± 0.22%) 16.0B/op 2.09× slower
powmodgmp 1.67M (597.10ns) (± 0.41%) 32.0B/op fastest
r = 1898779714053406193 of class UInt64 using powmodopt
r = 1898779714053406193 of class BigInt using powmodgmp
powmodopt 806.78k ( 1.24µs) (± 0.25%) 16.0B/op 2.04× slower
powmodgmp 1.64M (608.75ns) (± 0.28%) 48.0B/op fastest
r = 1898779714053406193 of class UInt64 using powmodopt
r = 1898779714053406193 of class BigInt using powmodgmp
powmodopt 806.30k ( 1.24µs) (± 0.20%) 16.0B/op 2.04× slower
powmodgmp 1.65M (607.63ns) (± 0.39%) 48.0B/op fastest
This optimizes for the case when both `x` and `y` fit within 32 bits.
def mulmod(x, y, m)
  # both operands fit in 32 bits: the 64-bit product can't overflow
  return (x * y) % m if ((x | y) >> 32).zero?
  t = x.to_u128 * y
  typeof(m).new(t % m)
end
This code won't compile because I get a `no to_u128 method for BigInt` error.
def powmod(b, e, m)
  if m > UInt64::MAX
    r = powmodgmp(b, e, m)
  else
    n = m.to_u128
    r = powmodint(b, e, n)
  end
end
But after some experimenting, I found this code compiles and runs correctly.
def powmod(b, e, m)
  if m > UInt64::MAX
    r = powmodgmp(b, e, m)
  else
    n = 1u128 * m
    r = powmodint(b, e, n)
  end
end
When I switched the multiplication order I was expecting an error, but this code compiles and then hangs on certain values of `m`.
def powmod(b, e, m)
  if m > UInt64::MAX
    r = powmodgmp(b, e, m)
  else
    n = m * 1.to_u128
    r = powmodint(b, e, n)
  end
end
Doing the math for 32-bit values converted to 128 bits is slower than using 64 bits, but faster for values > 64 bits and < `BigInt`s. So (currently) something like this might be the best (I haven't tried to see if it compiles and runs yet, though).
def powmod(b, e, m)
  if m <= UInt32::MAX
    n = m.to_u64
    r = powmodint(b, e, n)
  elsif m <= UInt64::MAX
    n = 1u128 * m
    r = powmodint(b, e, n)
  else
    r = powmodgmp(b, e, m)
  end
end
ADDED:
The code compiles and runs, but it's overall a bit slower doing the `UInt128` multiplies vs using `UInt64`s, at least on my 64-bit AMD CPU. Now if I had a 128-bit CPU.... :-)
Proposal

I'm proposing adding the operation `*%` to be used as `(x *% y) % m`.

Purpose

Here, `(x *% y)` will be internally performed as a double-precision multiplication, temporarily using twice the number of bytes of a given `Int` size to hold the multiplication result before it is reduced modulo `m` for its size.

This will allow all `Int::Primitive` type modulo math to be done accurately, which now doesn't occur if the `(x * y)` multiplication overflows (i.e. doesn't fit within the size of `m`).

Rationale
Crystal added the `&` prefix to be used as `&*` and `&**` to prevent certain overflows. However, when doing modular multiplications `(x &* y) % m`, the results will still silently overflow if the resultant multiplication value is greater than the `x|y` types, causing an incorrect result while producing no runtime error or warning.

To prevent this, the `(x * y)` operation must be done as a double-precision multiplication, whose result is temporarily stored as twice the size of the largest `x|y` int type. Then, once the modulo reduction is done to `m`'s type size, the resultant value will be accurate.

Currently, if one performs the following code, `(b &* b)` will run and silently overflow if the result is greater than `b`'s int size. To prevent this, `(b *% b) % m` would first perform a double-precision multiply, then the modulo reduction, to preserve the full accuracy of the lost precision.

Use Cases
Using math for a given int type is faster than using `BigInt`s. Thus it's always better to avoid using `BigInt`s as much as possible if highest performance is desired. The following example shows where this proposal would make a great difference.

This code uses the GMP lib function to perform a powmod. This works with any size int type, by first converting them all to `BigInt`s to use in the GMP function.

However, when the inputs are not `BigInt`s the Crystal powmod code is faster, but because of the cited issues it will not produce accurate results when the values for `b` cause overflows. To get the best of both worlds we can create a hybrid powmod function. In the desired code the `&*` operations would be replaced by `*%` (or equivalent).

Here `powmodint` works when the value of `m` is not (less than) a `BigInt`, no matter the type of `b`. In `powmodopt`, if the value of `m` is greater than the largest int type it's a `BigInt`, so we use the GMP version. Otherwise we use `powmodint`, which first reduces the value of `b` to fit in `m` and then assigns it `m`'s type. Now the `b` and `m` values are of the same type, so doing the subsequent math is the fastest possible.

(Because currently, even though `UInt128` is the largest int type, you can't do `BigInt.to_(i|u)128`, the code uses `if m > UInt64::MAX` to check the value of `m`. See https://github.com/crystal-lang/crystal/issues/12771)

Below are some benchmarks showing the performance for different value cases. Using the GMP version always produces accurate results. The accuracy using `powmodopt` varies based on the overflow issue, but the performance comparison still stands.

And here are the performance and accuracy results. Except in one case (for some reason) `powmodopt` is generally 3x faster. Because these types of operations are generally used inside loops in many algorithms, you can see the potential overall algorithmic performance increase with this feature.