jzakiya opened this issue 1 year ago
Something weird is going on whose cause I haven't tracked down yet.
Below is a test suite for the method that shows the performance difference in my codebase.
If you put it in a file and run it as `crystal run --release <filename.cr>`,
the benchmarks show no performance difference for this method, but when it runs inside the entire codebase there is one.
Let me see if I can figure out what's going on, because it's not a slight difference in the codebase.
```crystal
require "big"
require "benchmark"

@[Link("gmp")]
lib LibGMP
  fun mpz_powm = __gmpz_powm(rop : MPZ*, base : MPZ*, exp : MPZ*, mod : MPZ*)
end

def powmodgmp(b, e, m)
  y = BigInt.new
  LibGMP.mpz_powm(y, b.to_big_i, e.to_big_i, m.to_big_i)
  y
end

def powmodint(b, e, m)
  r = typeof(m).new(1)
  b = b % m; b = typeof(m).new(b)
  while e > 0
    r = (r &* b) % m if e.odd?
    e >>= 1
    b = (b &* b) % m
  end
  r
end

def powmod(b, e, m)
  if m > UInt32::MAX
    r = powmodgmp(b, e, m)
  else
    n = m.to_u64
    r = powmodint(b, e, n)
  end
end

module Primes
  module Utils
    def prime?(k = 5)
      primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103]
      return primes.includes? self if self <= primes.last
      modpn = 232862364358497360900063316880507363070u128.to_big_i
      return false if modpn.gcd(self.to_big_i) != 1
      return true if self < primes.last**2
      # Start Miller-Rabin primality test
      neg_one_mod = n = d = self - 1 # these are even as self is always odd
      d >>= d.trailing_zeros_count   # shift out factors of 2 to make d odd
      witnesses = k.times.map { rand(self - 4) + 2 }
      witnesses.each do |b|          # perform Miller-Rabin bases tests
        y = powmod(b, d, self)       # y = (b**d) mod self
        s = d                        # s is odd d here
        until y == 1 || y == neg_one_mod || s == n
          y = (y &* y) % self        # y = (y**2) mod self
          s <<= 1
        end
        return false unless y == neg_one_mod || s.odd?
      end
      true
    end

    def next_prime1
      # pre-1.9.x form: BigInt -> UInt128 via a string round-trip
      n = self > UInt128::MAX - self ? self.to_big_i : typeof(1u128).new(self.to_s.to_u128)
      return (n >> 1) + 2 if n <= 2
      n = n + 1 | 1
      until (res = n % 6) & 0b11 == 1; n += 2 end
      inc = (res == 1) ? 4 : 2
      until n.prime?; n += inc; inc ^= 0b110; end
      n
    end

    def next_prime2
      # 1.9.x form: BigInt#to_u128 works directly
      n = self > UInt128::MAX - self ? self.to_big_i : typeof(1u128).new(self.to_u128)
      return (n >> 1) + 2 if n <= 2
      n = n + 1 | 1
      until (res = n % 6) & 0b11 == 1; n += 2 end
      inc = (res == 1) ? 4 : 2
      until n.prime?; n += inc; inc ^= 0b110; end
      n
    end
  end
end

struct Int; include Primes::Utils end

def test(n)
  Benchmark.ips do |x|
    x.report("next_prime1") { n.next_prime1 }
    x.report("next_prime2") { n.next_prime2 }
    puts
  end
end

n = 303_371_455_241
test n
n = 8446744073709551627
test n
n = 8446744073709551556
test n
n = 18446744073709551556u64
test n
n = 18446744073709551557u64
test n
n = "18446744073709551629".to_big_i
test n
n = 100_000_000_000_000
test n
n = 1_134_250_000_000_000_000
test n
n = "7000768250000000000000".to_big_i
test n
```
Here were the outputs on my system:

```
next_prime1  23.08k ( 43.33µs) (± 0.99%)  48.3kB/op  1.00× slower
next_prime2  23.08k ( 43.32µs) (± 0.21%)  48.3kB/op       fastest
next_prime1 201.25k (  4.97µs) (± 0.26%)  3.22kB/op  1.01× slower
next_prime2 202.92k (  4.93µs) (± 0.29%)  3.17kB/op       fastest
next_prime1 214.01k (  4.67µs) (± 0.31%)  2.47kB/op  1.02× slower
next_prime2 217.42k (  4.60µs) (± 0.15%)  2.42kB/op       fastest
next_prime1 419.50k (  2.38µs) (± 0.93%)    745B/op  1.03× slower
next_prime2 432.64k (  2.31µs) (± 0.25%)    696B/op       fastest
next_prime1  61.90k ( 16.15µs) (± 0.17%)  16.0kB/op  1.00× slower
next_prime2  62.07k ( 16.11µs) (± 0.15%)  15.9kB/op       fastest
next_prime1 143.25k (  6.98µs) (± 0.24%)  3.69kB/op  1.03× slower
next_prime2 147.22k (  6.79µs) (± 0.21%)  3.64kB/op       fastest
next_prime1 281.57k (  3.55µs) (± 0.20%)  3.11kB/op  1.02× slower
next_prime2 286.52k (  3.49µs) (± 0.17%)  3.08kB/op       fastest
next_prime1 199.42k (  5.01µs) (± 0.16%)  3.31kB/op  1.01× slower
next_prime2 202.11k (  4.95µs) (± 0.18%)  3.26kB/op       fastest
next_prime1 185.87k (  5.38µs) (± 0.19%)  2.67kB/op  1.03× slower
next_prime2 191.75k (  5.22µs) (± 0.22%)  2.62kB/op       fastest
```
OK, I found the culprit.
In the code for `prime?` I originally had `modpn` written as this:

```crystal
modpn = 232862364358497360900063316880507363070u128.to_big_i
```

Then I was wondering (I couldn't remember) why I didn't write it as this:

```crystal
modpn = "232862364358497360900063316880507363070".to_big_i
```

So I changed it, but didn't benchmark it before I made the other change. But this is the culprit, and this one little change makes a humongous performance regression. Then I remembered why I wrote it the other way!!
This should be documented, because it is a subtle, invisible performance killer that you would never know existed if you hadn't tried it both ways.
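To isolate the cost, a minimal micro-benchmark of just the two conversion forms (my own sketch, outside the primality code) could look like this:

```crystal
require "big"
require "benchmark"

Benchmark.ips do |x|
  # UInt128 literal converted to a BigInt
  x.report("u128 literal .to_big_i") { 232862364358497360900063316880507363070u128.to_big_i }
  # decimal string parsed into a BigInt on every call
  x.report("string .to_big_i")       { "232862364358497360900063316880507363070".to_big_i }
end
```

Both expressions produce the same `BigInt` value; only the per-call conversion work differs.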
To see the difference, run this code:
```crystal
require "big"
require "benchmark"

@[Link("gmp")]
lib LibGMP
  fun mpz_powm = __gmpz_powm(rop : MPZ*, base : MPZ*, exp : MPZ*, mod : MPZ*)
end

def powmodgmp(b, e, m)
  y = BigInt.new
  LibGMP.mpz_powm(y, b.to_big_i, e.to_big_i, m.to_big_i)
  y
end

def powmodint(b, e, m)
  r = typeof(m).new(1)
  b = b % m; b = typeof(m).new(b)
  while e > 0
    r = (r &* b) % m if e.odd?
    e >>= 1
    b = (b &* b) % m
  end
  r
end

def powmod(b, e, m)
  if m > UInt32::MAX
    r = powmodgmp(b, e, m)
  else
    n = m.to_u64
    r = powmodint(b, e, n)
  end
end

module Primes
  module Utils
    def prime1?(k = 5)
      primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103]
      return primes.includes? self if self <= primes.last
      # u128 literal converted to a BigInt
      modpn = 232862364358497360900063316880507363070u128.to_big_i
      return false if modpn.gcd(self.to_big_i) != 1
      return true if self < primes.last**2
      # Start Miller-Rabin primality test
      neg_one_mod = n = d = self - 1 # these are even as self is always odd
      d >>= d.trailing_zeros_count   # shift out factors of 2 to make d odd
      witnesses = k.times.map { rand(self - 4) + 2 }
      witnesses.each do |b|          # perform Miller-Rabin bases tests
        y = powmod(b, d, self)       # y = (b**d) mod self
        s = d                        # s is odd d here
        until y == 1 || y == neg_one_mod || s == n
          y = (y &* y) % self        # y = (y**2) mod self
          s <<= 1
        end
        return false unless y == neg_one_mod || s.odd?
      end
      true
    end

    def prime2?(k = 5)
      primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101, 103]
      return primes.includes? self if self <= primes.last
      # the culprit: string reparsed into a BigInt on every call
      modpn = "232862364358497360900063316880507363070".to_big_i
      return false if modpn.gcd(self.to_big_i) != 1
      return true if self < primes.last**2
      # Start Miller-Rabin primality test
      neg_one_mod = n = d = self - 1 # these are even as self is always odd
      d >>= d.trailing_zeros_count   # shift out factors of 2 to make d odd
      witnesses = k.times.map { rand(self - 4) + 2 }
      witnesses.each do |b|          # perform Miller-Rabin bases tests
        y = powmod(b, d, self)       # y = (b**d) mod self
        s = d                        # s is odd d here
        until y == 1 || y == neg_one_mod || s == n
          y = (y &* y) % self        # y = (y**2) mod self
          s <<= 1
        end
        return false unless y == neg_one_mod || s.odd?
      end
      true
    end

    def next_prime1
      n = self > UInt128::MAX - self ? self.to_big_i : typeof(1u128).new(self.to_u128)
      return (n >> 1) + 2 if n <= 2
      n = n + 1 | 1
      until (res = n % 6) & 0b11 == 1; n += 2 end
      inc = (res == 1) ? 4 : 2
      until n.prime1?; n += inc; inc ^= 0b110; end
      n
    end

    def next_prime2
      n = self > UInt128::MAX - self ? self.to_big_i : typeof(1u128).new(self.to_u128)
      return (n >> 1) + 2 if n <= 2
      n = n + 1 | 1
      until (res = n % 6) & 0b11 == 1; n += 2 end
      inc = (res == 1) ? 4 : 2
      until n.prime2?; n += inc; inc ^= 0b110; end
      n
    end
  end
end

struct Int; include Primes::Utils end

def test(n)
  Benchmark.ips do |x|
    x.report("next_prime1") { n.next_prime1 }
    x.report("next_prime2") { n.next_prime2 }
    puts
  end
end

n = 303_371_455_241
test n
n = 8446744073709551627
test n
n = 8446744073709551556
test n
n = 18446744073709551556u64
test n
n = 18446744073709551557u64
test n
n = "18446744073709551629".to_big_i
test n
n = 100_000_000_000_000
test n
n = 1_134_250_000_000_000_000
test n
n = "7000768250000000000000".to_big_i
test n
```
Here are the results.
```
next_prime1  23.84k ( 41.94µs) (± 0.36%)  48.3kB/op       fastest
next_prime2  14.01k ( 71.37µs) (± 0.48%)  77.0kB/op  1.70× slower
next_prime1 208.45k (  4.80µs) (± 0.33%)  3.16kB/op       fastest
next_prime2 155.85k (  6.42µs) (± 0.13%)  4.72kB/op  1.34× slower
next_prime1 220.73k (  4.53µs) (± 0.18%)  2.42kB/op       fastest
next_prime2 183.97k (  5.44µs) (± 0.10%)  3.28kB/op  1.20× slower
next_prime1 435.19k (  2.30µs) (± 0.05%)    696B/op       fastest
next_prime2 401.64k (  2.49µs) (± 0.12%)    872B/op  1.08× slower
next_prime1  62.89k ( 15.90µs) (± 0.15%)  15.9kB/op       fastest
next_prime2  49.47k ( 20.21µs) (± 0.18%)  20.0kB/op  1.27× slower
next_prime1 149.40k (  6.69µs) (± 0.18%)  3.65kB/op       fastest
next_prime2 122.91k (  8.14µs) (± 0.21%)  5.01kB/op  1.22× slower
next_prime1 295.98k (  3.38µs) (± 0.07%)  3.08kB/op       fastest
next_prime2 190.16k (  5.26µs) (± 0.30%)  4.97kB/op  1.56× slower
next_prime1 205.67k (  4.86µs) (± 0.28%)  3.27kB/op       fastest
next_prime2 159.23k (  6.28µs) (± 0.25%)  4.64kB/op  1.29× slower
next_prime1 198.48k (  5.04µs) (± 0.17%)  2.62kB/op       fastest
next_prime2 178.48k (  5.60µs) (± 0.10%)  3.13kB/op  1.11× slower
```
But I use `prime?` all over the place in my codebase, and when it's used inside loops in other methods, I saw a 3+ second regression in some other methods for certain inputs.
It pays to have a good test suite that sufficiently covers your use cases!
Here's the ultimate "fix" to this issue|problem.
What the test results showed is that `modpn` is being recalculated at runtime every time `prime?` is used. So what I want is for it to be a CONSTANT, computed only once instead of on every call. In Rust, et al, I can define constants inside the method, but Crystal doesn't allow that (hopefully in 2.0 it will), so I had to define it external to its use. So in the test code I did the following:
```crystal
module Primes
  module Utils
    # initialized once, instead of rebuilt on every prime? call
    MODPN1 = 232862364358497360900063316880507363070u128.to_big_i
    MODPN2 = "232862364358497360900063316880507363070".to_big_i
    # ... methods as before ...
  end
end
```

Then inside each version it's used as: `return false if MODPN1.gcd(self.to_big_i) != 1` (and likewise with `MODPN2` in `prime2?`).
This produces these much faster (more operations per time period), and essentially equal, results:
```
next_prime1  24.90k ( 40.16µs) (± 0.30%)  43.1kB/op  1.01× slower
next_prime2  25.05k ( 39.92µs) (± 0.21%)  43.1kB/op       fastest
next_prime1 212.58k (  4.70µs) (± 0.15%)  2.88kB/op       fastest
next_prime2 212.53k (  4.71µs) (± 0.11%)  2.88kB/op  1.00× slower
next_prime1 223.79k (  4.47µs) (± 0.15%)  2.27kB/op       fastest
next_prime2 222.61k (  4.49µs) (± 0.20%)  2.27kB/op  1.01× slower
next_prime1 438.20k (  2.28µs) (± 0.05%)    664B/op       fastest
next_prime2 436.85k (  2.29µs) (± 0.07%)    664B/op  1.00× slower
next_prime1  65.07k ( 15.37µs) (± 0.07%)  15.2kB/op  1.00× slower
next_prime2  65.34k ( 15.30µs) (± 0.08%)  15.2kB/op       fastest
next_prime1 153.43k (  6.52µs) (± 0.07%)  3.39kB/op  1.01× slower
next_prime2 154.23k (  6.48µs) (± 0.10%)  3.39kB/op       fastest
next_prime1 309.33k (  3.23µs) (± 0.07%)  2.73kB/op       fastest
next_prime2 308.97k (  3.24µs) (± 0.08%)  2.73kB/op  1.00× slower
next_prime1 210.96k (  4.74µs) (± 0.10%)  3.01kB/op       fastest
next_prime2 210.67k (  4.75µs) (± 0.15%)  3.01kB/op  1.00× slower
next_prime1 198.40k (  5.04µs) (± 0.12%)  2.52kB/op  1.01× slower
next_prime2 199.47k (  5.01µs) (± 0.09%)  2.52kB/op       fastest
```
Again, this kind of performance enhancing|degrading information needs to be documented.
People coming from Ruby, etc., with no experience writing optimally performant code for compiled languages in general, and Crystal specifically, may run into performance issues and think Crystal is slower than some other language, when really it's them not knowing the tricks to writing optimally performant Crystal.
So maybe the title of this issue should be changed to something like "Performance enhancing issues", so when people do searches it may catch their attention better.
Crystal 1.9.x
Before `to_u128` worked on `BigInt`s I did this: `typeof(1u128).new(self.to_s.to_u128)` (the `next_prime1` form in the first listing). After making `to_u128` work with `BigInt`s in 1.9.x I then used this: `typeof(1u128).new(self.to_u128)`. Benchmarking 1.9.x in my large codebase shows the former is demonstrably faster.
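A minimal sketch isolating just those two conversion paths (my reconstruction, assuming the `next_prime1`/`next_prime2` forms from the first listing):

```crystal
require "big"
require "benchmark"

n = "18446744073709551629".to_big_i # a BigInt just above UInt64::MAX

Benchmark.ips do |x|
  # former: round-trip BigInt -> String -> UInt128
  x.report("to_s.to_u128") { typeof(1u128).new(n.to_s.to_u128) }
  # latter: direct BigInt#to_u128 (works as of Crystal 1.9.x)
  x.report("to_u128")      { typeof(1u128).new(n.to_u128) }
end
```

Both paths yield the same `UInt128` value, so any difference the report shows is pure conversion overhead.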