Performance differences using to_u128

Crystal 1.9.x

Before having to_u128 to work on BigInts I did this:

typeof(1u128).new(self.to_s.to_u128)

After making to_u128 work with BinInts in 1.9.x I then used this.

typeof(1u128).new(self.to_u128)

Benchmarking 1.9.x in my large codebase shows the former is demonstratively faster.

Something weird is going on that I haven't tracked down why yet.

Below is a test suite for the method that shows the performance difference in my codebase. If you put it in a file and run as: crystal run --release <filename.cr> the benchmarks show no performance difference for this method, but when run in the entire code it does. Let me see if I can figure out what's going on, because it's not a slight difference in the codebase.

require "big"
require "benchmark"

@[Link("gmp")]
lib LibGMP
  fun mpz_powm = __gmpz_powm(rop : MPZ*, base : MPZ*, exp : MPZ*, mod : MPZ*)
end

def powmodgmp(b, e, m)
  y = BigInt.new
  LibGMP.mpz_powm(y, b.to_big_i, e.to_big_i, m.to_big_i)
  y
end

def powmodint(b, e, m)
  r = typeof(m).new(1)
  b = b % m;  b = typeof(m).new(b)
  while e > 0
    r = (r &* b) % m if e.odd?
    e >>= 1
    b = (b &* b) % m
  end
  r
end

def powmod(b, e, m)
  if m > UInt32::MAX
    r = powmodgmp(b, e, m)
  else
    n = m.to_u64
    r = powmodint(b, e, n)
  end
end

module Primes
  module Utils
    def prime? (k = 5)
      primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103]
      return primes.includes? self if self <= primes.last
      modpn = 232862364358497360900063316880507363070u128.to_big_i
      return false if modpn.gcd(self.to_big_i) != 1
      return true  if self < primes.last**2
      # Start Miller-Rabin primality test
      neg_one_mod = n = d = self - 1 # these are even as self is always odd
      d >>= d.trailing_zeros_count   # shift out factors of 2 to make d odd
      witnesses = k.times.map{ rand(self - 4) + 2 }
      witnesses.each do |b|          # perform Miller-Rabin bases tests
        y = powmod(b, d, self)       # y = (b**d) mod self
        s = d                        # s is odd d here
        until y == 1 || y == neg_one_mod || s == n
          y = (y &* y) % self        # y = (y**2) mod self
          s <<= 1
        end
        return false unless y == neg_one_mod || s.odd?
      end
      true
    end

    def next_prime1
      n = self > UInt128::MAX - self ? self.to_big_i : typeof(1u128).new(self.to_s.to_u128)
      return (n >> 1) + 2 if n <= 2
      n = n + 1 | 1
      until (res = n % 6) & 0b11 == 1; n += 2 end
      inc = (res == 1) ? 4 : 2
      until n.prime?; n += inc; inc ^= 0b110; end
      n
    end

    def next_prime2
      n = self > UInt128::MAX - self ? self.to_big_i : typeof(1u128).new(self.to_u128)
      return (n >> 1) + 2 if n <= 2
      n = n + 1 | 1
      until (res = n % 6) & 0b11 == 1; n += 2 end
      inc = (res == 1) ? 4 : 2
      until n.prime?; n += inc; inc ^= 0b110; end
      n
    end
  end
end
struct Int; include Primes::Utils end

def test(n)
  Benchmark.ips do |x|
    x.report("next_prime1") { n.next_prime1 }
    x.report("next_prime2") { n.next_prime2 }
    puts
  end
end

n = 303_371_455_241
test n

n = 8446744073709551627
test n

n = 8446744073709551556
test n

n = 18446744073709551556u64
test n

n = 18446744073709551557u64
test n

n = "18446744073709551629".to_big_i
test n

n = 100_000_000_000_000
test n

n = 1_134_250_000_000_000_000
test n

n = "7000768250000000000000".to_big_i
test n

Here were outputs on my system.

next_prime1  23.08k ( 43.33µs) (± 0.99%)  48.3kB/op   1.00× slower
next_prime2  23.08k ( 43.32µs) (± 0.21%)  48.3kB/op        fastest

next_prime1 201.25k (  4.97µs) (± 0.26%)  3.22kB/op   1.01× slower
next_prime2 202.92k (  4.93µs) (± 0.29%)  3.17kB/op        fastest

next_prime1 214.01k (  4.67µs) (± 0.31%)  2.47kB/op   1.02× slower
next_prime2 217.42k (  4.60µs) (± 0.15%)  2.42kB/op        fastest

next_prime1 419.50k (  2.38µs) (± 0.93%)  745B/op   1.03× slower
next_prime2 432.64k (  2.31µs) (± 0.25%)  696B/op        fastest

next_prime1  61.90k ( 16.15µs) (± 0.17%)  16.0kB/op   1.00× slower
next_prime2  62.07k ( 16.11µs) (± 0.15%)  15.9kB/op        fastest

next_prime1 143.25k (  6.98µs) (± 0.24%)  3.69kB/op   1.03× slower
next_prime2 147.22k (  6.79µs) (± 0.21%)  3.64kB/op        fastest

next_prime1 281.57k (  3.55µs) (± 0.20%)  3.11kB/op   1.02× slower
next_prime2 286.52k (  3.49µs) (± 0.17%)  3.08kB/op        fastest

next_prime1 199.42k (  5.01µs) (± 0.16%)  3.31kB/op   1.01× slower
next_prime2 202.11k (  4.95µs) (± 0.18%)  3.26kB/op        fastest

next_prime1 185.87k (  5.38µs) (± 0.19%)  2.67kB/op   1.03× slower
next_prime2 191.75k (  5.22µs) (± 0.22%)  2.62kB/op        fastest

OK, I found the culprit.

In the code for prime? I originally had modpn written as this:

modpn = 232862364358497360900063316880507363070u128.to_big_i

Then I was wondering (couldn't remember) why I didn't write it as this"

modpn = "232862364358497360900063316880507363070".to_big_i

So I changed it, but didn't benchmark it before I made the other change. But this is the culprit, and this one little change makes a humongous performance regression. Then I remembered why I wrote it the other way!!

This should be documented, because this is an subtle invisible performance killer that you never would know existed if you hadn't tried it both ways.

To see the difference run this code:

require "big"
require "benchmark"

@[Link("gmp")]
lib LibGMP
  fun mpz_powm = __gmpz_powm(rop : MPZ*, base : MPZ*, exp : MPZ*, mod : MPZ*)
end

def powmodgmp(b, e, m)
  y = BigInt.new
  LibGMP.mpz_powm(y, b.to_big_i, e.to_big_i, m.to_big_i)
  y
end

def powmodint(b, e, m)
  r = typeof(m).new(1)
  b = b % m;  b = typeof(m).new(b)
  while e > 0
    r = (r &* b) % m if e.odd?
    e >>= 1
    b = (b &* b) % m
  end
  r
end

def powmod(b, e, m)
  if m > UInt32::MAX
    r = powmodgmp(b, e, m)
  else
    n = m.to_u64
    r = powmodint(b, e, n)
  end
end

module Primes
  module Utils
    def prime1? (k = 5)
      primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103]
      return primes.includes? self if self <= primes.last
      modpn = 232862364358497360900063316880507363070u128.to_big_i
      return false if modpn.gcd(self.to_big_i) != 1
      return true  if self < primes.last**2
      # Start Miller-Rabin primality test
      neg_one_mod = n = d = self - 1 # these are even as self is always odd
      d >>= d.trailing_zeros_count   # shift out factors of 2 to make d odd
      witnesses = k.times.map{ rand(self - 4) + 2 }
      witnesses.each do |b|          # perform Miller-Rabin bases tests
        y = powmod(b, d, self)       # y = (b**d) mod self
        s = d                        # s is odd d here
        until y == 1 || y == neg_one_mod || s == n
          y = (y &* y) % self        # y = (y**2) mod self
          s <<= 1
        end
        return false unless y == neg_one_mod || s.odd?
      end
      true
    end

def prime2? (k = 5)
      primes = [2,3,5,7,11,13,17,19,23,29,31,37,41,43,47,53,59,61,67,71,73,79,83,89,97,101,103]
      return primes.includes? self if self <= primes.last
      modpn = "232862364358497360900063316880507363070".to_big_i
      return false if modpn.gcd(self.to_big_i) != 1
      return true  if self < primes.last**2
      # Start Miller-Rabin primality test
      neg_one_mod = n = d = self - 1 # these are even as self is always odd
      d >>= d.trailing_zeros_count   # shift out factors of 2 to make d odd
      witnesses = k.times.map{ rand(self - 4) + 2 }
      witnesses.each do |b|          # perform Miller-Rabin bases tests
        y = powmod(b, d, self)       # y = (b**d) mod self
        s = d                        # s is odd d here
        until y == 1 || y == neg_one_mod || s == n
          y = (y &* y) % self        # y = (y**2) mod self
          s <<= 1
        end
        return false unless y == neg_one_mod || s.odd?
      end
      true
    end

    def next_prime1
      n = self > UInt128::MAX - self ? self.to_big_i : typeof(1u128).new(self.to_u128)
      return (n >> 1) + 2 if n <= 2
      n = n + 1 | 1
      until (res = n % 6) & 0b11 == 1; n += 2 end
      inc = (res == 1) ? 4 : 2
      until n.prime1?; n += inc; inc ^= 0b110; end
      n
    end

    def next_prime2
      n = self > UInt128::MAX - self ? self.to_big_i : typeof(1u128).new(self.to_u128)
      return (n >> 1) + 2 if n <= 2
      n = n + 1 | 1
      until (res = n % 6) & 0b11 == 1; n += 2 end
      inc = (res == 1) ? 4 : 2
      until n.prime2?; n += inc; inc ^= 0b110; end
      n
    end
  end
end
struct Int; include Primes::Utils end

def test(n)
  Benchmark.ips do |x|
    x.report("next_prime1") { n.next_prime1 }
    x.report("next_prime2") { n.next_prime2 }
    puts
  end
end

n = 303_371_455_241
test n

n = 8446744073709551627
test n

n = 8446744073709551556
test n

n = 18446744073709551556u64
test n

n = 18446744073709551557u64
test n

n = "18446744073709551629".to_big_i
test n

n = 100_000_000_000_000
test n

n = 1_134_250_000_000_000_000
test n

n = "7000768250000000000000".to_big_i
test n

Here are the results.

next_prime1  23.84k ( 41.94µs) (± 0.36%)  48.3kB/op        fastest
next_prime2  14.01k ( 71.37µs) (± 0.48%)  77.0kB/op   1.70× slower

next_prime1 208.45k (  4.80µs) (± 0.33%)  3.16kB/op        fastest
next_prime2 155.85k (  6.42µs) (± 0.13%)  4.72kB/op   1.34× slower

next_prime1 220.73k (  4.53µs) (± 0.18%)  2.42kB/op        fastest
next_prime2 183.97k (  5.44µs) (± 0.10%)  3.28kB/op   1.20× slower

next_prime1 435.19k (  2.30µs) (± 0.05%)  696B/op        fastest
next_prime2 401.64k (  2.49µs) (± 0.12%)  872B/op   1.08× slower

next_prime1  62.89k ( 15.90µs) (± 0.15%)  15.9kB/op        fastest
next_prime2  49.47k ( 20.21µs) (± 0.18%)  20.0kB/op   1.27× slower

next_prime1 149.40k (  6.69µs) (± 0.18%)  3.65kB/op        fastest
next_prime2 122.91k (  8.14µs) (± 0.21%)  5.01kB/op   1.22× slower

next_prime1 295.98k (  3.38µs) (± 0.07%)  3.08kB/op        fastest
next_prime2 190.16k (  5.26µs) (± 0.30%)  4.97kB/op   1.56× slower

next_prime1 205.67k (  4.86µs) (± 0.28%)  3.27kB/op        fastest
next_prime2 159.23k (  6.28µs) (± 0.25%)  4.64kB/op   1.29× slower

next_prime1 198.48k (  5.04µs) (± 0.17%)  2.62kB/op        fastest
next_prime2 178.48k (  5.60µs) (± 0.10%)  3.13kB/op   1.11× slower

But I use prime? all over the place in my codebase, and when it's used inside loops in other methods, I saw a 3+ second regression in some other methods for certain inputs.

It pays to have to a good test suite that sufficiently covers your use cases!

Here's the ultimate "fix" to this issue|problem

What the test results showed is that modpn is being recalculated at runtime whenever prime? is used. So what I want is for it to be made CONSTANT during compilation.

In Rust, et al, I can define constants inside the method, but Crystal doesn't allow that (hopefully in 2.0 it will), so I had to do it external to its use. So in the test code I did the following:

module Primes
  module Utils
    MODPN1 = 232862364358497360900063316880507363070u128.to_big_i
    MODPN2 = "232862364358497360900063316880507363070".to_big_i

Then inside each version used as: return false if MODPN[1|2].gcd(self.to_big_i) != 1 This produces these much faster (more operations per time period), and equal results.

next_prime1  24.90k ( 40.16µs) (± 0.30%)  43.1kB/op   1.01× slower
next_prime2  25.05k ( 39.92µs) (± 0.21%)  43.1kB/op        fastest

next_prime1 212.58k (  4.70µs) (± 0.15%)  2.88kB/op        fastest
next_prime2 212.53k (  4.71µs) (± 0.11%)  2.88kB/op   1.00× slower

next_prime1 223.79k (  4.47µs) (± 0.15%)  2.27kB/op        fastest
next_prime2 222.61k (  4.49µs) (± 0.20%)  2.27kB/op   1.01× slower

next_prime1 438.20k (  2.28µs) (± 0.05%)  664B/op        fastest
next_prime2 436.85k (  2.29µs) (± 0.07%)  664B/op   1.00× slower

next_prime1  65.07k ( 15.37µs) (± 0.07%)  15.2kB/op   1.00× slower
next_prime2  65.34k ( 15.30µs) (± 0.08%)  15.2kB/op        fastest

next_prime1 153.43k (  6.52µs) (± 0.07%)  3.39kB/op   1.01× slower
next_prime2 154.23k (  6.48µs) (± 0.10%)  3.39kB/op        fastest

next_prime1 309.33k (  3.23µs) (± 0.07%)  2.73kB/op        fastest
next_prime2 308.97k (  3.24µs) (± 0.08%)  2.73kB/op   1.00× slower

next_prime1 210.96k (  4.74µs) (± 0.10%)  3.01kB/op        fastest
next_prime2 210.67k (  4.75µs) (± 0.15%)  3.01kB/op   1.00× slower

next_prime1 198.40k (  5.04µs) (± 0.12%)  2.52kB/op   1.01× slower
next_prime2 199.47k (  5.01µs) (± 0.09%)  2.52kB/op        fastest

Again, this kind of performance enhancing|degrading information needs to be documented.

For people coming from Ruby, etc, with no experience writing optimum performant code for compiled languages in general, and Crystal specifically, may run into performance issues and think Crystal is slower than some other language, when it was them not knowing the tricks to writing optimum performant Crystal.

So maybe the title of this issue should be change to something like: Performance enhancing issues so when people do searches it may catch their attention better.

crystal-lang / crystal

Performance differences using to_u128 #13692