
improve prime sieve performance #11594

Closed StefanKarpinski closed 8 years ago

StefanKarpinski commented 9 years ago

Someone emailed me directly with substantial performance improvements to our prime sieve implementation. Unfortunately, they didn't give permission to use the code under the MIT license and haven't replied to me, so I'm going to post the non-code parts of their email so that someone else who hasn't seen their code can do an independent implementation of these improvements.

  • Instead of the for loop running to n, we only need to go up to isqrt(n)
  • Further, we can loop only over odd values of p, and set only the odd multiples of p, starting at p^2, to false.
  • Not finished yet – we can vectorise the last loop, with a further gain
  • Finally, we can use the facts that
    • all large primes are of the form 6k±1
    • most integers are composite, so it is better to set all flags to false initially, then invert for possible primes

It would have been nice to just have a pull request fixing this (with an implied MIT license) since this individual went to the trouble of making and benchmarking these improvements, but instead here we are.

StefanKarpinski commented 9 years ago

Ping @aiorla since you seem interested in the primes functionality.

aiorla commented 9 years ago

Thanks for the ping @StefanKarpinski! Actually I didn't look into primesmask much when I did the factor push, but from what you describe it seems they're talking about a (really) old version of Julia, aren't they? (now there is no loop running to n, we use the Sieve of Atkin, falses(n)...)

PS: I'm in finals right now, but when I finish (23rd of June) I will try to do a definitive primes.jl PR to get a competitive implementation of factor (the main problem currently is the slow powermod for Int128 (#3036), which makes it unusable, so I will try to fix that). Then I'll move on to another (more relevant?) part of the code! (maybe I'll ask you for suggestions :smile:)

StefanKarpinski commented 9 years ago

I'm pretty sure the comments were relevant to this version.

aiorla commented 9 years ago

I'm sorry, I don't get it then. :disappointed: In the current julia: all the for loops in the function run over 1:r (where r = isqrt(n)), it starts by flagging all the integers as not prime (primesmask(n::Int) = primesmask(falses(n))), and the fact that large primes are 6k±1 is implicit in the Sieve of Atkin.

Do you mean that he has implemented a different algorithm (I suppose a version of the Sieve of Eratosthenes with those improvements) and that it beats master?

VicDrastik commented 9 years ago

Stefan, I am the 'someone' who emailed last week about the sieve. I replied to your reply last Tuesday, but the email bounced and I was only notified about this yesterday. I emailed you yet again yesterday - still waiting for a reply from you. One (or both) of us is having email problems - if you didn't get my email from yesterday, please send me another working email address for yourself.

The purpose of my first email was not merely to speed up the sieve, but to improve primes.jl in other ways as well. The code I sent was just an initial proposal and thus a basis for discussion.

ViralBShah commented 9 years ago

@VicDrastik Why not just submit a PR right here - which will certainly be well received and also get feedback from others?

StefanKarpinski commented 9 years ago

@VicDrastik, sorry about the email issues – whether they are on your end or mine. Since email doesn't seem to be a reliable medium of communication in this case, would you mind submitting a pull request? Can you address some of @aiorla's confusion about which version of the number sieve you improved?

VicDrastik commented 9 years ago

Since there is so much interest in the code, here it is:

function e4(s::AbstractVector{Bool})
    n = length(s)
    s[:] = true; s[1] = false            # assume all prime; 1 is not prime
    s[4:2:n] = false                     # even numbers > 2 are composite
    for p = 3:2:isqrt(n)                 # odd candidates up to isqrt(n)
        if s[p] s[p*p:2p:n] = false end  # clear odd multiples of p, from p^2
    end
end

And here is the fastest version:

function e5(s::AbstractVector{Bool})
    n = length(s)
    s[:] = false                         # assume everything composite
    s[2] = true
    s[3] = true
    s[5:6:n] = true                      # candidates of the form 6k-1
    s[7:6:n] = true                      # candidates of the form 6k+1
    for p = 5:2:isqrt(n)
        if s[p] s[p*p:2p:n] = false end  # clear odd multiples of p, from p^2
    end
end

Feel free to make any use of this you like - the only new idea here is that in e5() I first set all flags to composite, then set the flags for possible primes to true; the opposite of the usual method, as exemplified by e4() above.

But primes.jl does not export the sieve function primesmask(), and I think it should. In fact, primes.jl should provide 2 sieve functions - a standard sieve as above, but starting at 'a' and running from a+1 to a+n, and a wheel sieve, which is a much more compact sieve. Storing a list of all 32-bit primes takes about 811 MB if they are stored as integers, 537 MB if stored as a binary sieve as above, but only 143 MB if stored as a wheel sieve. Binary sieves are useful when working with composites, but wheel sieves are more useful when working with primes, so it would be nice if Julia provided both.
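
For illustration, a minimal sketch of the mod-30 wheel encoding behind the 143 MB figure (hypothetical helper names, not code from this thread): each block of 30 integers contains exactly 8 values coprime to 30, so one byte of flags covers 30 integers and the full 32-bit range fits in 2^32/30 ≈ 143 MB.

const WHEEL30 = [1, 7, 11, 13, 17, 19, 23, 29]   # residues mod 30 coprime to 30

wheelbytes(n) = div(n, 30) + 1      # bytes needed for a wheel sieve up to n

# byte index and bit number of an integer m with gcd(m, 30) == 1
function wheelindex(m)
    byte = div(m, 30) + 1
    bit = findfirst(WHEEL30, mod(m, 30))   # old-style findfirst(collection, value)
    return byte, bit
end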

VicDrastik commented 9 years ago

Since Stefan is happy to discuss this topic publicly, here is the second half of my second email to him:

Ideally, I would like to re-design and re-write primes.jl entirely. At present, primesmask() is not exported, but I need something like it for my work. I would like to have a sieve function that starts not at 1, but at 'a', and sieves the integers a+1:a+n i.e. primesmask(s,a) returns a sieve of length n starting at base a+1.
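
A minimal sketch of what such a ranged sieve could look like (hypothetical name and signature, following the proposal above; s[i] ends up true exactly when a+i is prime):

function primesmask_from(s::AbstractVector{Bool}, a::Integer)
    n = length(s)
    fill!(s, true)                            # assume a+1 .. a+n all prime
    for p = 2:isqrt(a + n)
        # first multiple of p that is both > a and >= p^2
        q = max(p * p, p * (div(a, p) + 1))
        for m = q:p:(a + n)
            s[m - a] = false
        end
    end
    a < 1 && (s[1 - a] = false)               # 1 is not prime
    return s
end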

Instead of factor(n), which returns a dictionary of the factors (and their multiplicities) of n, I would like a function called trialdivision(n,limit=isqrt(n)) which simply returns a (sorted) array of prime factors of n less than or equal to limit. If limit is omitted, we just get a list of all (small) prime factors of n, which can then trivially be used to create the same dictionary that factor() now returns. Trialdivision() would not call primes(), but instead divide first by 2 and 3, and then by 6k-1 and 6k+1 for k=1,2,3,.... This should be both simpler and faster than the current function.
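
A minimal sketch of the proposed trialdivision (hypothetical code following the description above, not an existing function): it divides by 2 and 3 and then by the 6k±1 candidates; composite candidates such as 25 are harmless because their prime factors have already been divided out.

function trialdivision(n::Integer, limit::Integer=isqrt(n))
    factors = typeof(n)[]
    for p in (2, 3)
        while p <= limit && n % p == 0
            push!(factors, p)
            n = div(n, p)
        end
    end
    p = 5                          # try 6k-1 and 6k+1 for k = 1, 2, 3, ...
    while p <= limit
        for q in (p, p + 2)
            while q <= limit && n % q == 0
                push!(factors, q)
                n = div(n, q)
            end
        end
        p += 6
    end
    return factors                 # sorted, with multiplicities
end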

isprime() also needs improvement. First, it should do trial division up to some limit (usually n^(1/4)). If n < 2^32, isprime() now does Miller-Rabin with the 3 bases 2, 7, and 61, but I have found a way to do it with only 2 M-R tests.

Algorithm

[1] do a M-R base 2 test

[2] do a M-R test with the smallest odd number with the opposite Jacobi symbol to 2 i.e. use smallest odd a such that J(2,n)*J(a,n) = -1 (one of these will be a quadratic residue mod n, and the other will not (this is actually not quite correct, but the idea works!) )

[3] This leaves only 42 pseudoprimes. 13 of these are of the form (x+1)(2x+1) and 19 are of the form (x+1)(4x+1), which can be quickly tested for with a single square root (that reminds me: we also need to add a perfect square test to primes.jl: isperfectsquare(n) = (n == isqrt(n)^2))

[4] This leaves only 10 composite values (953*2381, 601*100801, 109*541*2269, 6703*53617, 4723*113329, 7369*93329, 421*701*2381, 14621*87721, 151*751*28351, 13381*240841) which can be eliminated using gcd or simply by dividing by the smallest factor of each or even by simple table search.
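
As an illustration of step [2], a minimal sketch of the base selection (jacobisymbol is an assumed helper here; one is posted later in this thread; n must be odd and not a perfect square or the loop never terminates):

function second_base(n)
    j2 = jacobisymbol(2, n)
    a = 3                          # smallest odd a with J(2,n)*J(a,n) == -1
    while jacobisymbol(a, n) * j2 != -1
        a += 2
    end
    return a
end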

The same trick can be used to reduce the number of M-R tests for 64-bit integers from 7 to 6, but I would not do it that way. Khashin, in his 2013 paper (Counterexamples for Frobenius primality test, arxiv.org/abs/1307.7920 ) has shown that a single Frobenius test (which takes the same time as 3 M-R tests) has no pseudoprimes less than 2^60. I have run this test on Feitsma's list of M-R(2) 64-bit pseudoprimes and found no counterexample. So the algorithm would be:

[1] trial divide up to some optimal value (at least 17)
[2] do a M-R base 2 test
[3] do a Frobenius test

In the worst case, this takes 4 M-R, rather than 7 M-R as in the current isprime().

I would be happy to do this work myself if you like.

aiorla commented 9 years ago

I'm not an expert in number theory nor a competent julia programmer, but since I've recently spent some time in primes.jl I would like to state my position on some of the ideas you propose.

primesmask() / primes(n)

factor(n)

isprime(n)

ironman353 commented 9 years ago

I am not familiar with Julia. Can you please post some benchmark results of your prime-sieve algorithm? Usually a good segmented sieve takes O(n log log n) operations and uses O(sqrt(n)) space. Pi(n) = number of primes up to the limit n. Run-times of the Java code I wrote a few days back for generating all the primes up to a limit:

Pi(10^7) = 664579 in 0.046 seconds
Pi(10^8) = 5761455 in 0.240 seconds
Pi(10^9) = 50847534 in 2.313 seconds
Pi(10^10) = 455052511 in 27.612 seconds
Pi(10^11) = 4118054813 in 358.159 seconds
Pi(10^12) = 37607912018 in 5376.441 seconds

aiorla commented 9 years ago

Ok, so I was curious and did a bit of testing and I was quite wrong about our sieve speed. (as I said I was maybe mistaken :disappointed:)

e5(10^7) => 0.055 s - julia = 0.236 s
e5(10^8) => 0.7 s - julia = 4.96 s
e5(10^9) => 8.443 s - julia = 62.35 s

StefanKarpinski commented 9 years ago

@VicDrastik, I'm in favor of your proposed improvements – a pull request would be great. Let us know if you have any issues or problems.

ironman353 commented 9 years ago

Here is an improved version of segmented sieve (along with the benchmark). It works fine for any limit up to at least 10^12. @StefanKarpinski, @VicDrastik, @aiorla, Any idea for further improvement will be appreciated.

# Simple (non-segmented) sieve of Eratosthenes: returns the number of
# primes below MAX.
function Eratosthenes_1(MAX::Int)
  pi::Int64 = 0
  isPrime::Array{Bool} = trues(MAX)
  j::Int = 0
  SQRT_MAX::Int = isqrt(MAX)
  for i::Int = 2 : SQRT_MAX
    if isPrime[i]
      j = i * i               # composites below i*i are already marked
      while j < MAX
        isPrime[j] = false
        j += i
      end
    end
  end
  for i::Int = 2 : MAX - 1    # count the surviving flags
    if isPrime[i]
      pi += 1
    end
  end
  return pi
end

# Sieve the primes up to isqrt(B)+1 into a fixed-size array; the prime
# count is stashed in the last slot, primes[5150].
function smallPrimes(B::Int64)
  MAX::Int = isqrt(B) + 1
  pi::Int = 0
  isPrime::Array{Bool} = trues(MAX)
  primes::Array{Int64} = zeros(5150)
  j::Int = 0
  SQRT_MAX::Int = isqrt(MAX)
  for i::Int = 2 : SQRT_MAX
    if isPrime[i]
      j = i * i
      while j < MAX
        isPrime[j] = false
        j += i
      end
    end
  end
  for i::Int = 2 : MAX - 1
    if isPrime[i]
      pi += 1
      primes[pi] = i
    end
  end
  primes[5150] = pi
  return primes
end

# Count the primes in [A, B] by sieving that segment with the small primes
# up to isqrt(B).
function countLargePrimes(A::Int64, B::Int64)
  primes = smallPrimes(B)
  largePrimes::Array{Bool} = trues(B - A + 1)
  lim::Int64 = isqrt(B) + 1
  p::Int64 = 0
  s::Int64 = 0
  q::Int64 = primes[5150]      # number of small primes found
  count::Int64 = 0
  for i::Int64 = 1 : q
    p = primes[i]
    if p > lim
      break
    end
    s = div(A, p) * p          # largest multiple of p not exceeding A
    for j::Int64 = s : p : B
      if j < A
        continue
      end
      largePrimes[j - A + 1] = false
    end
  end
  for i::Int = 1 : q
    p = primes[i]
    if p > B
      break
    end
    if p >= A && p <= B
      count += 1
    end
  end
  for i::Int = 1 : B - A + 1
    if i + A <= q
      continue
    end
    if largePrimes[i]
      count += 1
    end
  end
  return count
end

# Segmented sieve: counts the primes up to MAX, using the simple sieve for
# small inputs and countLargePrimes() for the tail above the last full segment.
function full_sieve(MAX::Int64)
  if MAX <= 1000000
    return Eratosthenes_1(MAX)
  end
  actualLimit::Int64 = MAX
  count::Int64 = 1
  sep::Int = 50000
  limit::Int64 = floor(actualLimit / 100000) * 100000
  smsize::Int = isqrt(limit + 1)
  sievelimit::Int = isqrt(smsize)
  smallsieve::Array{Bool} = trues(smsize + 1)
  i::Int64 = 0
  j::Int64 = 0
  ptop::Int64 = 0
  p::Int64 = 0
  offset::Int64 = 0
  step::Int = 5
  i = 4
  while i <= smsize
    smallsieve[i] = false
    i += 2
  end
  i = 3
  while i <= sievelimit
    if smallsieve[i]
      j = i * i
      while j <= smsize
        smallsieve[j] = false
        j += (i + i)
      end
    end
    i += 2
  end
  ptop = 1
  primes::Array{Int} = zeros(smsize + 1)
  index::Array{Int64} = zeros(smsize + 1)
  primes[1] = 2
  index[1] = 2
  for idx::Int64 = 3 : smsize
    if smallsieve[idx]
      ptop += 1
      primes[ptop] = idx
      index[ptop] = (idx * idx) >> 1
    end
  end
  bigsieve::Array{Bool} = trues(sep)
  while offset < limit              # process one segment per iteration
    if offset % 1000000000 == 0
      println(offset)
    end
    fill!(bigsieve, true)
    for idx::Int = 2 : ptop
      if index[idx] < sep
        j = index[idx]
        bigsieve[j + 1] = false
        p = primes[idx]
        j += p;
        while j < sep
          bigsieve[j + 1] = false
          j += p;
        end
        index[idx] = j;
      end
    end
    for idx::Int64 = 0 : step - 1
      i = idx
      if (2 * i + 1) % 5 == 0
        continue
      end
      #prime::Int64 = 0
      while i < sep
        if (bigsieve[i + 1]) && ((offset + i >> 1 + 1) < limit)
          # prime = offset + (i >> 1) + 1
          count += 1
        end
        i += step;
      end
    end
    for idx::Int64 = 1 : ptop + 1
      index[idx] -= sep;
    end
    offset += (sep + sep)
  end
  if actualLimit > limit
    count += countLargePrimes(limit + 1, actualLimit)
  end
  return count
end

@time full_sieve(10000000000);

# include("E:/Julia/full_sieve_2.jl")

# Pi(10^7)  = 664579 in 0.108 seconds
# Pi(10^8)  = 5761455 in 0.329 seconds
# Pi(10^9)  = 50847534 in 2.726 seconds
# Pi(10^10) = 455052511 in 29.697 seconds
# Pi(10^11) = 4118054813 in 387.659 seconds
# Pi(10^12) = 37607912018 in 5783.603 seconds

danaj commented 9 years ago

@VicDrastik,

You can do deterministic 32-bit primality testing with a single M-R test using a small table of 256 shorts. Alternatively, just use BPSW, which will be 1 Selfridge for most numbers and 2.5-3 for primes (vs. 1 for all numbers with the hash, or 1 for most / 2 for primes with yours, plus the tests for counterexamples).

Khashin's paper is interesting, but be aware that his Frobenius test doesn't obviously match either Grantham or Crandall/Pomerance. I'm quite interested if you can explain the mapping or show the code you used. There are certainly Frobenius pseudoprimes based on Grantham, C&P, etc. But that's really a digression, as there is a better solution.

Use the strong, "almost extra strong", or extra strong Lucas test after a M-R base-2 test. This is the BPSW test that has been in use since 1980. It's the one used by Pari, Maple, Mathematica, Sage, FLINT, Perl/ntheory, and others. It is deterministic to 2^64 and will be faster than the Frobenius test. The total cost is ~ 2.5 - 3 Selfridges (1 for the M-R test, 1.5 - 2 for the Lucas test). It also has the advantage of having much more use behind it in other software compared to Khashin's Frobenius test.

Some software likes to add some extra tests for >64-bit. Mathematica adds a M-R base 3, Perl/ntheory adds 1-5 random-base M-R tests depending on size. It's debatable how much that really buys you, considering nobody has yet found the first BPSW counterexample (even when weakening the test by removing the "strong" part of both tests).

VicDrastik commented 9 years ago

ironman353, your sieve's benchmarks are much slower than mine. Here are the timings on a sieve of size 160,000,000:

julia> @time e5(s)
elapsed time: 5.289495638 seconds (24260 bytes allocated)

julia> @time Eratosthenes_1(n)
elapsed time: 15.940241054 seconds (180459292 bytes allocated, 0.65% gc time)

julia> @time Eratosthenes_1(n)
elapsed time: 15.909622857 seconds (180000436 bytes allocated, 0.66% gc time)

julia> @time Eratosthenes_1(n)
elapsed time: 15.95310164 seconds (180000436 bytes allocated, 0.63% gc time)

It takes 3 times as long, and uses lots more space.

VicDrastik commented 9 years ago

danaj,

When you say "small table of 256 shorts" I presume you are referring to the paper of Forisek and Jancina. But "small" still is 512 bytes, which is a high price to pay for a tiny increase in speed. And this speedup only occurs for the primes (and a handful of 2-pseudoprimes), which comprise less than 5% of 32-bit integers. Not worth it.

BPSW is harder to code correctly, and slower to run than my idea. Also, you need to add a squareness test before you do the Lucas part. Again, not worth it.

You said "his Frobenius test doesn't obviously match either Grantham or Crandall/Pomerance" - that is actually the point. Grantham et. al. give upper bounds on the error rate of their tests ( 1/7710 for Grantham, for example) and hope for the best. Khashin's error rate is provably zero for n<2^60 and he implies that it is also probably zero for some larger n.

If you want to check Khashin's test yourself, download Feitsma's list of pseudoprimes from here http://www.cecm.sfu.ca/Pseudoprimes/index-2-to-64.html or a much shorter list of strong MR(2) pseudoprimes from here https://miller-rabin.appspot.com/

You will need code to find the Jacobi symbol - here is mine:

function jacobisymbol(a::Integer, n::Integer)
    if a == 0 return 0 end
    # set j depending on sign of a
    j = 1
    if a < 0
        if (n & 0b11) == 3 j = -1 end
        a = -a
    end
    # check if a is too large
    if a >= n a = mod(a, n) end
    while a != 0
        r = abs((n & 0b111) - 4)
        while (a & 1) == 0
            a = a >> 1
            if r == 1 j = -j end
        end
        a, n = n, a
        if (a & 0b11) == 3
            if (n & 0b11) == 3 j = -j end
        end
        a = mod(a, n)
    end
    if n == 1 return j else return 0 end
end

Yes, you are right that BPSW is widely used, but that is only because it seems to work. Nobody knows why - the best explanation is that the pseudoprimes for the two parts of the test seem to fall into different residue classes, so that even though the MR test has many pseudoprimes, and the Lucas test even more, they have none in common. It has been conjectured that BPSW pseudoprimes are dense, although, to date, not one has been found. I would wager a large sum that the number of Frobenius pseudoprimes for n<2^128 is less than or equal to the number of BPSW pseudoprimes.

"Some software likes to add some extra tests for >64-bit. Mathematica adds a M-R base 3" - do you know why? Because it is necessary to exclude the squares of the Weiferich primes like 1093^2 and 3511^2 which are MR(2) pseudoprimes. But they do it using an extra MR test because it seems to work for small n, and they do not fully trust the BPSW algorithm.

danaj commented 9 years ago

VicDrastik, yes Forisek and Jancina published this method. There were 2-base and 3-base hashed solutions for 2^49 and 2^64 respectively released earlier (Worley, Berg, Jacobsen). If you think the 512 byte table isn't worth it, that's fine -- I just wanted to let you know about it since you seemed interested in reducing the number of tests.

What do you mean by "Also, you need to add a squareness test before you do the Lucas part. Again, not worth it."? All the definitions and theorems from Khashin's paper have something like "...that it is not a perfect square..." in them. You need a perfect square test to use Khashin's results too.

I can't say whether a Lucas test is harder or not, because I still have never seen an implementation of Khashin's method. To my knowledge nobody has used it. You say you have code and that it's much easier than a Lucas test, but only give a Jacobi function. Are you using definition 1.1, doing a perfect square test followed by comparing (1+sqrt(c))^n ≡ (1-sqrt(c)) mod n? Definition 1.3, which has 3 variables, all unspecified (other than (c/n)=-1)? Remark 1.4? The theorems in section 2 all involve a known factor of n, so they are just used for computational results.

You say "Khashin's error rate is provably zero for n<2^60". The paper doesn't say that. Read the paragraph in the conclusions -- all those results only apply when c < 128, and he hasn't identified numbers where that holds other than to say they are rare. You're left with a test that seems awfully good for small numbers, is probably good for larger ones but we don't know, and has never been used by anyone other than the author. How is this better than the Frobenius-Underwood test which is faster and has zero counterexamples to 2^50 (with none known above), or BPSW with none to 2^64?

Grantham's bounds are proven bounds, and also apply to his QFT which is much more involved (e.g. step 1 is trial division up to 50,000). Khashin gives nothing here. I don't understand how you can compare the QFT 1/7710 bound to what Khashin shows in his paper.

BPSW working: "Nobody knows why" The 1980 paper makes it pretty clear that the numbers tend to lie in different residue classes. Unless you've worked for Wolfram, neither of us actually know why Mathematica adds the base-3 MR test. It's not necessary for Weiferich primes if a strong Lucas-Selfridge test is done. As for their early code, Pinch's 1993 paper is pretty damning, with incorrect implementations and custom solutions responsible for lots of problems. It could be that they do the base-3 test because their Lucas test is still borked (dubious). Or perhaps they updated it and added the extra base as a hedge against something happening again. Their notes still say "As of 1997, this procedure is known to be correct only for n < 10^16 [...]".

ironman353 commented 9 years ago

@VicDrastik, full_sieve() is the original function. When the limit is smaller than 10^6, Eratosthenes_1() is called by full_sieve(). Timings of the full_sieve() function are:

Pi(10^7) = 664579 in 0.108 seconds
Pi(10^8) = 5761455 in 0.329 seconds
Pi(10^9) = 50847534 in 2.792 seconds
Pi(10^10) = 455052511 in 29.697 seconds
Pi(10^11) = 4118054813 in 387.659 seconds
Pi(10^12) = 37607912018 in 5783.603 seconds

So it is faster than the function e5().

The full_sieve() function is in the lower part of my previous comment. When running the code, you have to use that function to get the better timings.

aiorla commented 9 years ago

@ironman353 sorry, I'm on exams and don't have much time to think about it; just a small comment:

When I finish I will gladly read the code. (if this issue is still open)

ironman353 commented 9 years ago

@aiorla, thanks for the suggestion. I just started learning julia, so I didn't know about the function isqrt(). I shall be glad if you check the code after your exam.

VicDrastik commented 9 years ago

Let's not confuse algorithms for 32-bit integers with those for 64-bit integers: Khashin's test of course works for 32-bit integers, but it is overkill to use it, and the same goes for BPSW. We can get away with using only M-R for small n i.e. both Khashin and BPSW should only be used for large n. There are 2314 MR(2) 32-bit spsp's, and any other (fixed) MR test reduces this to between 70 and 110 spsp's, normally needing yet another test to clean up (e.g. bases 2,7, and 61). My quadratic residue idea produces only 42 spsp's, but 32 of these can be quickly eliminated, and the remaining 10 can be zapped either using a table or simply 10 (fixed) trial divisions.

Definition 1.1 of Khashin is the correct one (the others are just used in the proof, not in the actual algorithm), so

let c be an odd positive number (it does not need to be prime, but it always will be)
let d = sqrt(c)
let j = jacobisymbol(c,n)

Then, if n is prime, (1+d)^n === (1+j*d) (mod n)

Khashin then proves that if we choose the smallest odd c such that j==-1, then there are no Frobenius pseudoprimes less than 2^60 (assuming gcd(2*3*5*7*11*13*17, n) = 1). It is also required that c < 128, but this turns out to be only of theoretical interest.
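
A minimal Julia sketch of that check (my translation of Definition 1.1, not Khashin's own code; jacobisymbol is assumed, n must be odd and not a perfect square or the c loop never terminates, and BigInt arithmetic dodges overflow):

function khashin_check(n::Integer)
    c = 3
    while jacobisymbol(c, n) != -1       # smallest odd c with J(c,n) == -1
        c += 2
    end
    # square-and-multiply for (1 + sqrt(c))^n in Z[sqrt(c)] mod n,
    # carried as the pair (ra, rb) = ra + rb*sqrt(c)
    ra, rb = big(1), big(0)
    a, b = big(1), big(1)                # the base 1 + sqrt(c)
    k = n
    while k > 0
        if isodd(k)
            ra, rb = mod(ra*a + rb*b*c, n), mod(ra*b + rb*a, n)
        end
        a, b = mod(a*a + b*b*c, n), mod(2*a*b, n)
        k >>= 1
    end
    return ra == 1 && rb == n - 1        # i.e. (1+sqrt(c))^n == 1-sqrt(c) (mod n)
end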

I have to concede your point that using the word "provably" is too large a claim. Also, that he does not explicitly claim this unconditional result. When I read Khashin's paper, I just automatically assumed he would always require a MR(2) test first and I immediately tested whether c<128 for base 2 strong pseudoprimes. Here is my code and the results:

function _()
    f = open("2-SPRP-2-to-64.txt", "r")
    g = open("output.txt", "w")
    count = zeros(Int, 999)
    for line in eachline(f)
        n = uint64(line)
        if n == 3511^2 continue end    # skip the two Wieferich-prime squares
        if n == 1093^2 continue end
        for c = 3:2:999                # smallest odd c with J(c,n) < 1
            if jacobisymbol(c, n) < 1
                count[c] += 1
                break
            end
        end
    end
    for i = 1:999
        if count[i] > 0 println(g, i, " , ", count[i]) end
    end
    println(g, sum(count) + 2)         # +2 for the two skipped squares
    close(g)
    close(f)
end

3 , 11515748
5 , 7836069
7 , 5608159
11 , 3275113
13 , 1727914
17 , 926810
19 , 487767
23 , 254188
29 , 130373
31 , 65303
37 , 33069
41 , 16793
43 , 8289
47 , 4218
53 , 2097
59 , 1016
61 , 548
67 , 276
71 , 138
73 , 59
79 , 32
83 , 18
89 , 7
97 , 4
101 , 2
103 , 1
107 , 1

31894014

So the largest c value needed is c=107. Yes, this is not a "proof", because it depends on computed results rather than algebra. But, assuming Feitsma's table and my code are correct, this covers all pseudoprimes up to 2^60. Naturally, it does not cover composites up to 2^60 which are not MR(2) pseudoprimes, but it doesn't have to if you do a MR(2) test first.

Several years ago, before Khashin's paper, I had a lengthy exchange of emails with Paul Underwood about his Frobenius-style test. Writing them both in a common notation, we can see the similarities

Underwood-Frobenius: for any integer p such that jacobisymbol(p^2-4,n)==-1, (2+x)^(n+1) === 5+2*p mod(x^2-p*x+1), (mod n)

Khashin-Frobenius: for the smallest odd integer c such that jacobisymbol(c,n)==-1, (1+x)^(n+1) === 1-c mod(x^2-c), (mod n)

Underwood's test is not faster (if anything, it has to be slightly slower). I suggested to Paul that he should run his test on Feitsma's Fermat(2) database, and he found no counterexamples, so we should not be surprised when Khashin's version also passes. (When I have time, I will run Khashin's test on Feitsma's full data set, not just the strong pseudoprimes.)

Grantham's seminal paper was great, but several authors (including Khashin) have since suggested that his 1/7710 bound was far too pessimistic.

The only modern test that can be compared to Khashin's is Loebenberger's version of the Crandall/Pomerance Frobenius test. Here is that algorithm in Python:

from math import gcd
from random import randint

# is_square(d), jacobi(d, n), inverse(b, n) (modular inverse) and bits_of(m)
# (the bits of m, most significant first) are helpers assumed by the author.
def frobenius(n):
    """Return True if n is probably prime."""
    if n % 2 == 0 or n < 2:
        return False
    # step 1
    while 1:
        a = randint(1, n - 1)
        b = randint(1, n - 1)
        d = a ** 2 - 4 * b
        if not (is_square(d) or gcd(2 * a * b * d, n) != 1):
            break
    # step 2
    m = (n - jacobi(d, n)) // 2
    w = (a ** 2 * inverse(b, n) - 2) % n
    u = 2
    v = w
    # step 3
    for bit in bits_of(m):
        if bit:
            u = (u * v - w) % n
            v = (v * v - 2) % n
        else:
            v = (u * v - w) % n
            u = (u * u - 2) % n
    # step 4
    if (w * u - 2 * v) % n != 0:
        return False
    return (pow(b, (n - 1) // 2, n) * u) % n == 2

The Lucas part of this takes only 2 Selfridges, rather than 3 for Khashin/Underwood, but the parameters are random, and so finding pseudoprimes is not possible.

danaj commented 9 years ago

I suspect a new issue should be opened for isprime, as sieving is really another subject. I strongly suspect a well-written SoE will beat the SoA by a wide margin. Optimizing the SoA is hard and makes very unreadable code, and it is still slower than an SoE with equally complex code.

@VicDrastik Making the reasonable assumption that we want to simplify 32-bit for performance rather than using the same test, how about this:

  1. Do trial division up to some point - to 11, 17, 53, whatever. Benchmarks on random inputs would determine the optimal amount, since it will depend on your trial division method and speed, and mostly on the relative speed of the M-R test.
  2. M-R base 2. We will use this for both 32-bit and 64-bit. For 32-bit, we have 2314 composites remaining, for 64-bit 31.9M.
  3. If 32-bit, use a 16 entry hashed M-R. That's less space than your idea, I believe.
     a) use a Mueller-style hash:
        x = n;
        x = (((x >> 16) ^ x) * 0x45d9f3b) & 0xFFFFFFFFUL;
        x = ((x >> 16) ^ x) & 15;
     b) use it to index into a table of bases:
        static const uint16_t mr_bases_second[16] =
                  {2249,483,194,199,15,369,499,945,419,735,33,471,946,615,497,702};
     c) M-R( mr_bases_second[x] )

Two bases, deterministic for all 32-bit numbers with only 16 small bases in a table, and no exceptions cases (no requirement for any trial division either).
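
Putting the pieces together, a minimal Julia sketch of this 32-bit recipe (miller_rabin(n, base) is an assumed strong-probable-prime helper; the hash and table are the ones above):

const MR_BASES_SECOND = [2249,483,194,199,15,369,499,945,419,735,33,471,946,615,497,702]

function isprime32(n::UInt32)
    (n == 2 || n == 3) && return true
    (n < 2 || iseven(n)) && return false
    miller_rabin(n, 2) || return false   # M-R base 2 first
    x = ((n >> 16) $ n) * 0x45d9f3b      # $ is xor in 2015-era Julia; UInt32 wraps
    x = ((x >> 16) $ x) & 0xf
    return miller_rabin(n, MR_BASES_SECOND[x + 1])
end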

For 64-bit, step 3 is some other test. I prefer a (strong,AES,ES) Lucas test making BPSW, you prefer Khashin's Frobenius test. For 64-bit inputs they're equally good (anything will work here as long as you run it through the SPSP-2 database), so now we're debating performance and coding complexity for 64-bit, and performance + probability for larger (er...are we discussing larger than 64-bit here, yes? If not then almost all of this discussion is moot -- just use what passes the SPSP-2 database and runs fastest).

On to Khashin. I wrote a version last night that works with definition 1.1, using binary ladder exponentiation and treating the multiplication like complex-number multiplication. It's slower than a Lucas test, but the code could be optimized, and Montgomery math would speed it up a bit. Improvements or suggestions welcome.

  int k = 0;
  UV c = 1;
  if (is_perfect_square(n)) return 0;
  while (k != -1) {              /* smallest odd c with (c|n) = -1 */
    c += 2;
    k = kronecker_uu(c, n);
  }
  /* Next line isn't necessary */
  if (!is_prob_prime(c)) croak("Khashin invalid c for %lu\n", n);
  {
    /* compute (1+sqrt(c))^n = ra + rb*sqrt(c) mod n; the accumulator
     * starts at 1+sqrt(c), so only n-1 further multiplies are needed */
    UV ra = 1, rb = 1, a = 1, b = 1, k = n-1;
    while (k) {
      if (k & 1) {
        UV ta=ra, tb=rb;
        ra = addmod( mulmod(ta,a,n), mulmod(mulmod(tb,b,n),c,n), n );
        rb = addmod( mulmod(tb,a,n), mulmod(ta,b,n), n);
      }
      k >>= 1;
      if (k) {
        UV ta=a, tb=b;
        a = addmod( sqrmod(ta,n), mulmod(sqrmod(tb,n),c,n), n );
        b = mulmod(ta,tb,n);
        b = addmod(b,b,n);
      }
    }
    return (ra == 1 && rb == n-1);   /* i.e. (1+sqrt(c))^n == 1-sqrt(c) mod n */
  }

To be faster than Underwood's test it would need to be written differently. His test uses only 2 full mulmods per bit. AES Lucas uses only 2, Lucas with Q=+/-1 uses 2. This uses 4+. This is a simple literal translation of (1+1D)^n mod n [D=sqrt(c)], so could likely be improved.

The count of c values over SPSP-2s I get is similar to yours, though not exact. Your Jacobi function has two issues. The first and most important is that the while (a is even) loop needs to check n%8==3 and n%8==5. You're only testing the latter. The second issue might be irrelevant in your language, but be careful of taking in signed integers when the input can be 64-bit unsigned.

I'm perfectly fine with saying "assuming the Feitsma/Galway database is correct" -- I certainly use that assumption. The math involved in restricting the candidates has been double checked, though that still leaves the computation.

I've run the Frobenius-Khashin code above over the full PSP database. It only took 15 minutes to run. No prefilters other than n%2.

Re Paul Underwood: his minimal lambda+2 test was first posted on 30 Apr 2012, the updated version on 21 May 2013. 5 days after his new test was released, I had tested over both Feitsma databases and notified him. I released it in my software in June 2013.

I released a version of Loebenberger's modification to the C/P Frobenius in January to my GMP project. It's just used internally however.

I think the point of the F-U test is that you can use it without a Miller-Rabin test in front, hence the overall speed is comparable to BPSW (sometimes faster) and quite a bit faster than a standard Frobenius test. The problems I see are:

  1. Without the M-R base-2 test at the front, it's quite a bit slower on almost all composites because it doesn't have an early exit.
  2. It's been tested to 2^50, which while large still leaves something to be desired. Put a M-R base-2 test in front and it's good to 2^64 of course, but then it's slower than BPSW.
  3. BPSW is known to work to 2^64, and has been used by almost everyone on a daily basis for 20 years now. Years of grad students' lives have been devoted to intelligently searching for counterexamples, not to mention some faculty. Primo uses this test on every step of its proofs, and hasn't found one in 15+ years. This doesn't prove anything, but it gives a great deal more confidence vs. other tests which look good on paper.

Basically I have the same objections to Khashin's test as I do to F-U. If done with a M-R base 2 test in front then (2) and (3) still apply. For some projects that want more certainty (my prime gap verifier for instance) I've used BPSW followed by F-U. If we use C/P or Loebenberger style, then I'd say use the Selfridge parameter selection so we get BPSW with the extra check (Frobenius) at the end. That gives you some extra security at little cost.

tkelman commented 9 years ago

While this conversation is fascinating, I feel like it might be better-suited for the julia-users mailing list. I'm not aware of any applications for which having primality testing as part of the standard library is an absolute requirement when it could just as well be implemented in a package, but correct me if you know of some. Perhaps the gmp integration calls for having at least a basic implementation. A PR to improve what we have right now would be welcome. Number-theory-related packages can go ahead and implement all sorts of different algorithmic variants, if there isn't consensus on a single best choice.

VicDrastik commented 9 years ago

Your 32-bit hash idea is just what we need - it is simpler, faster, and shorter, and although its table uses 12 more bytes than mine, the saving in code length is surely far more than that, so it is better in every way. Pure gold!

I wonder if we can squeeze even more performance out of this idea - using just a single base 2 MR test, we can correctly test up to n=2046. If we did a few trial divisions and then used a hash with 16 M-R bases, how large could n be? 2^16, maybe?

My reason for liking Khashin's Frobenius test is that it has strong theoretical justification and so can be relied upon for n>2^64 more than the older methods. But, as you point out, it is clearly at least 50% slower for n<2^64 (on primes, that is). I have just realised that for n>2^64, Julia will promote to BigInts and use external libraries, so performance will be very poor for any method, including Khashin's. So your suggestion that Loebenberger's version of C/P can be made equivalent to BPSW just by using Selfridge parameter selection sounds good - do you mean that we choose b=2, d=a^2-4b, j=jacobisymbol(d,n), and then look for a value of a which makes j==-1? If so, this is definitely simpler and faster for n<2^64.

You are right about Montgomery modular multiply - eventually it will be necessary to use this to avoid automatic promotion to BigInts.

About the Jacobi symbol - first, I was not counting j==-1 but j<1, because j==0 immediately shows that n is composite. Second, my Jacobi code is OK: in the "a is even" loop, it tests both cases with a single comparison, because r is defined such that r==1 covers both n%8==3 and n%8==5 (r = abs((n&0b111)-4)).

Good to know that Khashin-Frobenius works on Fermat psp's and not just MR spsp's.

ScottPJones commented 9 years ago

@VicDrastik Where do you see julia promoting to BigInt? I get the following pretty strange results (no promotion at all, and results going from positive to negative)

julia> 2^63
-9223372036854775808
julia> 2^62
4611686018427387904
julia> 2^64
0
julia> Int128(2)^64
18446744073709551616
julia> Int128(2)^128
0
julia> Int128(2)^127
-170141183460469231731687303715884105728

IainNZ commented 9 years ago

Julia doesn't promote to BigInt, and the results you are getting are from integer overflow (and aren't strange, or at least are well-defined), e.g.

julia> typemin(Int64) == 2^63
true

danaj commented 9 years ago

@VicDrastik We can shorten to only 11 values (use mod 11 in the hash instead of & 15): 12829, 8621, 11795, 1994, 4190, 3407, 603, 627, 351, 11147, 10734 Of course double check once implemented! That is M-R base 2, then the selected second base. No trial division requirements necessary other than being odd.

I haven't found deeper trial division to be very helpful in making smaller hashes. It just doesn't remove enough values. As for trial division helping with an early exit, factoring to 53 (not unreasonable) gets you to 42799. You have to go up to 157 to get past 49141 and then you hit 104653. Then you need 229 to get you to 256999, and so on.

If you do trial factoring through 53 at the start, then you have your answer for all n < 3481 (59*59) without running a single M-R test. I'd then split off small numbers into a trial factoring loop (say below 500k). It all depends on relative speeds in your language/implementation, so the best cutoff will be different.

Re: Loebenberger's Frobenius, I was thinking of the Selfridge selection (method A from Baillie/Wagstaff 1980): P=1, D 5,-7,9,-11,13,... until (D|n)=-1, then Q=(1-D)/4. This should be able to then apply to Loebenberger with a=P,b=Q. For 64-bit you can stop after the Lucas test part. Double check that logic though. There are still pseudoprimes from the Lucas part of course, they just don't overlap with the M-R base 2 pseudoprimes under 2^64.
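
For concreteness, a minimal sketch of that Selfridge selection (assuming a jacobisymbol helper like the one earlier in the thread; a perfect-square check must come first or the loop never terminates):

function selfridge_params(n)
    D = 5
    while jacobisymbol(D, n) != -1
        D = D > 0 ? -(D + 2) : -(D - 2)   # 5, -7, 9, -11, 13, ...
    end
    return 1, div(1 - D, 4), D            # P = 1, Q = (1-D)/4, and D
end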

Loebenberger's theorem 6 is very similar to the conditions for an extra strong Lucas pseudoprime, and the V condition is mentioned in Baillie/Wagstaff as well, as an extra test with low cost. You could also look into the "almost extra strong" and "extra strong" Lucas tests, which use the V condition with Q=1; P=3,5,7,... until (D|n)=-1, to make a very fast implementation. This is used by Pari, among others (Pari increments P by 2, Baillie suggests by 1; I don't think it matters). These still have readily findable pseudoprimes, so for small numbers they seem less stringent than the Frobenius tests we've discussed (Underwood, Khashin, or (P,2)).

ScottPJones commented 9 years ago

@IainNZ That's the sort of thing I'd have expected more from C than from Julia, and I've already fixed one case where suddenly switching from positive to negative was deemed incorrect. To me, it seems that this should raise an InexactError instead of giving an incorrect result (which is what you get if you try to convert 128 to an Int8, for example).

IainNZ commented 9 years ago

It's essential for fast integer performance though, so that ship sailed a long time ago. Note that there is a subtle distinction between performing operations on a type and converting between types, e.g.

julia> int(BigInt(2)^64)
ERROR: InexactError()
 in convert at gmp.jl:132
 in int at base.jl:108

julia> 2^64
0

Int64 is a well-defined thing; it just wraps around. Mashing arbitrary-size numbers into an Int64 doesn't go - you cannot perform the operation meaningfully, hence the error. At least that's my reasoning.

ScottPJones commented 9 years ago

@IainNZ I just brought this up because of @VicDrastik's comment:

I have just realised that for n>2^64, Julia will promote to BigInts and use external libraries

which did not seem correct to me (and also, there is support for Int128 in Julia as well - does that end up using external libraries?). I also worked for decades on a language where all overflows were reported as errors...

KristofferC commented 9 years ago

"I also worked for decades on a language where all overflows were reported as errors..."

" (I've been using and implementing structures [multidimensional associative arrays], for the past 29 years,"

"I've spent almost 30 years as one of the developers/architects for a language/database system where that is done all the time."

"I came up with my own packed scheme for storing Unicode data much more efficiently back 18 years ago..."

"Over the years (27 dealing with national character sets, 20 with Unicode) I did a lot of work doing just these sorts of operations..."

" Also, as a C programmer for 35 years, I disagree with saying that in common parlance string means NUL-terminated... "

How is this relevant? What should we do with all this information about what @ScottPJones has done for the last x years?

johnmyleswhite commented 9 years ago

Yes, let's avoid autobiographical information. It alienates some people and distracts from the technical discussions we need to have.

ScottPJones commented 9 years ago

OK, forget the number of years that I've been dealing with these issues, that doesn't take away from the technical points I've been trying to illustrate, about whether languages can/should report overflows as errors, about associative array design, schemes for storing Unicode (look at SCSU and BOCU-1), issues with 8-bit, multibyte and Unicode, and use of NUL-termination in C code/libraries.

johnmyleswhite commented 9 years ago

OK, forget the number of years that I've been dealing with these issues, that doesn't take away from the technical points I've been trying to illustrate,

But the issue is precisely that your tone does affect your ability to have effective technical conversations. People need to be excited to collaborate with you, which means that you need to write prose that doesn't upset anyone.

ScottPJones commented 9 years ago

Fine, I'm trying to learn how to interact better here... It is frustrating, though, when people pop up with a many-line comment just to criticize me personally, which has absolutely nothing to do with the technical discussion at hand...

ScottPJones commented 9 years ago

The technical point I was making in the sentence that bothered @KristofferC was that other languages do give an error, or promote to a larger size... Python, for example:

>>> pow(2,64)
18446744073709551616
tkelman commented 9 years ago

(We have been a little overly harsh on @ScottPJones, hopefully we can all find a better way of working together so people avoid getting on each other's nerves quite as badly.)

other languages do give an error, or promote to a larger size... Python for example:

Do any other languages do so and still maintain integer arithmetic performance comparable to C?

ScottPJones commented 9 years ago

I'd like that! I totally understand why overflow is not checked for the basic operations; however, exponentiation was something that I didn't really view as being in the same class as +, -, *, /, or <<. It's expensive enough that I would have thought overflow checking could be done without a large % penalty. I also think it would be nice, somehow, to have a "checked" arithmetic mode... just as Julia normally does bounds checking but lets you skip it with @inbounds. It seems that all math in julia is currently unsafe... (maybe we need add and unsafe_add, etc.!)

quinnj commented 9 years ago

@ScottPJones, see https://github.com/JuliaLang/julia/issues/855

StefanKarpinski commented 9 years ago

Compare unchecked code for x^2 versus checked code for x*x (not even checked x^2):

julia> sqr(x) = x^2
sqr (generic function with 1 method)

julia> @code_native sqr(123)
    .section    __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
    pushq   %rbp
    movq    %rsp, %rbp
Source line: 1
    imulq   %rdi, %rdi
    movq    %rdi, %rax
    popq    %rbp
    ret

julia> checked_sqr(x) = checked_mul(x,x)
checked_sqr (generic function with 1 method)

julia> @code_native checked_sqr(123)
    .section    __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
    pushq   %rbp
    movq    %rsp, %rbp
    pushq   %r15
    pushq   %r14
    pushq   %rbx
    subq    $40, %rsp
    movq    %rdi, %r14
    movq    $6, -64(%rbp)
Source line: 1
    movabsq $jl_pgcstack, %rax
    movq    (%rax), %rcx
    movq    %rcx, -56(%rbp)
    leaq    -64(%rbp), %rcx
    movq    %rcx, (%rax)
    xorps   %xmm0, %xmm0
    movups  %xmm0, -48(%rbp)
    movq    $0, -32(%rbp)
    movabsq $4619782280, %rax       ## imm = 0x1135C4888
Source line: 1
    movq    (%rax), %rbx
    testq   %rbx, %rbx
    je  L218
    movq    %rbx, -48(%rbp)
    movabsq $jl_box_int64, %r15
    movq    %r14, %rdi
    callq   *%r15
    movq    %rax, -40(%rbp)
    movq    %r14, %rdi
    callq   *%r15
    movq    %rax, -32(%rbp)
    movq    -8(%rbx), %rax
    shrq    $2, %rax
    cmpq    $1141923900, %rax       ## imm = 0x4410603C
    je  L176
Source line: 1
    leaq    -48(%rbp), %rsi
Source line: 1
    movabsq $jl_apply_generic, %rax
    movabsq $4591455440, %rdi       ## imm = 0x111AC0CD0
    movl    $3, %edx
    callq   *%rax
    jmpq    L190
L176:   leaq    -40(%rbp), %rsi
    movq    %rbx, %rdi
    movl    $2, %edx
    callq   *(%rbx)
L190:   movq    -56(%rbp), %rcx
    movabsq $jl_pgcstack, %rdx
    movq    %rcx, (%rdx)
    addq    $40, %rsp
    popq    %rbx
    popq    %r14
    popq    %r15
    popq    %rbp
    ret
L218:   movabsq $jl_undefined_var_error, %rax
    movabsq $13174591616, %rdi      ## imm = 0x311445080
    callq   *%rax

While it's possible we could do a better job with the checked version, you'll note that throwing errors on really basic integer operations is a performance trap, even for things like ^. People write x^2 all the time, expecting it to be equivalent to writing x*x.

That said, let's keep this issue focused on prime sieve performance.

Keno commented 9 years ago

@StefanKarpinski You forgot to import checked_mul. The checked version is worse, but it's not quite that bad ;):

julia> @code_llvm checked_sqr(2)

; Function Attrs: sspreq
define i64 @julia_checked_sqr_21097(i64) #-1 {
top:
  %StackGuardSlot = alloca i8*
  %StackGuard = load i8*, i8** @__stack_chk_guard
  call void @llvm.stackprotector(i8* %StackGuard, i8** %StackGuardSlot)
  %x = alloca i64
  store i64 %0, i64* %x
  %1 = load i64, i64* %x
  %2 = load i64, i64* %x
  %3 = call { i64, i1 } @llvm.smul.with.overflow.i64(i64 %1, i64 %2)
  %4 = extractvalue { i64, i1 } %3, 1
  %5 = load %jl_value_t*, %jl_value_t** @jl_overflow_exception
  %6 = xor i1 %4, true
  br i1 %6, label %pass, label %fail

fail:                                             ; preds = %top
  call void @jl_throw_with_superfluous_argument(%jl_value_t* %5, i32 1)
  unreachable

pass:                                             ; preds = %top
  %7 = extractvalue { i64, i1 } %3, 0
  call void @llvm.stackprotectorcheck(i8** @__stack_chk_guard)
  ret i64 %7
}
@code_native checked_sqr(2)
Filename: none
Source line: 1
Source line: 1
    .section    __TEXT,__text,regular,pure_instructions
    subq    $24, %rsp
    movabsq $__stack_chk_guard, %rax
    movq    (%rax), %rcx
    movq    %rcx, 16(%rsp)
    movq    %rdi, 8(%rsp)
    imulq   %rdi, %rdi
    jo  L51
    movq    (%rax), %rax
    cmpq    16(%rsp), %rax
    jne L81
    movq    %rdi, %rax
    addq    $24, %rsp
    retq
L51:    movabsq $jl_overflow_exception, %rax
    movq    (%rax), %rdi
    movabsq $jl_throw_with_superfluous_argument, %rax
    movl    $1, %esi
    callq   *%rax
L81:    movabsq $__stack_chk_fail, %rax
    callq   *%rax
ScottPJones commented 9 years ago

Maybe these two last very interesting and informative messages could be deleted from here, and reposted in #855, to keep this focused more on sieve performance?

ironman353 commented 9 years ago

I am a bit confused. A lot of the discussion here is about improving the isPrime function. I thought we wanted a better sieve. In that case a well-written sieve of Eratosthenes almost always beat sieve of Atkin. If we only want to count primes in a range rather than generating all primes, then there are very fast methods based on the Meissel-Lehmer prime counting algorithm. Is this discussion about prime sieving, prime counting, or primality testing? Someone please clarify.

VicDrastik commented 9 years ago

The original proposal was to improve support for basic number theory in primes.jl in core Julia. The discussion has been useful for that purpose, but an exotic item like counting primes in a range is better implemented in an add-on package.

danaj commented 9 years ago

ironman353: "...a well-written sieve of Eratosthenes almost always beat sieve of Atkins."

Completely agree, other than the name being Atkin.

For counting, Legendre and Meissel are relatively easy. Lehmer, LMO, and eLMO are a lot more code (there are at least two open source implementations available for each). I suspect it's all better put in a package, but that's up to someone else.
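
As a pointer to how small the Legendre counter is, a minimal sketch (uses the primes function from Base; naive recursion with no memoization, so only sensible for small n):

function legendre_pi(n)
    n < 2 && return 0
    ps = primes(isqrt(n))              # primes up to sqrt(n)
    a = length(ps)
    # phi(x, b): how many of 1:x are divisible by none of the first b primes
    phi(x, b) = b == 0 ? x : phi(x, b - 1) - phi(div(x, ps[b]), b - 1)
    return phi(n, a) + a - 1           # Legendre: pi(n) = phi(n, a) + a - 1
end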

isprime (or is-prime or is_prime or is_prob_prime or ...) seems like a nice basic function to have, and something that ought to be fast for small inputs since it often gets put in a loop.

Ismael-VC commented 9 years ago

This is a very interesting read. I still don't understand all that you are saying, but I'd like to post my implementation of the Atkin sieve here, ported directly from the Wikipedia pseudocode.

I'm amazed to see that it performs so well for a naive implementation, and it made me win a little challenge at my school :laughing:

Tested at JuliaBox with version 0.4.0-dev+5491 and a limit of 10^8, it beats Base.primes in both time and space.

Ismael-VC commented 9 years ago

With a limit of 10^9 on a Mac and version 0.3.9 (just changed round(Int, sqrt(limit)) to int(sqrt(limit))), primes uses less memory and spends less time in the gc, but it's slower:

julia> versioninfo()
Julia Version 0.3.9
Commit 31efe69 (2015-05-30 11:24 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> gc(); @time primes(1000_000_000);
elapsed time: 72.423236327 seconds (531889264 bytes allocated, 0.02% gc time)

julia> gc(); @time atkin(1000_000_000);
elapsed time: 27.908726228 seconds (2342278320 bytes allocated, 0.17% gc time)

I'm now testing gc(); @time primes(10_000_000_000); and I hope it doesn't die on me. I can't test more than 10^8 at JuliaBox without the kernel dying.

Ismael-VC commented 9 years ago

I'll study primesmask in order to understand it; here is my last test:

julia> gc(); @time primes(10_000_000_000);
elapsed time: 809.601231674 seconds (4890727200 bytes allocated, 0.00% gc time)

julia> gc(); @time atkin(10_000_000_000);
elapsed time: 332.286719798 seconds (160351721104 bytes allocated, 0.32% gc time)

pabloferz commented 9 years ago

I've been working on a faster sieve (as some of you may already know) in #12025, and I was wondering whether anyone really needs the primesmask function as it is, or whether it would be enough to have a wheel sieve as mentioned by @VicDrastik above.

If there is no need for primesmask to generate a mask that includes all the integers between 1 and n, then I can implement a much faster version of primes. That also applies to the function primes(lo::Int, hi::Int) that I implemented in #12025.