Ping @aiorla since you seem interested in the primes functionality.
Thanks for the ping @StefanKarpinski! Actually I didn't look into primesmask much when I did the factor push, but from the things you quote it seems that he's talking about a (really) old Julia, doesn't it? (Now there is no loop running to n, we use the Sieve of Atkin, falses(n), ...)
PS: I'm in finals right now, but when I finish (23rd of June) I will try to do a definitive primes.jl PR to have a competitive implementation of factor (the main problem is currently the slow powermod for Int128 (#3036), which makes it unusable, so I will try to fix that). Then I'll go to another (more relevant?) part of the code! (Maybe I'll ask you for suggestions :smile:)
I'm pretty sure the comments were relevant to this version.
I'm sorry, I don't get it then. :disappointed:
In current Julia: all the for loops in the function run over 1:r (where r = isqrt(n)), it starts by setting a not-prime flag on all the integers (primesmask(n::Int) = primesmask(falses(n))), and the fact that large primes are 6k±1 is implicit in the Sieve of Atkin.
Do you mean that he has implemented a different algorithm (I suppose a version of the Sieve of Eratosthenes with those improvements) and that it beats master?
Stefan, I am the 'someone' who emailed last week about the sieve. I replied to your reply last Tuesday, but the email bounced and I was only notified about this yesterday. I emailed you yet again yesterday - still waiting for a reply from you. One (or both) of us are having email problems - if you didn't get my email from yesterday, please send me another working email address for yourself.
The purpose of my first email was not merely to speed up the sieve, but to improve primes.jl in other ways as well. The code I sent was just an initial proposal and thus a basis for discussion.
@VicDrastik Why not just submit a PR right here - which will certainly be well received and also get feedback from others?
@VicDrastik, sorry about the email issues – whether they are on your end or mine. Since email doesn't seem to be a reliable medium of communication in this case, would you mind submitting a pull request? Can you address some of @aiorla's confusion about which version of the number sieve you improved?
Since there is so much interest in the code, here it is:
function e4(s::AbstractVector{Bool})
    n = length(s)
    s[:] = true; s[1] = false; s[4:2:n] = false
    for p = 3:2:isqrt(n)
        if s[p] s[p*p:2p:n] = false end
    end
end
And here is the fastest version:
function e5(s::AbstractVector{Bool})
    n = length(s)
    s[:] = false
    s[2] = true
    s[3] = true
    s[5:6:n] = true
    s[7:6:n] = true
    for p = 5:2:isqrt(n)
        if s[p] s[p*p:2p:n] = false end
    end
end
Feel free to make any use of this you like - the only new idea here is that in e5() I first set all flags to composite, then set the flags for possible primes to true; the opposite of the usual method as exemplified by e4() above.
But primes.jl does not export the sieve function primesmask(), and I think it should. In fact, primes.jl should provide 2 sieve functions: a standard sieve as above, but starting at 'a' and running from a+1 to a+n, and a wheel sieve, which is a much more compact sieve. Storing a list of all 32-bit primes takes about 811 MB if they are stored as integers, 537 MB if stored as a binary sieve as above, but only 143 MB if stored as a wheel sieve. Binary sieves are useful when working with composites, but wheel sieves are more useful when working with primes, so it would be nice if Julia provided both.
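To make the storage figures concrete, here is a minimal sketch of wheel-sieve indexing, assuming a mod-30 wheel (the comment above does not name the wheel, but 2^32/30 bytes ≈ 143 MB matches a mod-30 wheel, where each byte covers 30 integers with one bit per residue coprime to 30):
# Sketch only: mod-30 wheel storage. Multiples of 2, 3 and 5 get no bit at all;
# the 8 residues coprime to 30 share one byte per block of 30 integers.
const WHEEL30 = [1, 7, 11, 13, 17, 19, 23, 29]
const BITPOS = Dict(r => b for (b, r) in enumerate(WHEEL30))   # residue -> bit 1..8

wheelbytes(n) = cld(n, 30)   # 2^32 -> ~143 MB, vs cld(2^32, 8) = 537 MB for a plain bit sieve

wheel_index(n) = (div(n, 30) + 1, get(BITPOS, mod(n, 30), 0))  # (byte, bit), bit 0 if 2, 3 or 5 divides n

function wheel_isset(sieve::Vector{UInt8}, n)
    byte, bit = wheel_index(n)
    bit == 0 && return false          # 2, 3 and 5 themselves must be special-cased
    return sieve[byte] & (0x01 << (bit - 1)) != 0
end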
Since Stefan is happy to discuss this topic publicly, here is the second half of my second email to him:
Ideally, I would like to re-design and re-write primes.jl entirely. At present, primesmask() is not exported, but I need something like it for my work. I would like to have a sieve function that starts not at 1, but at 'a', and sieves the integers a+1:a+n i.e. primesmask(s,a) returns a sieve of length n starting at base a+1.
Instead of factor(n), which returns a dictionary of the factors (and their multiplicities) of n, I would like a function called trialdivision(n,limit=isqrt(n)) which simply returns a (sorted) array of prime factors of n less than or equal to limit. If limit is omitted, we just get a list of all (small) prime factors of n, which can then trivially be used to create the same dictionary that factor() now returns. Trialdivision() would not call primes(), but instead divide first by 2 and 3, and then by 6k-1 and 6k+1 for k=1,2,3,.... This should be both simpler and faster than the current function.
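A rough sketch of what such a trialdivision could look like (the name and signature come from the paragraph above; this is an illustration of the described behavior, not existing primes.jl code):
# Hypothetical trialdivision(n, limit): strips 2 and 3, then tries 6k-1 and 6k+1,
# returning a sorted array of the prime factors found (with multiplicity).
function trialdivision(n::Integer, limit::Integer = isqrt(n))
    factors = typeof(n)[]
    for p in (2, 3)
        while n % p == 0
            push!(factors, p)
            n ÷= p
        end
    end
    d = 5                                # candidates 5, 7, 11, 13, ... (6k ± 1)
    while d <= limit && d * d <= n
        while n % d == 0
            push!(factors, d)
            n ÷= d
        end
        d += d % 6 == 5 ? 2 : 4
    end
    1 < n <= limit && push!(factors, n)  # a leftover cofactor at or below the limit is prime
    return factors
end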
isprime() also needs improvement. First, it should do trial division up to some limit (usually n^(1/4)). If n < 2^32, isprime() now does Miller-Rabin with the 3 bases 2, 7, and 61, but I have found a way to do it with only 2 M-R tests.
Algorithm:
[1] Do a M-R base 2 test.
[2] Do a M-R test with the smallest odd number with the opposite Jacobi symbol to 2, i.e. use the smallest odd a such that J(2,n)*J(a,n) = -1 (one of these will be a quadratic residue mod n, and the other will not; this is actually not quite correct, but the idea works!). (See the sketch after this list.)
[3] This leaves only 42 pseudoprimes. 13 of these are of the form (x+1)(2x+1) and 19 are of the form (x+1)(4x+1), which can be quickly tested for with a single square root. (That reminds me: we also need to add a perfect square test to primes.jl: isperfectsquare(n) = (n == isqrt(n)^2).)
[4] This leaves only 10 composite values (953*2381, 601*100801, 109*541*2269, 6703*53617, 4723*113329, 7369*93329, 421*701*2381, 14621*87721, 151*751*28351, 13381*240841), which can be eliminated using gcd, or simply by dividing by the smallest factor of each, or even by a simple table search.
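A sketch of steps [1] and [2] in Julia, assuming a millerrabin(n, base) strong probable-prime helper (not a Base function) and the jacobisymbol code posted later in this thread; the cleanup in steps [3] and [4] is left out:
# Steps [1]-[2] only: the 42 surviving pseudoprimes from [3]-[4] still need
# to be weeded out before this is a correct 32-bit isprime. Assumes odd n > 3.
function isprime_qr(n)
    millerrabin(n, 2) || return false   # [1] M-R base 2
    j2 = jacobisymbol(2, n)
    a = 3
    while j2 * jacobisymbol(a, n) != -1
        a += 2                          # [2] smallest odd a with the opposite symbol
    end
    return millerrabin(n, a)
end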
The same trick can be used to reduce the number of M-R tests for 64-bit integers from 7 to 6, but I would not do it that way. Khashin, in his 2013 paper (Counterexamples for Frobenius primality test, arxiv.org/abs/1307.7920 ) has shown that a single Frobenius test (which takes the same time as 3 M-R tests) has no pseudoprimes less than 2^60. I have run this test on Feitsma's list of M-R(2) 64-bit pseudoprimes and found no counterexample. So the algorithm would be:
[1] trial divide up to some optimal value (at least 17)
[2] do a M-R base 2 test
[3] do a Frobenius test
In the worst case, this takes 4 M-R, rather than 7 M-R as in the current isprime().
I would be happy to do this work myself if you like.
I'm not an expert in number theory or a competent julia programmer, but since I've recently spent some time in primes.jl I would like to post my position on some of the ideas you propose.
primesmask() / primes(n): I doubt that e5 is faster than the current primesmask implementation (especially primesmask(n), a Sieve of Atkin), but I haven't tested it, so maybe I'm mistaken. A primesmask(a,n) that sieves from a to a+n would be a nice addition.
factor(n): factor(n) does not call primes(K) in every call, it uses some precomputed primes (PRIMES) to get rid of factors smaller than 2^16. For composites with the smallest factor bigger than 2^16 it uses Pollard's rho algorithm (pollardfactors!) as explained in the code; a bare-bones sketch of the rho idea follows this comment. I'm sure that it can be improved (as I said in a previous comment) but I think that it's much better than trial division (at least for not-small numbers).
isprime(n): …
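For readers unfamiliar with the rho idea, a minimal Floyd-cycle version (this is NOT Base's pollardfactors!, just an illustration of the underlying algorithm):
# Pollard's rho with Floyd cycle detection; retry with a different c on failure.
function pollard_rho(n::Int64, c::Int64 = 1)
    f(x) = Int64(mod(widemul(x, x) + c, n))   # x^2 + c mod n, computed via Int128
    x, y, d = 2, 2, 1
    while d == 1
        x = f(x)                              # tortoise
        y = f(f(y))                           # hare
        d = gcd(abs(x - y), n)
    end
    return d == n ? nothing : d               # a nontrivial factor, or failure
end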
I am not familiar with Julia. Can you please post some benchmark results of your prime-sieve algorithm? Usually a good segmented sieve takes O(n log log n) operations and uses O(sqrt(n)) space. Pi(n) = no. of primes generated for the limit n. Run-times of the Java code for generating all the primes up to a limit, one that I wrote a few days back:
Pi(10^7) = 664579 in 0.046 seconds
Pi(10^8) = 5761455 in 0.240 seconds
Pi(10^9) = 50847534 in 2.313 seconds
Pi(10^10) = 455052511 in 27.612 seconds
Pi(10^11) = 4118054813 in 358.159 seconds
Pi(10^12) = 37607912018 in 5376.441 seconds
Ok, so I was curious and did a bit of testing and I was quite wrong about our sieve speed. (as I said I was maybe mistaken :disappointed:)
e5(10^7) => 0.055 s - julia = 0.236 s
e5(10^8) => 0.7 s - julia = 4.96 s
e5(10^9) => 8.443 s - julia = 62.35 s
@VicDrastik, I'm in favor of your proposed improvements – a pull request would be great. Let us know if you have any issues or problems.
Here is an improved version of segmented sieve (along with the benchmark). It works fine for any limit up to at least 10^12. @StefanKarpinski, @VicDrastik, @aiorla, Any idea for further improvement will be appreciated.
function Eratosthenes_1(MAX::Int)
    pi::Int64 = 0
    isPrime::Array{Bool} = trues(MAX)
    j::Int = 0
    SQRT_MAX::Int = isqrt(MAX)
    for i::Int = 2 : SQRT_MAX
        if isPrime[i]
            j = i * i
            while j < MAX
                isPrime[j] = false
                j += i
            end
        end
    end
    for i::Int = 2 : MAX - 1
        if isPrime[i]
            pi += 1
        end
    end
    return pi
end
function smallPrimes(B::Int64)
    MAX::Int = isqrt(B) + 1
    pi::Int = 0
    isPrime::Array{Bool} = trues(MAX)
    primes::Array{Int64} = zeros(5150)    # enough slots for the primes below ~50000
    j::Int = 0
    SQRT_MAX::Int = isqrt(MAX)
    for i::Int = 2 : SQRT_MAX
        if isPrime[i]
            j = i * i
            while j < MAX
                isPrime[j] = false
                j += i
            end
        end
    end
    for i::Int = 2 : MAX - 1
        if isPrime[i]
            pi += 1
            primes[pi] = i
        end
    end
    primes[5150] = pi    # the prime count is stashed in the last slot
    return primes
end
function countLargePrimes(A::Int64, B::Int64)
    primes = smallPrimes(B)
    largePrimes::Array{Bool} = trues(B - A + 1)
    lim::Int64 = isqrt(B) + 1
    p::Int64 = 0
    s::Int64 = 0
    q::Int64 = primes[5150]    # number of small primes (see smallPrimes)
    count::Int64 = 0
    for i::Int64 = 1 : q
        p = primes[i]
        if p > lim
            break
        end
        s = div(A, p) * p    # largest multiple of p at or below A
        for j::Int64 = s : p : B
            if j < A
                continue
            end
            largePrimes[j - A + 1] = false
        end
    end
    for i::Int = 1 : q
        p = primes[i]
        if p > B
            break
        end
        if p >= A && p <= B
            count += 1
        end
    end
    for i::Int = 1 : B - A + 1
        if i + A <= q
            continue
        end
        if largePrimes[i]
            count += 1
        end
    end
    return count
end
function full_sieve(MAX::Int64)
    if MAX <= 1000000
        return Eratosthenes_1(MAX)
    end
    actualLimit::Int64 = MAX
    count::Int64 = 1    # starts at 1 to account for the prime 2
    sep::Int = 50000
    limit::Int64 = div(actualLimit, 100000) * 100000
    smsize::Int = isqrt(limit + 1)
    sievelimit::Int = isqrt(smsize)
    smallsieve::Array{Bool} = trues(smsize + 1)
    i::Int64 = 0
    j::Int64 = 0
    ptop::Int64 = 0
    p::Int64 = 0
    offset::Int64 = 0
    step::Int = 5
    i = 4
    while i <= smsize
        smallsieve[i] = false
        i += 2
    end
    i = 3
    while i <= sievelimit
        if smallsieve[i]
            j = i * i
            while j <= smsize
                smallsieve[j] = false
                j += (i + i)
            end
        end
        i += 2
    end
    ptop = 1
    primes::Array{Int} = zeros(smsize + 1)
    index::Array{Int64} = zeros(smsize + 1)
    primes[1] = 2
    index[1] = 2
    for idx::Int64 = 3 : smsize
        if smallsieve[idx]
            ptop += 1
            primes[ptop] = idx
            index[ptop] = (idx * idx) >> 1
        end
    end
    bigsieve::Array{Bool} = trues(sep)
    while offset < limit
        if offset % 1000000000 == 0
            println(offset)    # progress report
        end
        fill!(bigsieve, true)
        for idx::Int = 2 : ptop
            if index[idx] < sep
                j = index[idx]
                bigsieve[j + 1] = false
                p = primes[idx]
                j += p
                while j < sep
                    bigsieve[j + 1] = false
                    j += p
                end
                index[idx] = j
            end
        end
        for idx::Int64 = 0 : step - 1
            i = idx
            if (2 * i + 1) % 5 == 0
                continue
            end
            #prime::Int64 = 0
            while i < sep
                if (bigsieve[i + 1]) && ((offset + i >> 1 + 1) < limit)
                    # prime = offset + (i >> 1) + 1
                    count += 1
                end
                i += step
            end
        end
        for idx::Int64 = 1 : ptop + 1
            index[idx] -= sep
        end
        offset += (sep + sep)
    end
    if actualLimit > limit
        count += countLargePrimes(limit + 1, actualLimit)
    end
    return count
end
@time full_sieve(10000000000);
# include("E:/Julia/full_sieve_2.jl")
# Pi(10^7) = 664579 in 0.108 seconds
# Pi(10^8) = 5761455 in 0.329 seconds
# Pi(10^9) = 50847534 in 2.726 seconds
# Pi(10^10) = 455052511 in 29.697 seconds
# Pi(10^11) = 4118054813 in 387.659 seconds
# Pi(10^12) = 37607912018 in 5783.603 seconds
@VicDrastik,
You can do deterministic 32-bit primality testing with a single M-R test using a small table of 256 shorts. Alternately just use BPSW which will be 1 Selfridge for most numbers, 2.5-3 for primes (vs. 1 for all numbers with the hash, or 1 for most / 2 for primes with yours, plus the tests for counterexamples).
Khashin's paper is interesting, but be aware that his Frobenius test doesn't obviously match either Grantham or Crandall/Pomerance. I'm quite interested if you can explain the mapping or show the code you used. There are certainly Frobenius pseudoprimes based on Grantham, C&P, etc. But that's really a digression, as there is a better solution.
Use the strong, "almost extra strong", or extra strong Lucas test after a M-R base-2 test. This is the BPSW test that has been in use since 1980. It's the one used by Pari, Maple, Mathematica, Sage, FLINT, Perl/ntheory, and others. It is deterministic to 2^64 and will be faster than the Frobenius test. The total cost is ~ 2.5 - 3 Selfridges (1 for the M-R test, 1.5 - 2 for the Lucas test). It also has the advantage of having much more use behind it in other software compared to Khashin's Frobenius test.
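The overall shape of that combination, sketched in Julia (millerrabin(n, base) and stronglucas(n) are assumed helper names here, not existing primes.jl functions):
function bpsw(n::Integer)
    n < 2 && return false
    for p in (2, 3, 5, 7, 11, 13)        # cheap small-prime screen
        n % p == 0 && return n == p
    end
    # deterministic for n < 2^64; no BPSW counterexample is known at any size
    return millerrabin(n, 2) && stronglucas(n)
end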
Some software likes to add some extra tests for >64-bit. Mathematica adds a M-R base 3, Perl/ntheory adds 1-5 random-base M-R tests depending on size. It's debatable how much that really buys you, considering nobody has yet found the first BPSW counterexample (even when weakening the test by removing the "strong" part of both tests).
ironman353, your sieve's benchmarks are much slower than mine. Here are the timings on a sieve of size 160,000,000:
julia> @time e5(s)
elapsed time: 5.289495638 seconds (24260 bytes allocated)
julia> @time Eratosthenes_1(n)
elapsed time: 15.940241054 seconds (180459292 bytes allocated, 0.65% gc time)
julia> @time Eratosthenes_1(n)
elapsed time: 15.909622857 seconds (180000436 bytes allocated, 0.66% gc time)
julia> @time Eratosthenes_1(n)
elapsed time: 15.95310164 seconds (180000436 bytes allocated, 0.63% gc time)
Takes 3 times as long, and uses lots more space.
danaj,
When you say "small table of 256 shorts" I presume you are referring to the paper of Forisek and Jancina. But "small" is still 512 bytes, which is a high price to pay for a tiny increase in speed. And this speedup only occurs for the primes (and a handful of 2-pseudoprimes), which comprise less than 5% of 32-bit integers. Not worth it.
BPSW is harder to code correctly, and slower to run than my idea. Also, you need to add a squareness test before you do the Lucas part. Again, not worth it.
You said "his Frobenius test doesn't obviously match either Grantham or Crandall/Pomerance" - that is actually the point. Grantham et al. give upper bounds on the error rate of their tests (1/7710 for Grantham, for example) and hope for the best. Khashin's error rate is provably zero for n < 2^60 and he implies that it is also probably zero for some larger n.
If you want to check Khashin's test yourself, download Feitsma's list of pseudoprimes from here http://www.cecm.sfu.ca/Pseudoprimes/index-2-to-64.html or a much shorter list of strong MR(2) pseudoprimes from here https://miller-rabin.appspot.com/
You will need code to find the Jacobi symbol - here is mine:
function jacobisymbol(a::Integer, n::Integer)
    if a == 0 return 0 end
    j = 1
    if a < 0
        if (n & 0b11) == 3 j = -1 end
        a = -a
    end
    if a >= n a = mod(a, n) end
    while a != 0
        r = abs((n & 0b111) - 4)
        while (a & 1) == 0
            a = a >> 1
            if r == 1 j = -j end
        end
        a, n = n, a
        if (a & 0b11) == 3
            if (n & 0b11) == 3 j = -j end
        end
        a = mod(a, n)
    end
    if n == 1 return j else return 0 end
end
Yes, you are right that BPSW is widely used, but that is only because it seems to work. Nobody knows why - the best explanation is that the pseudoprimes for the two parts of the test seem to fall into different residue classes, so that even though the MR test has many pseudoprimes, and the Lucas test even more, they have none in common. It has been conjectured that BPSW pseudoprimes are dense, although, to date, not one has been found. I would wager a large sum that the number of Frobenius pseudoprimes for n<2^128 is less than or equal to the number of BPSW pseudoprimes.
"Some software likes to add some extra tests for >64-bit. Mathematica adds a M-R base 3" - do you know why? Because it is necessary to exclude the squares of the Weiferich primes like 1093^2 and 3511^2 which are MR(2) pseudoprimes. But they do it using an extra MR test because it seems to work for small n, and they do not fully trust the BPSW algorithm.
VicDrastik, yes Forisek and Jancina published this method. There were 2-base and 3-base hashed solutions for 2^49 and 2^64 respectively released earlier (Worley, Berg, Jacobsen). If you think the 512 byte table isn't worth it, that's fine -- I just wanted to let you know about it since you seemed interested in reducing the number of tests.
What do you mean by: "Also, you need to add a squareness test before you do the Lucas part. Again, not worth it." All the definitions and theorems from Khashin's paper have something like "...that it is not a perfect square..." in them. You need a perfect square test to use Khashin's results.
I can't say whether a Lucas test is harder or not, because I still have never seen an implementation of Khashin's method. To my knowledge nobody has used it. You say you have code and that it's much easier than a Lucas test, but only give a Jacobi function. Are you using definition 1.1, doing a perfect square test followed by comparing (1+sqrt(c))^n equiv (1-sqrt(c)) mod n? Definition 1.3 which has 3 variables all unspecified (other than (c/n)=-1? Remark 1.4? The theorems in section 2 all involve a known factor of n so are just used for computational results.
You say "Khashin's error rate is provably zero for n<2^60". The paper doesn't say that. Read the paragraph in the conclusions -- all those results only apply when c < 128, and he hasn't identified numbers where that holds other than to say they are rare. You're left with a test that seems awfully good for small numbers, is probably good for larger ones but we don't know, and has never been used by anyone other than the author. How is this better than the Frobenius-Underwood test which is faster and has zero counterexamples to 2^50 (with none known above), or BPSW with none to 2^64?
Grantham's bounds are proven bounds, and also apply to his QFT which is much more involved (e.g. step 1 is trial division up to 50,000). Khashin gives nothing here. I don't understand how you can compare the QFT 1/7710 bound to what Khashin shows in his paper.
BPSW working: "Nobody knows why" The 1980 paper makes it pretty clear that the numbers tend to lie in different residue classes. Unless you've worked for Wolfram, neither of us actually knows why Mathematica adds the base-3 MR test. It's not necessary for Wieferich primes if a strong Lucas-Selfridge test is done. As for their early code, Pinch's 1993 paper is pretty damning, with incorrect implementations and custom solutions responsible for lots of problems. It could be that they do the base-3 test because their Lucas test is still borked (dubious). Or perhaps they updated it and added the extra base as a hedge against something happening again. Their notes still say "As of 1997, this procedure is known to be correct only for n < 10^16 [...]".
@VicDrastik, full_sieve() is the original function. When the limit is smaller than 10^6, Eratosthenes_1() is called by full_sieve(). The timings of the full_sieve() function are:
Pi(10^7) = 664579 in 0.108 seconds
Pi(10^8) = 5761455 in 0.329 seconds
Pi(10^9) = 50847534 in 2.792 seconds
Pi(10^10) = 455052511 in 29.697 seconds
Pi(10^11) = 4118054813 in 387.659 seconds
Pi(10^12) = 37607912018 in 5783.603 seconds
So it is faster than the function e5(). The full_sieve() function is in the lower part of my previous comment; when running the code, use that function to get the better timings.
@ironman353 sorry, I'm on exams and I don't have much time to think about it, just a small comment: floor(sqrt(x)) == isqrt(x). When I finish I will gladly read the code (if this issue is still open).
@aiorla, thanks for the suggestion. I just started learning julia, so I didn't know about the function isqrt(). I shall be glad if you check the code after your exam.
Let's not confuse algorithms for 32-bit integers with those for 64-bit integers: Khashin's test of course works for 32-bit integers, but it is overkill to use it, and the same goes for BPSW. We can get away with using only M-R for small n i.e. both Khashin and BPSW should only be used for large n. There are 2314 MR(2) 32-bit spsp's, and any other (fixed) MR test reduces this to between 70 and 110 spsp's, normally needing yet another test to clean up (e.g. bases 2,7, and 61). My quadratic residue idea produces only 42 spsp's, but 32 of these can be quickly eliminated, and the remaining 10 can be zapped either using a table or simply 10 (fixed) trial divisions.
Definition 1.1 of Khashin is the correct one (the others are just used in the proof, not in the actual algorithm), so
let c be an odd positive number (it does not need to be a prime, but it always will be)
let d = sqrt(c)
let j = jacobisymbol(c, n)
Then, if n is prime, (1+d)^n === (1 + j*d) (mod n).
Khashin then proves that if we choose the smallest odd c such that j == -1, then there are no Frobenius pseudoprimes less than 2^60 (assuming gcd(2*3*5*7*11*13*17, n) = 1). It is also required that c < 128, but this turns out to be only of theoretical interest.
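In Julia, a direct (unoptimized) reading of Definition 1.1 might look like the sketch below; it assumes the jacobisymbol function above and uses Int128 intermediates instead of Montgomery math, so it is for illustration only:
# Sketch of Khashin's Frobenius test: check (1 + sqrt(c))^n ≡ 1 - sqrt(c) (mod n)
# for the smallest odd c with (c/n) = -1. A pair (a, b) stands for a + b*sqrt(c).
function khashin_frobenius(n::Int64)
    (n < 5 || iseven(n)) && return n == 2 || n == 3
    isqrt(n)^2 == n && return false      # the test requires n not a perfect square
    c = 1
    while true                           # find the smallest odd c with (c/n) = -1
        c += 2
        j = jacobisymbol(c, n)
        j == -1 && break
        j == 0 && return false           # gcd(c, n) > 1, so n is composite
    end
    N, C = Int128(n), Int128(c)
    a, b = Int128(1), Int128(0)          # accumulator, starts as 1
    x, y = Int128(1), Int128(1)          # base 1 + sqrt(c)
    k = n                                # binary ladder for (1 + sqrt(c))^n mod n
    while k > 0
        if isodd(k)
            a, b = (a*x % N + (b*y % N) * C % N) % N, (a*y % N + b*x % N) % N
        end
        x, y = (x*x % N + (y*y % N) * C % N) % N, 2 * (x*y % N) % N
        k >>= 1
    end
    return a == 1 && b == N - 1          # expect 1 - sqrt(c)
end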
I have to concede your point that using the word "provably" is too large a claim. Also, that he does not explicitly claim this unconditional result. When I read Khashin's paper, I just automatically assumed he would always require a MR(2) test first and I immediately tested whether c<128 for base 2 strong pseudoprimes. Here is my code and the results:
function _()
f=open("2-SPRP-2-to-64.txt","r")
g=open("output.txt","w")
count=zeros(Int,999)
for line in eachline(f)
n=uint64(line)
if n==3511^2 continue end
if n==1093^2 continue end
for c=3:2:999
if jacobisymbol(c,n)<1
count[c]+=1
break
end
end
end
for i=1:999
if count[i]>0 println(g,i," , ",count[i]) end
end
println(g,sum(count)+2)
close(g)
close(f)
end
3 , 11515748
5 , 7836069
7 , 5608159
11 , 3275113
13 , 1727914
17 , 926810
19 , 487767
23 , 254188
29 , 130373
31 , 65303
37 , 33069
41 , 16793
43 , 8289
47 , 4218
53 , 2097
59 , 1016
61 , 548
67 , 276
71 , 138
73 , 59
79 , 32
83 , 18
89 , 7
97 , 4
101 , 2
103 , 1
107 , 1
31894014
so, the largest c value needed is c=107. So, yes, this is not a "proof", because it depends on computed results rather than algebra. But, assuming Feitsma's table and my code are correct, this covers all pseudoprimes up to 2^60. Naturally, it does not cover composites up to 2^60 which are not MR(2) pseudoprimes, but it doesn't have to if you do a MR(2) test first.
Several years ago, before Khashin's paper, I had a lengthy exchange of emails with Paul Underwood about his Frobenius-style test. Writing them both in a common notation, we can see the similarities
Underwood-Frobenius: for any integer p such that jacobisymbol(p^2-4, n) == -1, (2+x)^(n+1) === 5 + 2*p mod (x^2 - p*x + 1), (mod n)
Khashin-Frobenius: for the smallest odd integer c such that jacobisymbol(c, n) == -1, (1+x)^(n+1) === 1 - c mod (x^2 - c), (mod n)
Underwood's test is not faster (if anything, it has to be slightly slower). I suggested to Paul that he should run his test on Feitsma's Fermat(2) database, and he found no counterexamples, so we should not be surprised when Khashin's version also passes. (When I have time, I will run Khashin's test on Feitsma's full data set, not just the strong pseudoprimes.)
Grantham's seminal paper was great, but several authors (including Khashin) have since suggested that his 1/7710 bound was far too pessimistic.
The only modern test that can be compared to Khashin's is Loebenberger's version of the Crandall/Pomerance Frobenius test. Here is that algorithm in Python:
from math import gcd
from random import randint
# is_square, jacobi, inverse and bits_of are assumed helper functions;
# bits_of(m) should yield the bits of m, most significant bit first

def frobenius(n):
    """Return True if n is probably prime."""
    if n % 2 == 0 or n < 2:
        return False
    # step 1
    while True:
        a = randint(1, n - 1)
        b = randint(1, n - 1)
        d = a ** 2 - 4 * b
        if not (is_square(d) or gcd(2 * a * b * d, n) != 1):
            break
    # step 2
    m = (n - jacobi(d, n)) // 2
    w = (a ** 2 * inverse(b, n) - 2) % n
    u = 2
    v = w
    # step 3
    for bit in bits_of(m):
        if bit:
            u = (u * v - w) % n
            v = (v * v - 2) % n
        else:
            v = (u * v - w) % n
            u = (u * u - 2) % n
    # step 4
    if (w * u - 2 * v) % n != 0:
        return False
    return (pow(b, (n - 1) // 2, n) * u) % n == 2
The Lucas part of this takes only 2 selfridges, rather than 3 for Khashin/Underwood, but the parameters are random, and so finding pseudoprimes is not possible.
I suspect a new issue should be here for isprime, as sieving is really another subject. I strongly suspect a well-written SoE will beat the SoA by a wide margin. Optimizing the SoA is hard and makes very unreadable code, and is still slower than a SoE with equally complex code.
@VicDrastik Making the reasonable assumption that we want to simplify 32-bit for performance rather than using the same test, how about this:
a) hash n down to 4 bits:
x = n;
x = (((x >> 16) ^ x) * 0x45d9f3b) & 0xFFFFFFFFUL;
x = ((x >> 16) ^ x) & 15;
b) use it to index into a table of bases:
static const uint16_t mr_bases_second[16] =
{2249,483,194,199,15,369,499,945,419,735,33,471,946,615,497,702};
c) M-R( mr_bases_second[x] )
Two bases, deterministic for all 32-bit numbers with only 16 small bases in a table, and no exceptions cases (no requirement for any trial division either).
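Putting a), b) and c) together, a Julia sketch might look like this (millerrabin(n, base) is an assumed strong probable-prime helper, and the guard for even n is an addition for safety, since M-R wants odd input):
const MR_BASES_SECOND = [2249, 483, 194, 199, 15, 369, 499, 945,
                         419, 735, 33, 471, 946, 615, 497, 702]

function isprime32(n::UInt32)
    n < 2 && return false
    n % 2 == 0 && return n == 2              # M-R needs odd input
    millerrabin(n, 2) || return false        # first test: M-R base 2
    x = ((n >> 16) ⊻ n) * 0x045d9f3b         # UInt32 arithmetic wraps mod 2^32
    x = ((x >> 16) ⊻ x) & 0x0f
    return millerrabin(n, MR_BASES_SECOND[x + 1])   # second test: hashed base
end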
For 64-bit, step 3 is some other test. I prefer a (strong,AES,ES) Lucas test making BPSW, you prefer Khashin's Frobenius test. For 64-bit inputs they're equally good (anything will work here as long as you run it through the SPSP-2 database), so now we're debating performance and coding complexity for 64-bit, and performance + probability for larger (er...are we discussing larger than 64-bit here, yes? If not then almost all of this discussion is moot -- just use what passes the SPSP-2 database and runs fastest).
On to Khashin. I wrote a version last night that works with definition 1.1, using binary ladder exponentiation and treating the multiplication like complex-number multiplication. It's slower than a Lucas test, but the code could be optimized, and Montgomery math would speed it up a bit. Improvements or suggestions welcome. (In the fragment below, UV and the addmod/mulmod/sqrmod, kronecker_uu, is_perfect_square, is_prob_prime and croak helpers come from Perl/ntheory's C internals.)
int k = 0;
UV c = 1;
if (is_perfect_square(n)) return 0;
while (k != -1) {
c += 2;
k = kronecker_uu(c, n);
}
/* Next line isn't necessary */
if (!is_prob_prime(c)) croak("Khashin invalid c for %lu\n", n);
{
UV ra = 1, rb = 1, a = 1, b = 1, k = n-1;
while (k) {
if (k & 1) {
UV ta=ra, tb=rb;
ra = addmod( mulmod(ta,a,n), mulmod(mulmod(tb,b,n),c,n), n );
rb = addmod( mulmod(tb,a,n), mulmod(ta,b,n), n);
}
k >>= 1;
if (k) {
UV ta=a, tb=b;
a = addmod( sqrmod(ta,n), mulmod(sqrmod(tb,n),c,n), n );
b = mulmod(ta,tb,n);
b = addmod(b,b,n);
}
}
return (ra == 1 && rb == n-1);
}
To be faster than Underwood's test it would need to be written differently. His test uses only 2 full mulmods per bit. AES Lucas uses only 2, Lucas with Q=+/-1 uses 2. This uses 4+. This is a simple literal translation of (1+1D)^n mod n [D=sqrt(c)], so could likely be improved.
The count of c values over SPSP-2s I get is similar to yours, though not exact. Your Jacobi function has two issues. The first and most important is that the while (a is even) loop needs to check n%8==3 and n%8==5. You're only testing the latter. The second issue might be irrelevant in your language, but be careful of taking in signed integers when the input can be 64-bit unsigned.
I'm perfectly fine with saying "assuming the Feitsma/Galway database is correct" -- I certainly use that assumption. The math involved in restricting the candidates has been double checked, though that still leaves the computation.
I've run the Frobenius-Khashin code above over the full PSP database. It only took 15 minutes to run. No prefilters other than n%2.
Re Paul Underwood: his minimal lambda+2 test was first posted on 30 Apr 2012, the updated version on 21 May 2013. Five days after his new test was released I had tested over both Feitsma databases and notified him. I released it in my software in June 2013.
I released a version of Loebenberger's modification to the C/P Frobenius in January to my GMP project. It's just used internally however.
I think the point of the F-U test is that you can use it without a Miller-Rabin test in front, hence the overall speed is comparable to BPSW (sometimes faster) and quite a bit faster than a standard Frobenius test. The problems I see are:
Basically I have the same objections to Khashin's test as I do to F-U. If done with a M-R base 2 test in front then (2) and (3) still apply. For some projects that want more certainty (my prime gap verifier for instance) I've used BPSW followed by F-U. If we use C/P or Loebenberger style, then I'd say use the Selfridge parameter selection so we get BPSW with the extra check (Frobenius) at the end. That gives you some extra security at little cost.
While this conversation is fascinating, I feel like it might be better-suited for the julia-users mailing list. I'm not aware of any applications for which having primality testing as part of the standard library is an absolute requirement when it could just as well be implemented in a package, but correct me if you know of some. Perhaps the gmp integration calls for having at least a basic implementation. A PR to improve what we have right now would be welcome. Number-theory-related packages can go ahead and implement all sorts of different algorithmic variants, if there isn't consensus on a single best choice.
Your 32-bit hash idea is just what we need - it is simpler, faster, shorter, and although its table uses 12 more bytes than mine, the saving in code length is surely far more than this, so it is better in every way. Pure gold!
I wonder if we can squeeze even more performance out of this idea - using just a single base 2 MR test, we can correctly test up to n = 2046. If we did a few trial divisions and then used a hash with 16 MR bases, how large could n be? 2^16, maybe?
My reason for liking Khashin's Frobenius test is that it has strong theoretical justification and so can be relied upon for n>2^64 more than the older methods. But, as you point out, it is clearly at least 50% slower for n<2^64 (on primes, that is). I have just realised that for n>2^64, Julia will promote to BigInts and use external libraries, and so performance will be very poor for any method, including Khashin. So your suggestion that Loebenberger's version of C/P can be equivalent to BPSW just by using Selfridge parameter selection sounds good - do you mean that we choose b=2, d=a^2-4b,j=jacobisymbol(d,n) and then look for a value for a which makes j==-1 ? If so, this is definitely simpler and faster for n<2^64.
You are right about Montgomery modular multiply - eventually it will be necessary to use this to avoid automatic promotion to BigInts.
About the Jacobi symbol - first, I was not counting j==-1, but j<1, because j==0 immediately shows that n is composite. Second, my Jacobi code is OK; in the "a is even" loop, it tests both cases with a single comparison because r is defined such that r==1 covers both n%8==3 and n%8==5. (r=abs((n&0b111)-4))
Good to know that Khashin-Frobenius works on Fermat psp's and not just MR spsp's.
@VicDrastik Where do you see julia promoting to BigInt?
I get the following pretty strange results (no promotion at all, and results going from positive to negative):
julia> 2^63
-9223372036854775808
julia> 2^62
4611686018427387904
julia> 2^64
0
julia> Int128(2)^64
18446744073709551616
julia> Int128(2)^128
0
julia> Int128(2)^127
-170141183460469231731687303715884105728
Julia doesn't promote to BigInt, and the results you are getting are from integer overflow (and aren't strange, or at least are well-defined), e.g.
julia> typemin(Int64) == 2^63
true
@VicDrastik We can shorten to only 11 values (use mod 11 in the hash instead of & 15): 12829, 8621, 11795, 1994, 4190, 3407, 603, 627, 351, 11147, 10734 Of course double check once implemented! That is M-R base 2, then the selected second base. No trial division requirements necessary other than being odd.
I haven't found deeper trial division to be very helpful in making smaller hashes. It just doesn't remove enough values. As for trial division helping with an early exit, factoring to 53 (not unreasonable) gets you to 42799. You have to go up to 157 to get past 49141 and then you hit 104653. Then you need 229 to get you to 256999, and so on.
If you do trial factoring through 53 at the start, then you have your answer for all n < 3481 (59*59) without running a single M-R test. I'd then split off small numbers into a trial factoring loop (say below 500k). It all depends on relative speeds in your language/implementation, so the best cutoff will be different.
Re: Loebenberger's Frobenius, I was thinking of the Selfridge selection (method A from Baillie/Wagstaff 1980): P = 1; D = 5, -7, 9, -11, 13, ... until (D|n) = -1; then Q = (1-D)/4. This should then be applicable to Loebenberger with a = P, b = Q. For 64-bit you can stop after the Lucas test part. Double check that logic though. There are still pseudoprimes from the Lucas part of course, they just don't overlap with the M-R base 2 pseudoprimes under 2^64.
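A sketch of that parameter search, assuming a jacobisymbol(a, n) like the one posted above (n should already be odd and known not to be a perfect square, or the loop may not terminate):
# Selfridge "method A": P = 1, D = 5, -7, 9, -11, ... with (D|n) = -1, Q = (1-D)/4
function selfridge_params(n::Integer)
    D = 5
    while jacobisymbol(D, n) != -1
        D = D > 0 ? -(D + 2) : -(D - 2)   # 5, -7, 9, -11, 13, ...
    end
    return (1, (1 - D) ÷ 4, D)            # (P, Q, D)
end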
Loebenberger's theorem 6 is very similar to the conditions for an extra strong Lucas pseudoprime, and the V condition is mentioned in Baillie/Wagstaff as well as an extra test with low cost. You could also look into the "almost extra strong" and "extra strong" Lucas tests, which use the V condition, with Q=1; P=3,5,7,... until (D|n)=-1, to make a very fast implementation. This is used by Pari, among others (Pari increments P by 2, Baillie suggests by 1, I don't think it matters). These still have readily findable pseudoprimes, so for small numbers they seem less stringent than the Frobenius tests we've discussed (Underwood, Khashin, or (P,2)).
@IainNZ That's the sort of thing I'd have expected more from C, not from Julia, and I've already fixed one case where suddenly switching from positive to negative was deemed incorrect.
To me, it seems that this should get an InexactError instead of giving an incorrect result (which is what you get if you try to convert 128 to an Int8, for example).
It's essential for fast integer performance though, so that ship sailed a long time ago. Note that there is a subtle distinction between performing operations on a type and converting between types, e.g.
julia> int(BigInt(2)^64)
ERROR: InexactError()
in convert at gmp.jl:132
in int at base.jl:108
julia> 2^64
0
Int64 is a well-defined thing, it just wraps around. Mashing arbitrary-size numbers into an Int64 doesn't go; you cannot perform the operation meaningfully, hence the error. At least that's my reasoning.
@IainNZ I just brought this up because of @VicDrastik's comment:
I have just realised that for n>2^64, Julia will promote to BigInts and use external libraries
which did not seem correct to me (and also, there is support for Int128 as well in Julia; does that end up using external libraries?)
I also worked for decades on a language where all overflows were reported as errors...
"I also worked for decades on a language where all overflows were reported as errors..."
" (I've been using and implementing structures [multidimensional associative arrays], for the past 29 years,"
"I've spent almost 30 years as one of the developers/architects for a language/database system where that is done all the time."
"I came up with my own packed scheme for storing Unicode data much more efficiently back 18 years ago..."
"Over the years (27 dealing with national character sets, 20 with Unicode) I did a lot of work doing just these sorts of operations..."
" Also, as a C programmer for 35 years, I disagree with saying that in common parlance string means NUL-terminated... "
How is this relevant? What should we do with all this information about what @ScottPJones has done for the last x years?
Yes, let's avoid autobiographical information. It alienates some people and distracts from the technical discussions we need to have.
OK, forget the number of years that I've been dealing with these issues, that doesn't take away from the technical points I've been trying to illustrate, about whether languages can/should report overflows as errors, about associative array design, schemes for storing Unicode (look at SCSU and BOCU-1), issues with 8-bit, multibyte and Unicode, and use of NUL-termination in C code/libraries.
OK, forget the number of years that I've been dealing with these issues, that doesn't take away from the technical points I've been trying to illustrate,
But the issue is precisely that your tone does affect your ability to have effective technical conversations. People need to be excited to collaborate with you, which means that you need to write prose that doesn't upset anyone.
Fine, I'm trying to learn better how to interact here... It is frustrating though when people pop up with a many line comment just to criticize me personally, that has absolutely nothing to do with the technical discussion at hand...
The technical point I was making in that sentence that bothered @KristofferC, was that other languages do give an error, or promote to a larger size... Python for example:
>>> pow(2,64)
18446744073709551616
(We have been a little overly harsh on @ScottPJones, hopefully we can all find a better way of working together so people avoid getting on each other's nerves quite as badly.)
other languages do give an error, or promote to a larger size... Python for example:
Do any other languages do so and still maintain integer arithmetic performance comparable to C?
I'd like that!
I totally understand why overflow is not checked for the basic operations; however, exponentiation was something that I didn't really view as being in the same class as +, -, *, /, or <<.
It's expensive enough that I would have thought overflow checking could be done without a large % penalty. I also think it would be nice, somehow, to have a "checked" arithmetic mode... just as julia normally does bounds checking, but you can skip it with @inbounds.
It seems that all math in julia is currently unsafe... (maybe we need add and unsafe_add, etc.!)
@ScottPJones, see https://github.com/JuliaLang/julia/issues/855
Compare unchecked code for x^2 versus checked code for x*x (not even checked x^2):
julia> sqr(x) = x^2
sqr (generic function with 1 method)
julia> @code_native sqr(123)
.section __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
pushq %rbp
movq %rsp, %rbp
Source line: 1
imulq %rdi, %rdi
movq %rdi, %rax
popq %rbp
ret
julia> checked_sqr(x) = checked_mul(x,x)
checked_sqr (generic function with 1 method)
julia> @code_native checked_sqr(123)
.section __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 1
pushq %rbp
movq %rsp, %rbp
pushq %r15
pushq %r14
pushq %rbx
subq $40, %rsp
movq %rdi, %r14
movq $6, -64(%rbp)
Source line: 1
movabsq $jl_pgcstack, %rax
movq (%rax), %rcx
movq %rcx, -56(%rbp)
leaq -64(%rbp), %rcx
movq %rcx, (%rax)
xorps %xmm0, %xmm0
movups %xmm0, -48(%rbp)
movq $0, -32(%rbp)
movabsq $4619782280, %rax ## imm = 0x1135C4888
Source line: 1
movq (%rax), %rbx
testq %rbx, %rbx
je L218
movq %rbx, -48(%rbp)
movabsq $jl_box_int64, %r15
movq %r14, %rdi
callq *%r15
movq %rax, -40(%rbp)
movq %r14, %rdi
callq *%r15
movq %rax, -32(%rbp)
movq -8(%rbx), %rax
shrq $2, %rax
cmpq $1141923900, %rax ## imm = 0x4410603C
je L176
Source line: 1
leaq -48(%rbp), %rsi
Source line: 1
movabsq $jl_apply_generic, %rax
movabsq $4591455440, %rdi ## imm = 0x111AC0CD0
movl $3, %edx
callq *%rax
jmpq L190
L176: leaq -40(%rbp), %rsi
movq %rbx, %rdi
movl $2, %edx
callq *(%rbx)
L190: movq -56(%rbp), %rcx
movabsq $jl_pgcstack, %rdx
movq %rcx, (%rdx)
addq $40, %rsp
popq %rbx
popq %r14
popq %r15
popq %rbp
ret
L218: movabsq $jl_undefined_var_error, %rax
movabsq $13174591616, %rdi ## imm = 0x311445080
callq *%rax
While it's possible we could do a better job with the checked version, you'll note that throwing errors on really basic integer operations is a performance trap, even for things like ^. People write x^2 all the time, expecting it to be equivalent to writing x*x.
That said, let's keep this issue focused on prime sieve performance.
@StefanKarpinski You forgot to import checked_mul. The checked version is worse, but it's not quite that bad ;):
julia> @code_llvm checked_sqr(2)
; Function Attrs: sspreq
define i64 @julia_checked_sqr_21097(i64) #-1 {
top:
%StackGuardSlot = alloca i8*
%StackGuard = load i8*, i8** @__stack_chk_guard
call void @llvm.stackprotector(i8* %StackGuard, i8** %StackGuardSlot)
%x = alloca i64
store i64 %0, i64* %x
%1 = load i64, i64* %x
%2 = load i64, i64* %x
%3 = call { i64, i1 } @llvm.smul.with.overflow.i64(i64 %1, i64 %2)
%4 = extractvalue { i64, i1 } %3, 1
%5 = load %jl_value_t*, %jl_value_t** @jl_overflow_exception
%6 = xor i1 %4, true
br i1 %6, label %pass, label %fail
fail: ; preds = %top
call void @jl_throw_with_superfluous_argument(%jl_value_t* %5, i32 1)
unreachable
pass: ; preds = %top
%7 = extractvalue { i64, i1 } %3, 0
call void @llvm.stackprotectorcheck(i8** @__stack_chk_guard)
ret i64 %7
}
@code_native checked_sqr(2)
Filename: none
Source line: 1
Source line: 1
.section __TEXT,__text,regular,pure_instructions
subq $24, %rsp
movabsq $__stack_chk_guard, %rax
movq (%rax), %rcx
movq %rcx, 16(%rsp)
movq %rdi, 8(%rsp)
imulq %rdi, %rdi
jo L51
movq (%rax), %rax
cmpq 16(%rsp), %rax
jne L81
movq %rdi, %rax
addq $24, %rsp
retq
L51: movabsq $jl_overflow_exception, %rax
movq (%rax), %rdi
movabsq $jl_throw_with_superfluous_argument, %rax
movl $1, %esi
callq *%rax
L81: movabsq $__stack_chk_fail, %rax
callq *%rax
Maybe these last two very interesting and informative messages could be deleted from here and reposted in #855, to keep this focused more on sieve performance?
I am a bit confused. A lot of the discussion here is about improving the isPrime function. I thought we wanted a better sieve. In that case a well-written sieve of Eratosthenes almost always beat sieve of Atkins. If we want to only count primes in a range rather than generating all primes, then there are very fast methods based on the Meissel-Lehmer prime counting algorithm. Is this discussion about prime sieving, prime counting, or primality testing? Someone please clarify.
The original proposal was to improve support for basic number theory in primes.jl in core Julia. The discussion has been useful for that purpose, but an exotic item like counting primes in a range is better implemented in an add-on package.
ironman353: "...a well-written sieve of Eratosthenes almost always beat sieve of Atkins."
Completely agree, other than the name being Atkin.
For counting, Legendre and Meissel are relatively easy. Lehmer, LMO, and eLMO are a lot more code (there are at least two open source implementations available for each). I suspect it's all better put in a package, but that's up to someone else.
isprime (or is-prime or is_prime or is_prob_prime or ...) seems like a nice basic function to have, and something that ought to be fast for small inputs since it tends to often get put in a loop.
This is a very interesting read; I still don't understand all that you are saying, but I'd like to post here my implementation of the Atkin sieve:
Ported directly from the Wikipedia pseudocode:
I'm amazed to see that it performs so well for a naive implementation, and it won me a little challenge at my school :laughing:
Tested at JuliaBox with version 0.4.0-dev+5491 and a limit of 10^8, it beats Base.primes in both time and space.
With a limit of 10^9 on a Mac and version 0.3.9 (just changed round(Int, sqrt(limit)) for int(sqrt(limit))), primes uses less memory and spends less time in the gc, but it's slower:
julia> versioninfo()
Julia Version 0.3.9
Commit 31efe69 (2015-05-30 11:24 UTC)
Platform Info:
System: Darwin (x86_64-apple-darwin13.4.0)
CPU: Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.3
julia> gc(); @time primes(1000_000_000);
elapsed time: 72.423236327 seconds (531889264 bytes allocated, 0.02% gc time)
julia> gc(); @time atkin(1000_000_000);
elapsed time: 27.908726228 seconds (2342278320 bytes allocated, 0.17% gc time)
I'm testing gc(); @time primes(10_000_000_000); I hope it doesn't die on me. I can't test more than 10^8 at JuliaBox without the kernel dying.
I'll study in order to understand primesmask; here is my last test:
julia> gc(); @time primes(10_000_000_000);
elapsed time: 809.601231674 seconds (4890727200 bytes allocated, 0.00% gc time)
julia> gc(); @time atkin(10_000_000_000);
elapsed time: 332.286719798 seconds (160351721104 bytes allocated, 0.32% gc time)
I've been working on a faster sieve (as some of you may already know) in #12025, and I was wondering if there is really a need for someone to have the primesmask function as it is, or if it would be enough to have a wheel sieve as mentioned by @VicDrastik before.
If there is no need for primesmask to generate a mask that includes all the integers between 1 and n, then I can implement a much faster version of primes. That also applies to the function primes(lo::Int, hi::Int) that I implemented in #12025.
Someone emailed me directly with substantial performance improvements to our prime sieve implementation. Unfortunately, they didn't give permission to use the code under the MIT license and haven't replied to me, so I'm going to post the non-code parts of their email so that someone else who hasn't seen their code can do an independent implementation of these improvements.
It would have been nice to just have a pull request fixing this (with an implied MIT license) since this individual went to the trouble of making and benchmarking these improvements, but instead here we are.