aleaxit / gmpy

General Multi-Precision arithmetic for Python 2.6+/3+ (GMP, MPIR, MPFR, MPC)
https://gmpy2.readthedocs.io/en/latest/
GNU Lesser General Public License v3.0

Speed of power function #290

Open wjamesjr opened 3 years ago

wjamesjr commented 3 years ago

```python
from time import time
import gmpy2
from gmpy2 import mpfr
import mpmath as mpm
from mpmath import mpf
import math

def primes_less_than(n):
    if n < 3:
        return []
    if n < 4:
        return [2]
    if n < 6:
        return [2, 3]
    correction = (n % 6 > 1)
    n = {0: n, 1: n-1, 2: n+4, 3: n+3, 4: n+2, 5: n+1}[n % 6]
    sieve = [True] * (n//3)
    sieve[0] = False
    for i in range(int(n**0.5)//3+1):
        if sieve[i]:
            k = 3*i+1 | 1
            sieve[((k*k)//3)::2*k] = [False]*((n//6-(k*k)//6-1)//k+1)
            sieve[(k*k+4*k-2*k*(i & 1))//3::2*k] = [False]*((n//6-(k*k+4*k-2*k*(i & 1))//6-1)//k+1)
    return [2, 3] + [3*i+1 | 1 for i in range(1, n//3-correction) if sieve[i]]

n = 300000         # bernoulli number
wp = 4230387       # bit precision
nn = 17574         # upper bound for primes
primes = primes_less_than(nn)
threshold = int(math.log2(n) * 2)

# mpmath

mpm.mp.prec = wp

time1 = time()
zeta_prod = f_one = mpf('1')
for p in primes:
    pprec = int(wp - n * math.log(p, 2))
    if pprec < threshold:
        break
    mpm.mp.prec = pprec
    p_mn = mpf(p) ** -n
    mpm.mp.prec = wp
    zeta_prod *= f_one - p_mn
zeta = f_one / zeta_prod
print("mpmath time:", time() - time1)

# gmpy2

gmpy2.set_context(gmpy2.context())
gmpy2.get_context().emax = 1000000000
gmpy2.get_context().emin = -1000000000
gmpy2.get_context().precision = wp

time1 = time()
zeta_prod = f_one = mpfr('1')
for p in primes:
    pprec = int(wp - n * math.log(p, 2))
    if pprec < threshold:
        break
    gmpy2.get_context().precision = pprec
    p_mn = mpfr(p) ** -n
    gmpy2.get_context().precision = wp
    zeta_prod *= f_one - p_mn
zeta2 = f_one / zeta_prod
print("gmpy2 time:", time() - time1)

# compare the two zetas

zeta3 = mpfr(str(zeta))
gmpy2.get_context().precision = 53
zeta_diff = zeta3 - zeta2
print(zeta_diff)

"""
results:
mpmath time: 76.67497706413269
gmpy2 time: 177.52319884300232
3.416228330766826e-1273472

1273472 * log(10) / log(2) = 4230382.4 (vs. bit precision of 4230387)
which indicates the two zetas are equal

mpmath.__version__    '1.2.1'
gmpy2.version()       '2.1.0b5'
gmpy2.mpfr_version()  'MPFR 4.1.0'
"""
```

I am developing a multiprocessing Bernoulli number algorithm for Python. At the moment I am using mpmath, since the gmpy2 version was about twice as slow. Now that I am nearing completion, I had the time to further research what the slowdown was. It seems that exponentiation by an integer is about 2.3 times slower using gmpy2 than mpmath. I have included a script with code segments that closely match the single-threaded, time-dominant portion of the algorithm in both mpmath and gmpy2. I am fairly new to both mpmath and gmpy2, so there is a chance that I am not optimizing one or the other, but the code is as fair to both as I know how to make it. Is it expected for gmpy2 to be this much slower in this case? I was expecting the opposite. Thanks, Bill James

casevh commented 3 years ago

I fixed many performance issues with mixed operands in what will be released as 2.1.0b6 (or possibly as 2.1.0rc1). If you are running on Linux and can compile gmpy2 from source, can you try compiling the current master version from github? If you're using Windows, I'm still working on compiling binaries.

Can you attach a file containing the exact source code to test? I'll run it on both versions.

Let's assume that gmpy2 is installed (so mpmath is using it as its backend). Then I'm not too shocked that mpmath.mpf can be faster than gmpy2.mpfr when precisions are large: they both use GMP for the high-precision arithmetic. The difference comes down to overhead (the MPFR library should have less overhead, but I introduced a significant regression that should be fixed in the current code) versus the efficiency of the algorithm (mpmath sometimes uses different algorithms).

wjamesjr commented 3 years ago

bernoulli_gmpy2.py.zip

I am running on a Mac and would probably have difficulty building it. I had deleted the gmpy2 version of the Bernoulli code, but I made a stab at regenerating it in the attached file. As noted in the original post, almost the entire time difference between the mpmath and gmpy2 runs is in the pow calls. The timings in the comments of the full code are what I get running the mpmath version versus the bernfrac call in mpmath. The attached file shows the times for bernfrac versus the algorithm in the script, which are very similar until multiprocessing kicks in. The original post is for sure apples to apples. I am attaching it too. test.py.zip

I tried replacing the power calls with a simple Knuth power-tree exponentiation algorithm from Rosetta Code. It had a huge effect on the gmpy2 power (~30-40%) but a negligible effect on the mpmath power. That indicates something amiss somewhere in the gmpy2 implementation. See the two power calls with the tree-power replacement in the comments.
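For reference, a power-tree exponentiation along the lines of the Rosetta Code task reads roughly as follows (a sketch with illustrative names, not the code from the attached script):

```python
import gmpy2
from gmpy2 import mpfr

def power_tree_path(n):
    """Path from 1 to n in Knuth's power tree (an addition chain for n).
    The tree grows quickly, so this sketch is only practical for
    moderate exponents."""
    cache = {1: [1]}
    level = [1]
    while n not in cache:
        next_level = []
        for x in level:
            for y in cache[x]:          # y lies on the path from 1 to x
                z = x + y
                if z not in cache:
                    cache[z] = cache[x] + [z]
                    next_level.append(z)
        level = next_level
    return cache[n]

def tree_pow(x, n):
    """x ** n using one multiplication per addition-chain element."""
    if n == 0:
        return x ** 0
    powers = {0: x ** 0, 1: x}
    prev = 0
    for i in power_tree_path(n):
        # each path element is prev + j for some earlier element j,
        # so powers[i - prev] has already been computed
        powers[i] = powers[i - prev] * powers[prev]
        prev = i
    return powers[n]

gmpy2.get_context().precision = 1000
print(tree_pow(mpfr(3), 191))   # compare with mpfr(3) ** 191
```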

casevh commented 3 years ago

I reconstructed the code from your first message. The latest gmpy2 code isn't appreciably faster. I'm now curious...

casevh commented 3 years ago

I got it closer, but mpmath still wins. The MPFR library implements an optimized `(unsigned long) ** (unsigned long)` function that I was not using. I added that special case and now the running times are 66 seconds for mpmath and 88 seconds for gmpy2. I had to rewrite one line of Python code as `p_mn = 1 / (mpfr(p) ** n)`.
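Presumably the special case is MPFR's `mpfr_ui_pow_ui`, which takes two unsigned longs; rewriting with a positive exponent lets it apply. A minimal illustration of the two forms (only the rewritten line itself comes from the comment above):

```python
import gmpy2
from gmpy2 import mpfr

gmpy2.get_context().precision = 100000

p, n = 17, 30000
a = mpfr(p) ** -n         # negative exponent: the ui ** ui path cannot apply
b = 1 / (mpfr(p) ** n)    # positive exponent, then one division: eligible
print(gmpy2.reldiff(a, b))  # ~0: the two agree to within the final roundings
```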

I have another optimization that I'll test over the next few days that might get them closer.

wjamesjr commented 3 years ago

That is already a big improvement. Looking forward to what you come up with in the end. I think mpmath is doing the exponentiation in floating point with standard exponentiation by squaring, but I could be wrong. In any event, the speedup you are working on will no doubt benefit a lot of code. A number of algorithms that I have been experimenting with are dominated by similar power calls.
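For reference, this is textbook exponentiation by squaring; an arbitrary-precision float version does the same thing but also rounds each intermediate result to the working precision:

```python
def pow_by_squaring(x, n):
    """x ** n for integer n >= 0 using O(log n) multiplications."""
    result = 1
    while n:
        if n & 1:        # this bit of n is set: fold the current square in
            result *= x
        x *= x           # square for the next bit
        n >>= 1
    return result

assert pow_by_squaring(3, 20) == 3 ** 20
```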

wjamesjr commented 3 years ago

I spent some time today extracting the integer power code from mpmath and created a test script using it. I even got it working with mpfrs as input and output. However, I lost all the speed importing and exporting the mpfrs. But at least the script shows how mpmath does it. I am attaching the test code below. I also had to apply mpmath's normalization before the power function to make it work for some mpfr input values. Evidently the power code does not like how the mpfrs are normalized?

pow_int.py.zip

BTW, is there a way to create an mpfr from a tuple (sign, man, exp)? I'm sure the way I did it was very inefficient. I tested it with my Bernoulli code and it ended up slower, even though it worked.
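For what it's worth, one way to sketch such a constructor from documented gmpy2 building blocks is `mul_2exp`/`div_2exp` (a hypothetical helper; the shift counts for those calls must be non-negative, and whether this route is any faster would need measuring):

```python
import gmpy2
from gmpy2 import mpfr

def mpfr_from_tuple(sign, man, exp):
    """Hypothetical helper: (-1)**sign * man * 2**exp as an mpfr,
    rounded to the current context precision."""
    x = mpfr(man)   # rounds the integer mantissa to context precision
    x = gmpy2.mul_2exp(x, exp) if exp >= 0 else gmpy2.div_2exp(x, -exp)
    return -x if sign else x

gmpy2.get_context().precision = 64
print(mpfr_from_tuple(0, 5, -3))   # 0.625
```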

wjamesjr commented 3 years ago

pow_int.py.zip

When I brought over the mpmath code, I left out a bit that appears to be significant. The attached version includes it. It still produces answers that are sometimes a bit off between mpmath and the mpfr version at the same precision. Anyway, all I wanted was to show the mpmath code, which it does. I doubt that this helps in any way.

wjamesjr commented 3 years ago

I decided to concentrate on a function that only handles integer powers and updated the code with a function `mpfr_pow_int_int(m, n, prec)`, which returns an mpfr equal to m ** n where m and n are integers. I brought in a function from mpmath that creates an mpf tuple from a man and an exp; this saved the step of converting the mpfr integer to mpf and normalizing it so that the mpmath code would accept it. I converted the first integer using man = the integer and exp = 0. I tested the function in my Bernoulli code and it brought the time ratio down from 2:1 to about 1.25:1 or 1.3:1. So far it has worked for all test cases that I have tried. I think the remaining overhead is in using mul_2exp to get the mpf value back to an mpfr. The new code is attached.

pow_int.py.zip
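For readers without the attachments, a rough reconstruction of what such a function might look like, based only on the description above (not the attached code): exponentiation by squaring on a (man, exp) pair with the mantissa truncated after each multiply, and a single `mul_2exp`/`div_2exp` conversion at the end. It assumes m > 0, n >= 0, and a context whose emin/emax can accommodate the final shift:

```python
import gmpy2
from gmpy2 import mpfr

def _truncate(man, exp, wp):
    # keep at most wp mantissa bits so the big-integer multiplies stay cheap
    extra = man.bit_length() - wp
    if extra > 0:
        return man >> extra, exp + extra
    return man, exp

def mpfr_pow_int_int(m, n, prec):
    """Sketch: m ** n as an mpfr, for integers m > 0 and n >= 0."""
    wp = prec + 32                  # a few guard bits (arbitrary choice)
    bman, bexp = m, 0               # base represented as man * 2**exp
    rman, rexp = 1, 0               # running result
    while n:
        if n & 1:
            rman, rexp = _truncate(rman * bman, rexp + bexp, wp)
        bman, bexp = _truncate(bman * bman, bexp + bexp, wp)
        n >>= 1
    gmpy2.get_context().precision = prec   # final rounding target
    r = mpfr(rman)
    return gmpy2.mul_2exp(r, rexp) if rexp >= 0 else gmpy2.div_2exp(r, -rexp)
```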

wjamesjr commented 3 years ago

I plugged the mpfr_pow_int_int function into the original test code that I posted and got the results below:

mpmath time: 74.97047686576843
gmpy2 time: 99.737459897995
-8.1239576158479399e-1273471

On the order of the same ratio as the last numbers that you posted. Not bad for an attempt done entirely on the Python side.

wjamesjr commented 3 years ago

Added a function `mpfr_pow_int(m, n, prec)` that accepts an mpfr parameter as the base. Cleaned up the code and made some changes needed for negative inputs and bases.

pow_int.py.zip

wjamesjr commented 3 years ago

I now think that the Python pow functions I created from the mpmath code are equal, performance-wise, to the mpmath version; in fact they might be slightly faster. The remaining time difference is in the following line of the test code:

zeta_prod *= f_one - p_mn

If this line is commented out (in both places) in the test code below, it is a virtual tie.

The latest versions of the Python replacement code and the test code are below:

pow_int.py.zip pow_test.py.zip

wjamesjr commented 3 years ago

Changing the line `zeta_prod *= f_one - p_mn` in the test code to `zeta_prod = zeta_prod - zeta_prod * p_mn`, I get the results below:

mpmath time: 47.429802894592285
gmpy2 time: 46.22661113739014
-8.1239576158479399e-1273471

With the inclusion of the mpfr_pow_int_int code based on the mpmath algorithm and the change of the line above, I have reached equal performance. I still hope for a native gmpy2 power improvement to make things even better!
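A plausible reading of why this rewrite helps both libraries: `p_mn` was computed at the reduced precision `pprec`, so it carries far fewer significant bits than `f_one - p_mn`, which is a full `wp`-bit number close to 1. The two forms are algebraically identical, but the multiplications are not equally expensive:

```python
# both compute zeta_prod * (1 - p_mn); the difference is operand size
zeta_prod *= f_one - p_mn                   # (wp-bit) x (wp-bit) multiply
zeta_prod = zeta_prod - zeta_prod * p_mn    # (wp-bit) x (pprec-bit) multiply + subtract
```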

casevh commented 3 years ago

Nice work. Once you reach a "large enough" precision, it usually doesn't make much difference whether you implement an algorithm in Python or C. The MPFR library uses a similar approach to mpmath's.

The improvement I was considering was to try to factor the exponent by 2, 3, or 5; when that works, the base can be raised to the corresponding power of 2 (or 3 or 5) as long as it still remains small enough to fit in an unsigned long. I've been too busy the past couple of weeks to look at it.
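A sketch of that idea (a hypothetical helper; an unsigned long is assumed to be 64 bits):

```python
def shrink_exponent(base, exp, limit=2**64 - 1):
    """Rewrite base ** exp as new_base ** new_exp with a smaller exponent:
    peel factors of 2, 3, or 5 off the exponent while the correspondingly
    raised base still fits in an unsigned long."""
    changed = True
    while changed:
        changed = False
        for f in (2, 3, 5):
            if exp % f == 0 and base ** f <= limit:
                base, exp = base ** f, exp // f
                changed = True
    return base, exp

print(shrink_exponent(3, 300000))   # (1853020188851841, 9375), i.e. (3**32) ** 9375
```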

wjamesjr commented 3 years ago

I'm good for now, no need to rush a solution. I am refactoring my Bernoulli code because of the changed line:

zeta_prod = zeta_prod - zeta_prod * p_mn

My parallel strategy was initially binary splitting; now I have a solution that breaks a list of primes into four chunks and allocates them to four processes. I have a pure mpmath version working and it is about 4.5-5 times faster than the bernoulli call in mpmath. And you are right: for large values, the Python overhead is a minimal part of the run time. I am working on the gmpy2 version now, hoping to get similar results.
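A minimal sketch of that chunking strategy (hypothetical names and small demo parameters; the per-prime precision reduction from the test script is omitted, and mpf results are assumed to pickle cleanly between processes):

```python
from multiprocessing import Pool
import mpmath as mpm
from mpmath import mpf

def primes_less_than(n):
    # simple Eratosthenes stand-in for the faster sieve in the first post
    sieve = [True] * n
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = [False] * len(sieve[p * p::p])
    return [p for p in range(2, n) if sieve[p]]

def partial_zeta_prod(args):
    # product of (1 - p**-n) over one chunk of primes, at wp bits
    chunk, n, wp = args
    mpm.mp.prec = wp
    prod = mpf(1)
    for p in chunk:
        prod *= 1 - mpf(p) ** -n
    return prod

if __name__ == "__main__":
    n, wp = 3000, 43000                           # small demo values
    primes = primes_less_than(500)
    chunks = [primes[i::4] for i in range(4)]     # four interleaved chunks
    with Pool(4) as pool:
        parts = pool.map(partial_zeta_prod, [(c, n, wp) for c in chunks])
    mpm.mp.prec = wp
    zeta = 1 / (parts[0] * parts[1] * parts[2] * parts[3])
    print(mpm.nstr(zeta, 20))
```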

wjamesjr commented 3 years ago

I tested my latest gmpy2 Bernoulli code on the 10 millionth Bernoulli number last night. It completed in 6.5 hours; not bad. When Mathematica broke the Bernoulli record with this number in 2008, it took them almost exactly 6 days. Given that I am working on a quad-core 2013 MacBook, I am quite pleased with the results. If I had a faster laptop I would try for 100,000,000, the last published record that I am aware of, but I'm not willing to sacrifice my laptop to 30+ days of fan-screaming torture at this point. Thank you for the gmpy2 development, and to the GMP developers as well. What a monster tool you all have developed.

casevh commented 3 years ago

I have a 12-core server with some idle time. :)

wjamesjr commented 3 years ago

I uploaded the entire code directory that I test from. I am trying to create a working environment around gmpy2 that maximizes multi-processor use. So far, I have implemented multi-processor functions for pi, factorial, bernoulli (and stirling from bernoulli), e, catalan, euler, ln2, ln10, and apery. Of the two Bernoulli functions, bernoulli is locked to four processes and is the one I computed bernoulli(10000000) with. This morning I created a new version (bernoulli_parallel) that will use all available cores. The number of cores is set on line 95 and the Bernoulli number is set on line 135. You may want to limit the number of cores, since it will otherwise take over the entire computer. Feel free to experiment and use the code any way you want; I do this just for fun and place no restrictions on how anyone uses it. Let me know how it goes and whether you find problems or areas to improve.

bflgmp.zip