faster-cpython / ideas


Faster power? #484

Closed: jneb closed this 2 years ago

jneb commented 2 years ago

I wrote a faster power routine for Python 3.8 that was rejected at the time with the comment "too many lines of code, not enough speedup". But times have changed. See https://github.com/jneb/cpython/tree/38_fast_pow The idea: currently, small powers are done with binary powering, and large ones with a base-32 method. My routine instead chooses a base from 2 up to 64, based on the size and bit pattern of the exponent. It is up to 15% faster for "intermediate" powers, and about 2% faster for large powers. Of course, it probably needs adjusting for 3.11.
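
To make the contrast concrete, here is a minimal pure-Python sketch of both strategies (illustrative only; the real implementation is C in Objects/longobject.c, and the `k` parameter here plays the role of the "base 2**k" mentioned above):

```python
def pow_binary(base, exp):
    """Left-to-right binary (square-and-multiply) powering."""
    result = 1
    for bit in bin(exp)[2:]:       # exponent bits, most significant first
        result *= result           # square at every step
        if bit == "1":
            result *= base         # multiply when the bit is set
    return result

def pow_fixed_window(base, exp, k=5):
    """2**k-ary powering: precompute base**0 .. base**(2**k - 1), then
    consume the exponent k bits at a time (k=5 is the base-32 case)."""
    table = [1]
    for _ in range((1 << k) - 1):
        table.append(table[-1] * base)
    digits = []
    while exp:                     # exponent digits in base 2**k, low first
        digits.append(exp & ((1 << k) - 1))
        exp >>= k
    result = 1
    for digit in reversed(digits):  # most significant digit first
        for _ in range(k):
            result *= result        # k squarings per digit
        if digit:
            result *= table[digit]
    return result

assert pow_binary(3, 1000) == pow_fixed_window(3, 1000) == 3 ** 1000
```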

jneb commented 2 years ago

I found out I had already updated it to 3.10 (and had forgotten), so that saves some work: https://github.com/python/cpython/pull/24206

jneb commented 2 years ago

For completeness, here's the relevant issue: https://bugs.python.org/issue42911 It ends with:

> Thank you for the suggestion and PR, but I'm going to close here; for me this ends up on the wrong side of the performance / complexity trade-off.

Since the performance/complexity line has moved, please check which side we are on now.

gvanrossum commented 2 years ago

@MarkDickinson

sobolevn commented 2 years ago

I think you meant @mdickinson: https://github.com/mdickinson :)

mdickinson commented 2 years ago

Thanks for the ping, @jneb. My opinion hasn't changed since the https://github.com/python/cpython/issues/87077 discussion - for me, the extra complexity isn't worth it. This seems too niche an operation to be focusing on as part of the faster-cpython efforts, so I don't think that changes anything. I also unfortunately don't have the bandwidth to shepherd such a change through.

That said, this is really Tim Peters' code, so if you can interest @tim-one in the changes then that could be a path forward.

tim-one commented 2 years ago

The code in 3.11 is different in several respects:

None of those changes really increased the complexity of the code, though: they are equivalent ways of doing much the same things, but more efficiently.

There's some potential in the code as-is for dynamically picking a maximum window size (the larger the exponent, the better a wider window "in theory"), but I didn't see any speedups for "sane" exponent sizes worth the extra bother (and so the max window size is fixed at 5 bits - but I didn't explore using smaller max window sizes for smaller exponents).
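
As a rough illustration of what a dynamic choice could look like (the cutoffs below are invented for this sketch, not tuned values):

```python
def pick_window_bits(exp: int) -> int:
    """Pick a window width k from the exponent's size (illustrative
    cutoffs only). A sliding window of k bits costs roughly 2**(k-1)
    table-building multiplies, so a wider window only pays off once
    the exponent has enough bits to amortize that precomputation."""
    n = exp.bit_length()
    if n <= 32:
        return 1    # plain square-and-multiply for small exponents
    if n <= 256:
        return 3
    if n <= 2048:
        return 4
    return 5        # the fixed maximum in current CPython
```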

Attempts to go beyond this will likely be a hard sell. The code is already quite involved, and it has been tuned over the course of years. Any such attempt should redo the timings and, if the results seem worth pursuing, be presented as a GitHub pull request. But speedups under, say, 10% probably aren't worth the bother unless they come from trivial code changes.
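
For anyone redoing the timings, a minimal harness along these lines would do (the modulus and repeat counts are arbitrary choices for this sketch; three-argument pow keeps intermediate results bounded so large exponents stay feasible):

```python
import timeit

M = (1 << 2048) - 159              # arbitrary 2048-bit odd modulus

for bits in (8, 64, 256, 1024, 4096):
    e = (1 << bits) - 1            # all-ones exponent: worst case for binary powering
    t = timeit.timeit("pow(3, e, M)", globals={"e": e, "M": M}, number=200)
    print(f"{bits:>5}-bit exponent: {t * 1000 / 200:.3f} ms/call")
```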

jneb commented 2 years ago

So basically several of the improvements I made (specifically the sliding window) are implemented now. Good to hear. To be honest, the remaining optimization is quite marginal, as it only affects the "midsized" exponents. Most exponents in practice are either very small (which is already fast) or quite large (for cryptography), so midsized ones are rare. Unless somebody comes up with an often-used application that needs exponents of 10-30 digits, I agree that the extra effect of addition chains won't amount to much. So, even though it would be really cool to see some of my code in the Python code base :-), let's leave it where it is now.
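
For reference, the textbook example of an addition chain beating binary powering is x**15: the chain 1, 2, 3, 6, 12, 15 needs five multiplications, while square-and-multiply on 0b1111 needs six (a toy illustration, not the code from the PR):

```python
def pow15_chain(x):
    """x**15 in 5 multiplications via the chain 1, 2, 3, 6, 12, 15."""
    x2 = x * x       # 1 + 1 = 2
    x3 = x2 * x      # 2 + 1 = 3
    x6 = x3 * x3     # 3 + 3 = 6
    x12 = x6 * x6    # 6 + 6 = 12
    return x12 * x3  # 12 + 3 = 15

assert pow15_chain(7) == 7 ** 15
```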