Closed ckormanyos closed 1 year ago
Maybe worth reading: https://danlark.org/2020/06/14/128-bit-division/
Maybe worth reading...
Indeed it is. Thank you for providing this information.
This has also motivated a new issue #13 to potentially specialize mul/div for 128-bit uint with 32-bit limbs
Although only partly related to this issue, general rework and cleanup of the classic division algorithm has been mostly done in 2e9e8ce52a0cd619c96bf3c492685d017b223cb1
Based on discussions in #362, in particular here, this comment reminds of this old issue that is intended to track potential division optimization(s).
Thninking out loud...
I wonder if is worth-it to try unrolling four-by-n or eight-by-n division. This might lead to a lot more written code depth, but maybe not much reward in the area of speedup. Or perhaps this kind of optimization might yield more significant improvements.
Cc: @maksymbevza and @johnmcfarlane
It's really up to you, but you might want to look into the extent to which -O3 can unroll your loops. It's generally tidier to leave this kind of thing up to the optimiser. That means the user gets to choose: minor speedups at the expense of code size, or something a little more balanced...
A little bit of handwaving and some of the estimations from my experience.
When doing performance optimizations, I've seen that on the Intel x86_64 architecture (before Skylake) a lot of time is spent just waiting for div/idiv operation to finish, around 50%. Given a comparison of the speed for Intel processors, the speed of multiplication has a latency of 1 clock cycle, while division is 30-80 (64bit registers). With Skylake+ it has been decreased to ~20 and even ~10 or so for Adler Lake.
Given this, I suppose the unrolling for Intel processors might not have that much sense unless used with Adler Lake. Please correct me if I'm wrong.
However, since the library is multiplatform, I suppose architectures like ARM and ESP32's processors are also considered. With this forum post, I see that multiplication is just 2 cycles and division is 4 cycles for ESP32. I believe this is for 16-bit or 32-bit registers. Removing all of the rest of the for-loop instruction and invocation burden can potentially help to speed it up for those processors.
The second most time-consuming (~10%) part in the Knuth division I spotted was surprisingly multiply-and-subtract. I used -O2 so, no for-loops unrolled. Maybe this is an interesting point for optimization or unrolling in the division.
Oh, I just read the article in the second comment, where they optimized 128-bit by 64-bit division by manually implementing division in this case. Maybe it's also useful for 64 by 32 bit optimization for other platforms.
Benchmark Time(ns) CPU(ns)
BM_DivideClass128UniformDivisor 13.8 13.8 // 128 bit by 128 bit
BM_DivideClass128SmallDivisor 168 168 // 128 bit by 64 bit
into
Benchmark Time(ns) CPU(ns)
BM_DivideClass128UniformDivisor 11.9 11.9
BM_DivideClass128SmallDivisor 26.6 26.6
This can significantly speed up by just replacing division in a single place, from what I remember from the code.
Hi Maksym (@maksymbevza)
Many, many thanks for your very insightful comments.
With great enthusiasm I read your previous comments and these all make sense to me.
Division is a thing that I know I need to work on, ... but I sigh a breath of stationary relief that is works and has been tested on billions of cases.
I can't tell you how many times I found small defects in long division until it worked. Well it was a lot.
So I'm generally interesed in all your points, also the case(s) of division n / (1/2) n
.
If, when, and, ... if I don't break anything else, I'm going to look into these.
Of course, long Knuth division will never reach logarithmic or smaller order complexity and I've not written it to use any Karatsuba. But i could. Again, a long-term interest.
After long deliberations, I've closed this. The wide_integer library is probably not going to go up to hundreds of thousands or millions of decimal deigts. So we stick with Knuth long division. The experienceof #378 and #379 show us that long division is implemented quite will in master. So we leave this one alone.
Division uses Knuth long division routine. Is it possible to implement a faster division scheme?