Closed reunanen closed 6 months ago
A small observation: this change requires five multiplication operations for the product, while other implementations (like clipper1 or Emulate64x64to128) need only four.
@tomwiel Right – and that's because carry is calculated separately, which is leftover from a previous version that bailed out after a straightforward (carry-ignoring) multiplication, in case the result was not equal. Good catch, let me try to improve on this front...
> I can't say that I fully understand why exactly
I think I've finally gotten my head around this ...
```
a * b ==>
  split a and b into upper and lower 32 bits
(aHi + aLo) * (bHi + bLo) ==>
(aHi * bHi) + (aHi * bLo) + (aLo * bHi) + (aLo * bLo)   [ie 4 multiplies]

1. aHi * bHi: XXXXXXXX00000000 * XXXXXXXX00000000
2. aHi * bLo: XXXXXXXX00000000 * 00000000XXXXXXXX
3. aLo * bHi: 00000000XXXXXXXX * XXXXXXXX00000000
4. aLo * bLo: 00000000XXXXXXXX * 00000000XXXXXXXX

   {  overflow bits  }
1. XXXXXXXX XXXXXXXX 00000000 00000000 +
2. 00000000 XXXXXXXX XXXXXXXX 00000000 +
3. 00000000 XXXXXXXX XXXXXXXX 00000000 +
4. 00000000 00000000 XXXXXXXX XXXXXXXX
```
Given bit shifting to keep each multiplication within 64 bits, the overflow bits equal:
a. all of (1), PLUS
b. the upper 32 bits of both (2) and (3), PLUS
c. the overflow of the addition of (4) and the lower 32 bits of both (2) and (3)

note: the overflow of the addition in c. will be between 0 and 2.
What I hadn't appreciated until now is that the addition in c. above can potentially overflow by 2.
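If it helps, the steps above can be sketched as a small standalone function. This is my own illustrative version, not the PR's actual code; names are made up, and here `aHi`/`bHi` are already shifted down 32 bits:

```cpp
#include <cstdint>

// Illustrative sketch of steps a-c above: returns the overflow
// (the high 64 bits) of a * b using four 32x32-bit multiplies.
uint64_t MulOverflow64(uint64_t a, uint64_t b) {
    const uint64_t aLo = a & 0xFFFFFFFF, aHi = a >> 32;  // note: aHi/bHi here
    const uint64_t bLo = b & 0xFFFFFFFF, bHi = b >> 32;  // are shifted down
    const uint64_t p1 = aHi * bHi;  // (1): all of it is overflow
    const uint64_t p2 = aHi * bLo;  // (2): its upper 32 bits are overflow
    const uint64_t p3 = aLo * bHi;  // (3): its upper 32 bits are overflow
    const uint64_t p4 = aLo * bLo;  // (4): only its carry matters
    // step c: the carry out of adding (4) to the low halves of (2) and (3);
    // three 32-bit-sized addends, so this carry can be 0, 1 or 2.
    const uint64_t c = ((p2 & 0xFFFFFFFF) + (p3 & 0xFFFFFFFF) + (p4 >> 32)) >> 32;
    return p1 + (p2 >> 32) + (p3 >> 32) + c;
}
```

The point of step c is exactly the subtlety discussed here: a single-bit carry is not enough, since the sum of those three terms can carry out by 2.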
> This change leads to five mult-operations for the product

This should be fixed now.
> I can't say that I fully understand why exactly
I think I've finally gotten my head around this ...
```
// aLo = a & 0xFFFFFFFF;
// aHi = a & 0xFFFFFFFF00000000;
// bLo = b & 0xFFFFFFFF;
// bHi = b & 0xFFFFFFFF00000000;
// a * b == (aHi + aLo) * (bHi + bLo)
// a * b == (aHi * bHi) + (aHi * bLo) + (aLo * bHi) + (aLo * bLo)
// (aHi * bHi) => up to 128bits where bottom 64bits must be 0
// (aHi * bLo) and (bHi * aLo) => up to 96bits where bottom 32bits must be 0
// (aLo * bLo) => up to 64bits
// 64bit overflow carry of a * b consists of
// 1. all of (aHi * bHi) PLUS
// 2. the upper 32bits of both (aHi * bLo) and (bHi * aLo) PLUS
// 3. 0 - 2: the overflow of ((aHi * bLo)<<32) + ((bHi * aLo)<<32) + (aLo * bLo)
```
What I hadn't appreciated until now is that ((aHi * bLo)<<32) + ((bHi * aLo)<<32) + (aLo * bLo) can potentially overflow by 2.
Yes, just realized that much later than you. 😅
And found out that this `extra_carry` can be computed as:

```c
uint64_t extra_carry = (((aHiShr * bLo) & 0xFFFFFFFF) +
                        ((bHiShr * aLo) & 0xFFFFFFFF) +
                        ((aLo * bLo) >> 32)) >> 32;
```
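For what it's worth, here's a hypothetical standalone check (the function name and wrapper are mine, not the PR's; `aHiShr`/`bHiShr` assumed to be `a >> 32` and `b >> 32`) showing a concrete input pair where that sum really does carry out by 2:

```cpp
#include <cstdint>

// Carry out of summing the low halves of the two cross products
// and the high half of the low product; its range is 0..2.
uint64_t ExtraCarry(uint64_t a, uint64_t b) {
    const uint64_t aLo = a & 0xFFFFFFFF, aHiShr = a >> 32;
    const uint64_t bLo = b & 0xFFFFFFFF, bHiShr = b >> 32;
    return (((aHiShr * bLo) & 0xFFFFFFFF) +
            ((bHiShr * aLo) & 0xFFFFFFFF) +
            ((aLo * bLo) >> 32)) >> 32;
}
```

With a == b == 0x1FFFFFFFF (i.e. 2^33 - 1), all three addends are close to 0xFFFFFFFF, so the sum exceeds 2^33 and `ExtraCarry` returns its maximum of 2.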
> And found out that that `extra_carry` can be computed as:
Neat. Thanks. And I've just rewritten my overflow explanation above so I hope it's less confusing about when aHi etc are shifted.
Add test cases provided by @azrafe7, and update carry calculation to a method that seems to actually work.
I can't say that I fully understand why exactly, but I too compared this against the naive method (using a native uint128 type): this approach matches it, whereas the previous one did not always give exactly the same results.
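A minimal sketch of such a cross-check, assuming a GCC/Clang-style `unsigned __int128` as the reference (function names are illustrative, not from the PR):

```cpp
#include <cstdint>

// Reference high-half multiply via a native 128-bit type.
// NOTE: unsigned __int128 is a GCC/Clang extension, not standard C++.
uint64_t MulHighRef(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

// Split 32x32 method, structured as discussed above
// (aHi/bHi already shifted down 32 bits).
uint64_t MulHighSplit(uint64_t a, uint64_t b) {
    const uint64_t aLo = a & 0xFFFFFFFF, aHi = a >> 32;
    const uint64_t bLo = b & 0xFFFFFFFF, bHi = b >> 32;
    const uint64_t extra_carry = (((aHi * bLo) & 0xFFFFFFFF) +
                                  ((bHi * aLo) & 0xFFFFFFFF) +
                                  ((aLo * bLo) >> 32)) >> 32;
    return aHi * bHi + ((aHi * bLo) >> 32) + ((bHi * aLo) >> 32) + extra_carry;
}
```

Comparing the two over edge cases (all-ones operands, values straddling the 32-bit boundary, etc.) is an easy way to gain confidence in the split version on platforms where the 128-bit type is available.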