Closed reunanen closed 6 months ago
A small observation: this change requires five multiplication operations for the product, while other implementations (like clipper1 or Emulate64x64to128) need only four.
@tomwiel Right – and that's because carry is calculated separately, which is leftover from a previous version that bailed out after a straightforward (carry-ignoring) multiplication, in case the result was not equal. Good catch, let me try to improve on this front...
> I can't say that I fully understand why exactly
I think I've finally gotten my head around this ...
```
a * b ==>
  split a and b into upper and lower 32 bits
(aHi + aLo) * (bHi + bLo) ==>
(aHi * bHi) + (aHi * bLo) + (aLo * bHi) + (aLo * bLo)   [ie 4 multiplies]

1. aHi * bHi: XXXXXXXX00000000 * XXXXXXXX00000000
2. aHi * bLo: XXXXXXXX00000000 * 00000000XXXXXXXX
3. aLo * bHi: 00000000XXXXXXXX * XXXXXXXX00000000
4. aLo * bLo: 00000000XXXXXXXX * 00000000XXXXXXXX

   {  overflow bits  }
1. XXXXXXXX XXXXXXXX 00000000 00000000 +
2. 00000000 XXXXXXXX XXXXXXXX 00000000 +
3. 00000000 XXXXXXXX XXXXXXXX 00000000 +
4. 00000000 00000000 XXXXXXXX XXXXXXXX
```
Given bit shifting to keep each multiplication within 64 bits, the overflow bits equal:
a. all of (1), PLUS
b. the upper 32 bits of both (2) and (3), PLUS
c. the overflow of the addition of (4) and the lower 32 bits of both (2) and (3)

note: the overflow of the addition in c. will be between 0 and 2.
What I hadn't appreciated until now is that the addition in c. above can potentially overflow by 2.
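If it helps, the steps above can be sketched as a small standalone function. This is my own illustrative version, not the PR's actual code; names are made up, and here `aHi`/`bHi` are already shifted down 32 bits:

```cpp
#include <cstdint>

// Illustrative sketch of steps a-c above: returns the overflow
// (the high 64 bits) of a * b using four 32x32-bit multiplies.
uint64_t MulOverflow64(uint64_t a, uint64_t b) {
    const uint64_t aLo = a & 0xFFFFFFFF, aHi = a >> 32;  // note: aHi/bHi here
    const uint64_t bLo = b & 0xFFFFFFFF, bHi = b >> 32;  // are shifted down
    const uint64_t p1 = aHi * bHi;  // (1): all of it is overflow
    const uint64_t p2 = aHi * bLo;  // (2): its upper 32 bits are overflow
    const uint64_t p3 = aLo * bHi;  // (3): its upper 32 bits are overflow
    const uint64_t p4 = aLo * bLo;  // (4): only its carry matters
    // step c: the carry out of adding (4) to the low halves of (2) and (3);
    // three 32-bit-sized addends, so this carry can be 0, 1 or 2.
    const uint64_t c = ((p2 & 0xFFFFFFFF) + (p3 & 0xFFFFFFFF) + (p4 >> 32)) >> 32;
    return p1 + (p2 >> 32) + (p3 >> 32) + c;
}
```

The point of step c is exactly the subtlety discussed here: a single-bit carry is not enough, since the sum of those three terms can carry out by 2.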
> This change leads to five mult-operations for the product

This should be fixed now.
> I can't say that I fully understand why exactly
I think I've finally gotten my head around this ...
```
// aLo = a & 0xFFFFFFFF;
// aHi = a & 0xFFFFFFFF00000000;
// bLo = b & 0xFFFFFFFF;
// bHi = b & 0xFFFFFFFF00000000;
// a * b == (aHi + aLo) * (bHi + bLo)
// a * b == (aHi * bHi) + (aHi * bLo) + (aLo * bHi) + (aLo * bLo)
// (aHi * bHi) => up to 128bits where bottom 64bits must be 0
// (aHi * bLo) and (bHi * aLo) => up to 96bits where bottom 32bits must be 0
// (aLo * bLo) => up to 64bits
// 64bit overflow carry of a * b consists of
// 1. all of (aHi * bHi) PLUS
// 2. the upper 32bits of both (aHi * bLo) and (bHi * aLo) PLUS
// 3. 0 - 2: the overflow of ((aHi * bLo)<<32) + ((bHi * aLo)<<32) + (aLo * bLo)
```
What I hadn't appreciated until now is that ((aHi * bLo)<<32) + ((bHi * aLo)<<32) + (aLo * bLo) can potentially overflow by 2.
Yes, just realized that much later than you. 😅
And found out that this `extra_carry` can be computed as:

```c
uint64_t extra_carry = (((aHiShr * bLo) & 0xFFFFFFFF) +
                        ((bHiShr * aLo) & 0xFFFFFFFF) +
                        ((aLo * bLo) >> 32)) >> 32;
```
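For what it's worth, here's a hypothetical standalone check (the function name and wrapper are mine, not the PR's; `aHiShr`/`bHiShr` assumed to be `a >> 32` and `b >> 32`) showing a concrete input pair where that sum really does carry out by 2:

```cpp
#include <cstdint>

// Carry out of summing the low halves of the two cross products
// and the high half of the low product; its range is 0..2.
uint64_t ExtraCarry(uint64_t a, uint64_t b) {
    const uint64_t aLo = a & 0xFFFFFFFF, aHiShr = a >> 32;
    const uint64_t bLo = b & 0xFFFFFFFF, bHiShr = b >> 32;
    return (((aHiShr * bLo) & 0xFFFFFFFF) +
            ((bHiShr * aLo) & 0xFFFFFFFF) +
            ((aLo * bLo) >> 32)) >> 32;
}
```

With a == b == 0x1FFFFFFFF (i.e. 2^33 - 1), all three addends are close to 0xFFFFFFFF, so the sum exceeds 2^33 and `ExtraCarry` returns its maximum of 2.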
> And found out that that `extra_carry` can be computed as:
Neat. Thanks. And I've just rewritten my overflow explanation above so I hope it's less confusing about when aHi etc are shifted.
Add test cases provided by @azrafe7, and update carry calculation to a method that seems to actually work.
I can't say that I fully understand why exactly, but I too compared this against the naive method (using a native uint128 type): this approach matches it, whereas the previous one did not always give exactly the same results.
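A minimal sketch of such a cross-check, assuming a GCC/Clang-style `unsigned __int128` as the reference (function names are illustrative, not from the PR):

```cpp
#include <cstdint>

// Reference high-half multiply via a native 128-bit type.
// NOTE: unsigned __int128 is a GCC/Clang extension, not standard C++.
uint64_t MulHighRef(uint64_t a, uint64_t b) {
    return (uint64_t)(((unsigned __int128)a * b) >> 64);
}

// Split 32x32 method, structured as discussed above
// (aHi/bHi already shifted down 32 bits).
uint64_t MulHighSplit(uint64_t a, uint64_t b) {
    const uint64_t aLo = a & 0xFFFFFFFF, aHi = a >> 32;
    const uint64_t bLo = b & 0xFFFFFFFF, bHi = b >> 32;
    const uint64_t extra_carry = (((aHi * bLo) & 0xFFFFFFFF) +
                                  ((bHi * aLo) & 0xFFFFFFFF) +
                                  ((aLo * bLo) >> 32)) >> 32;
    return aHi * bHi + ((aHi * bLo) >> 32) + ((bHi * aLo) >> 32) + extra_carry;
}
```

Comparing the two over edge cases (all-ones operands, values straddling the 32-bit boundary, etc.) is an easy way to gain confidence in the split version on platforms where the 128-bit type is available.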