AngusJohnson / Clipper2

Polygon Clipping and Offsetting - C++, C# and Delphi
Boost Software License 1.0

FP exception in GetIntersectPoint() #317

Closed bahvalo closed 1 year ago

bahvalo commented 1 year ago

Executing the following code results in a floating point exception in the function Point64 GetIntersectPoint(const Active& e1, const Active& e2). Specifically, a double value being cast to int64_t turns out to be outside the range of int64_t.

The coordinates of the polygon vertices are within the range -4.6e18 ... 4.6e18. According to the documentation, that is a valid input.
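The cast alone is enough to trigger the trap; here is a minimal snippet of my own, separate from the full test case below (casting a double outside the int64_t range is undefined behaviour in C++, and on x86 it raises FE_INVALID, which feenableexcept() turns into a SIGFPE):

// Minimal illustration of the trap itself (not Clipper2 code).
#include <cstdint>
#include <cstdio>
#include <fenv.h>   // feenableexcept() is a glibc extension

int main() {
    feenableexcept(FE_INVALID);
    double d = 1.0e19;                      // > INT64_MAX (~9.22e18)
    int64_t i = static_cast<int64_t>(d);    // UB; raises FE_INVALID -> SIGFPE here
    std::printf("%lld\n", (long long)i);    // never reached
}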

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <fenv.h>
#include <glob.h>
#include "clipper2/clipper.h"
using namespace Clipper2Lib;

#define ADD(A,X,Y) A.push_back(Point64(int64_t(X), int64_t(Y)))

int main(int, char**) {
    feenableexcept( FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW );

    Path64 a, b;
    ADD(a,          -7009388980LL,     1671079308836362LL);
    ADD(a,  -576460711222920704LL,     1671063793410802LL);
    ADD(a,  -864690986154904320LL,  -286977111392363424LL);
    ADD(a, -1152921411648951168LL,  -575625207815834176LL);
    ADD(a,  -864691057648438912LL,  -864273328105026304LL);
    ADD(a,  -576460762799345152LL, -1152921434302619648LL);
    ADD(a,         -12880478468LL, -1152921451122536832LL);
    ADD(a,   576460787720869760LL, -1152921504606846976LL);
    ADD(a,   864691268131862400LL,  -864273433561362176LL);
    ADD(a,  1152921504606846848LL,  -575625211080225088LL);
    ADD(a,   864691062448491520LL,  -286977010832613792LL);
    ADD(a,   576460654512679296LL,     1671130833237401LL);

    ADD(b,   -54234486065476976LL,  1151250415789406208LL);
    ADD(b,  -612617068314194048LL,  1152921504606846976LL);
    ADD(b,  -864691197085273984LL,   867615548203339008LL);
    ADD(b, -1152921504606846848LL,   578967376389736064LL);
    ADD(b,  -864691140504451968LL,   290319223463759488LL);
    ADD(b,  -576460711222912256LL,     1671063793410802LL);
    ADD(b,          -7009388980LL,     1671079308836362LL);
    ADD(b,   576460654512679296LL,     1671130833237401LL);
    ADD(b,   864690908098943872LL,   290319324023499392LL);
    ADD(b,  1100313577101746304LL,   576428327042791872LL);
    ADD(b,   785779376874207360LL,   863806873623168128LL);
    ADD(b,   487696647658790656LL,  1150382388220066304LL);

    const int m = 10;
    for(size_t i=0; i<a.size(); i++) { a[i].x /= m; a[i].y /= m; b[i].x /= m; b[i].y /= m; }

    Paths64 AA; AA.push_back(a);
    Paths64 BB; BB.push_back(b);
    Paths64 solution = Intersect(Paths64(AA), Paths64(BB), FillRule::NonZero);
}

The division by ten is not essential; the exception occurs without it as well.
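For what it's worth, a guarded conversion along the lines below (just my sketch, not a proposed patch) shows the kind of range check that would avoid the trap:

// Sketch of a saturating double -> int64_t cast; names and the NaN policy are my own choice.
#include <cstdint>
#include <cmath>
#include <limits>

inline int64_t SaturatingCast(double d)
{
  if (std::isnan(d)) return 0;                 // pick whatever NaN policy suits
  if (d >= 9223372036854775808.0)              // >= 2^63: would overflow
    return std::numeric_limits<int64_t>::max();
  if (d < -9223372036854775808.0)              // < -2^63: would overflow
    return std::numeric_limits<int64_t>::min();
  return static_cast<int64_t>(std::nearbyint(d));
}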

alexisnaveros commented 1 year ago

@alexisnaveros Please consider the following test:

  l0p0x = 0;
  l0p0y = 0;
  l0p1x = INT64_MAX/4;
  l0p1y = INT64_MAX/4;
  l1p0x = some_small_value;
  l1p0y = 0;
  l1p1x = INT64_MAX/4-some_small_value;
  l1p1y = INT64_MAX/4;   

Here small_value may be 1000, 100, 10, 1.

@bahvalo Eheh yup, that's a tricky case. I was less concerned about the very exact intersection point when both segments are monstrously huge... (getting a hit "outside" the bounding box of a tiny edge was more worrying to me)

But let's see, here small_value is 1000!

We should get : 1152921504606846975 1152921504606846975
            A : 1125899906842624000 1125899906842624000 ~ error 38214310035154632
            B : 1152921504606846976 1152921504606846976 ~ error 2
            C : 1125899906842624000 1125899906842624000 ~ error 38214310035154632
            D : 1125899906842624000 1125899906842624000 ~ error 38214310035154632
            E : 1152921504606846976 1152921504606846976 ~ error 2
            F : 1152921504606846975 1152921504606846975 ~ error 0
            G : 1125899906842624000 1125899906842624000 ~ error 38214310035154632
            H : 1152921504606846975 1152921504606846975 ~ error 0
          PB3 : 1152921504606846975 1152921504606846975 ~ error 0
        D-SSE : 1125899906842624000 1125899906842624000 ~ error 38214310035154632
        E-SSE : 1152921504606846976 1152921504606846976 ~ error 2
        F-SSE : 1152921504606846975 1152921504606846975 ~ error 0
        G-SSE : 1119144507401568256 1119144507401568256 ~ error 47767887543943296

I was expecting better from G(). The new H() combines the best of F() and G(); it makes two guesses before doing the actual computation.

But let's see. Here, small_value is 100!

We should get : 1152921504606846975 1152921504606846975
            A : 900719925474099200 900719925474099200 ~ error 356666893661443264
            B : 1152921504606846976 1152921504606846976 ~ error 2
            C : 900719925474099200 900719925474099200 ~ error 356666893661443264
            D : 900719925474099200 900719925474099200 ~ error 356666893661443264
            E : 1152921504606846976 1152921504606846976 ~ error 2
            F : 1152921504606846975 1152921504606846975 ~ error 0
            G : 900719925474099200 900719925474099200 ~ error 356666893661443264
            H : 1152921504606846975 1152921504606846975 ~ error 0
          PB3 : 1152921504606846975 1152921504606846975 ~ error 0
        D-SSE : 900719925474099200 900719925474099200 ~ error 356666893661443264
        E-SSE : 1152921504606846976 1152921504606846976 ~ error 2
        F-SSE : 1152921504606846975 1152921504606846975 ~ error 0
        G-SSE : 891712726219358208 891712726219358208 ~ error 369404997006494784

Still okay. Here, small_value is 10!

We should get : 1152921504606846975 1152921504606846975
            A : -1 -1 ~ error 1630477228166597888
            B : -1 -1 ~ error 1630477228166597888
            C : -1 -1 ~ error 1630477228166597888
            D : -1 -1 ~ error 1630477228166597888
            E : -1 -1 ~ error 1630477228166597888
            F : -1 -1 ~ error 1630477228166597888
            G : -1 -1 ~ error 1630477228166597888
            H : -1 -1 ~ error 1630477228166597888
          PB3 : 1152921504606846975 1152921504606846975 ~ error 0
        D-SSE : -1 -1 ~ error 1630477228166597888
        E-SSE : -1 -1 ~ error 1630477228166597888
        F-SSE : -1 -1 ~ error 1630477228166597888
        G-SSE : -1 -1 ~ error 1630477228166597888

Okay, yup, my functions just crash and burn, while your int128_t integer function gets it perfectly right.

EDIT: Updated the last table. Instead of printing uninitialized data, -1,-1 now means the call failed because det == 0.0. So functions F and H worked until the end, except when they think the two lines are collinear, because casting ln0a/ln0b/ln1a/ln1b to double lost some bits of precision. I think that can actually be fixed by catching the very rare det==0.0 case with some fallback plan...
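Roughly what I have in mind (a sketch only; ExactIntersect is a stand-in name for whichever slow exact routine ends up being used):

#include <cstdint>
#include <cmath>

struct Pt64 { int64_t x, y; };

// stand-in for a slow exact routine (e.g. an int128-based one); not shown here
bool ExactIntersect(Pt64 a, Pt64 b, Pt64 c, Pt64 d, Pt64& hit);

bool FastIntersect(Pt64 a, Pt64 b, Pt64 c, Pt64 d, Pt64& hit)
{
  double dx0 = double(b.x - a.x), dy0 = double(b.y - a.y);
  double dx1 = double(d.x - c.x), dy1 = double(d.y - c.y);
  double det = dx0 * dy1 - dy0 * dx1;
  if (det == 0.0)                            // truly parallel, or the casts lost bits
    return ExactIntersect(a, b, c, d, hit);  // rare: take the slow, exact path
  double t = (double(c.x - a.x) * dy1 - double(c.y - a.y) * dx1) / det;
  hit.x = a.x + int64_t(std::nearbyint(t * dx0));
  hit.y = a.y + int64_t(std::nearbyint(t * dy0));
  return true;
}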

AngusJohnson commented 1 year ago

Great! One group of tests successfully passed. Now one more case with a wrong result.

To solve your new test, it looks like I'll have to reduce the coordinate range from +/- MAX_INT64 /2 to +/- MAX_INT64 /4. And this is largely because of the imprecision of floating point calculations, especially with values close to MAX_COORD or MIN_COORD. Even with this reduced coordinate range, values close to those limits that are converted to type double will only be accurate to +/- 16. (And unfortunately the long double type only maps to double in MSVC++.) For example:

  static const double min_coord = static_cast<double>(MIN_COORD);
  double val = min_coord;
  (val + 11.0 == min_coord) && (val - 4.0 == min_coord) // ie both comparisons evaluate to true
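A quick way to see this coarseness directly (a throwaway snippet, independent of the library's actual MIN_COORD/MAX_COORD values):

#include <cmath>
#include <cstdio>

int main()
{
  for (double v : { 1.0e15, 1.0e17, 2.3e18 })
  {
    double ulp = std::nextafter(v, 2.0 * v) - v;   // gap to the next larger double
    std::printf("near %.2g the representable doubles are %g apart\n", v, ulp);
  }
}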
alexisnaveros commented 1 year ago

Before actually reading Clipper2, I really loved that it was all based on int64_t, and therefore the numerical accuracy would be constant everywhere within the int64_t range.

I later realized that's not quite true at the moment, but... if we state that only the difference between two points (of an edge) may occasionally be converted to double, and any int64 coordinate <-> double conversion is prohibited anywhere else in the math, then numerical accuracy is maintained everywhere within the int64_t range as long as edges aren't too long.

Bahvalo is throwing test cases that fail because the edges are just too darn long (like when it ends up with double det==0.0 in my intersection tests). But I think all these "edge too long" cases can be reasonably detected and handled with special paths, without too much trouble.

I'm not sure if you want that extra complexity, but if not, it's something I would like to implement on my end. I can detect when any conversion to double of a difference between two points has to throw away some bits, and fall back on a slower, more robust path.

AngusJohnson commented 1 year ago

Thanks Alexis.

As I've said above, my primary goal is that the library is robust. After that, my aim is to achieve optimum performance together with near perfect (ie non-rounded integer) accuracy for coordinates within a very generous range (currently looks like +/-1.0e15).

However, as this thread has amply demonstrated, some compromises in accuracy and/or performance are needed when coordinate values approach the current MAX_INT64/2 and MIN_INT64/2. And currently I see two likely use cases for the library:

  1. Those who don't need the full range, or are prepared to restrict coordinate values to a smaller range (say +/-1.0e15), where they can expect clipping solutions that have near perfect integer accuracy at the best possible performance.
  2. Those who need as much of the int64_t range as possible to achieve the best possible precision, and are prepared to sacrifice performance (within reason).

So some of the issues I'm trying to sort through now are:

  1. Can I determine (and document) the range for scenario 1 (fast_mode) above?
  2. What's the algorithm that will achieve best possible precision (ie precision_mode) without overly compromising performance, and without overly complicating the library, and how much of the int64_t range can be safely used?
  3. How best to switch between fast_mode and precision_mode (?? a preprocessor directive; see the sketch after this list)
  4. What happens in fast_mode when coordinate values exceed the specified optimum range (and how to manage this).
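For point 3, one conceivable shape of such a switch (purely illustrative; neither the macro name nor GetIntersectPoint_Exact is implied to exist in the library):

#ifdef CLIPPER2_HI_PRECISION
  Point64 ip = GetIntersectPoint_Exact(e1, e2);  // hypothetical slower, exact path
#else
  Point64 ip = GetIntersectPoint(e1, e2);        // existing fast, double-based path
#endif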
alexisnaveros commented 1 year ago

Right... Well, let's note that, if we assume that edges are short (max 2^53?), then using the whole INT64_MAX/2 range isn't an obstacle to getting exact results.

For example, just to pick GetClosestPointOnSegment() posted above:

[...]
return Point64(
  static_cast<int64_t>(nearbyint((1 - q) * seg1.x + q * seg2.x)),
  static_cast<int64_t>(nearbyint((1 - q) * seg1.y + q * seg2.y)));

Can become:

[...]
return Point64(
  seg1.x + static_cast<int64_t>( nearbyint( q * static_cast<double>( seg2.x - seg1.x ) ) ),
  seg1.y + static_cast<int64_t>( nearbyint( q * static_cast<double>( seg2.y - seg1.y ) ) ) );

And now that's accurate throughout the whole int64_t range, without breaking down when numbers are large (unless the difference between points exceeds 2^53). There are a couple other minor such changes that can be made throughout the code.

With these changes, the accuracy only begins to fall apart when a difference between two points cannot be stored in a double without cutting bits out. I imagine the code could end up with a bunch of:

int64_t diff = pt2.x - pt1.x;
if( ( diff <= -((int64_t)1<<53) ) || ( diff >= ((int64_t)1<<53) ) )
  return fallback_robust_slow_calculation_mode();
double diff_double = (double)diff;
// fast and easy mode, yay

Not very pretty, although the branch would be very predictable for datasets with small numbers. The branch could optionally be cut off by a compilation-time #if, for those who want "fast mode" even when using large numbers?

The remaining fun part is writing these fallback_robust_slow_calculation_mode(), eheh. Bahvalo has shown a good example with mathVertexLineLineIntersection_PB3(), although ideally without int128_t and __builtin_clzll().


EDIT: To test multiple differences in one go (to enter fallback_robust_slow_calculation_mode() or not), in a single easy-to-predict branch, a little optimization would be:

int64_t diffA = pt2.x - pt1.x;
int64_t diffB = pt2.y - pt1.y;
int64_t diffC = pt3.x - pt1.x;
int64_t diffD = pt3.y - pt1.y;
int64_t foobar = (int64_t)1<<53;
if( ( ( diffA+foobar ) | ( diffB+foobar ) | ( diffC+foobar ) | ( diffD+foobar ) ) & ( (int64_t)0x3ff << 54 ) )
  return fallback_robust_slow_calculation_mode();
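A quick standalone sanity check of that trick against the plain range test (they only disagree at exactly -2^53, which the masked form lets through):

#include <cstdint>
#include <cstdio>

static bool plain(int64_t d)  { return ( d <= -((int64_t)1<<53) ) || ( d >= ((int64_t)1<<53) ); }
static bool masked(int64_t d) { return ( ( d + ((int64_t)1<<53) ) & ( (int64_t)0x3ff << 54 ) ) != 0; }

int main()
{
  const int64_t k = (int64_t)1 << 53;
  for (int64_t d : { (int64_t)0, k - 1, k, -(k - 1), -k - 1, (int64_t)1 << 60 })
    std::printf("%20lld  plain=%d  masked=%d\n", (long long)d, (int)plain(d), (int)masked(d));
}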

(Sorry, I live in a world where the words "premature optimization" don't exist)

bahvalo commented 1 year ago

I'll have to reduce the coordinate range from +/- MAX_INT64 /2 to +/- MAX_INT64 /4.

This will not help, because in my example the coordinates are within the range +/- MAX_INT64/8. And the effect persists when we divide the coordinates by 32, i.e. when they fall within the range +/- MAX_INT64/256.

Restricting the range to +/- 1e+15 looks reasonable to me. My input data are double-precision floating point, and I need a double-precision floating point as the result, so this range is fine.

But I got FPE again.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <fenv.h>
#include <glob.h>
#include "clipper2/clipper.h"
using namespace Clipper2Lib;

#define ADD(A,X,Y) A.push_back(Point64(int64_t(X), int64_t(Y)))

int main(int, char**) {
    feenableexcept( FE_DIVBYZERO | FE_INVALID | FE_OVERFLOW );
    Path64 a,b;

    ADD(a,18750000000000LL,12499999999990LL);
    ADD(a,12500000000000LL,24999999999981LL);
    ADD(a,0LL,24999999999980LL);
    ADD(a,-12499999999996LL,24999999999979LL);
    ADD(a,-18749999999995LL,12499999999986LL);
    ADD(a,-24999999999991LL,-10LL);
    ADD(a,-18749999999990LL,-12500000000004LL);
    ADD(a,-12499999999986LL,-25000000000000LL);
    ADD(a,12LL,-24999999999997LL);
    ADD(a,12500000000008LL,-24999999999990LL);
    ADD(a,18750000000000LL,-12499999999990LL);
    ADD(a,25000000000000LL,1LL);
    ADD(b,18749999999991LL,12500000000008LL);
    ADD(b,12499999999990LL,25000000000000LL);
    ADD(b,-9LL,24999999999997LL);
    ADD(b,-12500000000006LL,24999999999995LL);
    ADD(b,-18750000000004LL,12500000000002LL);
    ADD(b,-24999999999999LL,5LL);
    ADD(b,-18749999999998LL,-12499999999988LL);
    ADD(b,-12499999999993LL,-24999999999983LL);
    ADD(b,5LL,-24999999999980LL);
    ADD(b,12500000000000LL,-24999999999972LL);
    ADD(b,18749999999992LL,-12499999999972LL);
    ADD(b,24999999999991LL,20LL);

      Paths64 AA; AA.push_back(a);
      Paths64 BB; BB.push_back(b);
      Paths64 solution = Intersect(Paths64(AA), Paths64(BB), FillRule::NonZero);

      printf("S = %25.15e\n", Area(solution));
}

This example was not constructed by hand.

AngusJohnson commented 1 year ago

But I got FPE again.

Yep, just about to upload the fix 😁. (And this one was a case of an old bug removed and a new one added 😱)

This example is not made by hand.

I wouldn't mind if it was. I'm just hoping to squash these remaining bugs.

bahvalo commented 1 year ago

Now the error of the intersection area is about 6e+22, which is quite large for initial data within a 1e+14 range.

AngusJohnson commented 1 year ago

Now the error of the intersection area is about 6e+22, which is quite large for initial data within a 1e+14 range.

Sorry, I don't follow?

bahvalo commented 1 year ago

My last example yields

S =     1.875060039840250e+27
S =     1.874999999997778e+27
S =     1.875347110620309e+27
S =     1.874999999997365e+27
S =     1.874999999997015e+27
S =     1.874999999995516e+27

Rounding the input coordinates can't account for that much. Assuming that the last result is correct, the error of the first intersection area is about 1.875060039840250e+27 - 1.874999999995516e+27 ≈ 6e+22. For initial data within a +/- 1e+14 range, I consider this result to be inaccurate.

alexisnaveros commented 1 year ago

I confirm that the problem is yet again the intersection test. The library currently implements the "A" solution and it's too inaccurate.

On a different note, if you would like to restrict the library's input range to 1<<52, then an SSE trick can be used to convert a pair of int64_t<->double about 3 times faster than a C/C++ (double)my_int64 or (int64_t)my_double would (total: 6 times faster). As you can imagine, that makes the SSE intrinsics really fly, finally.
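(The trick I mean is the well-known 1.5 * 2^52 magic-constant one; a textbook sketch of it is below, valid for magnitudes up to roughly 2^51 under the default round-to-nearest mode, and not necessarily the exact code I benchmarked:)

#include <emmintrin.h>   // SSE2

// int64 -> double for a pair of lanes; inputs must fit in roughly +/- 2^51
static inline __m128d pair_int64_to_double(__m128i x)
{
  const __m128d magic = _mm_set1_pd(0x0018000000000000);        // 1.5 * 2^52
  x = _mm_add_epi64(x, _mm_castpd_si128(magic));
  return _mm_sub_pd(_mm_castsi128_pd(x), magic);
}

// double -> int64 for a pair of lanes; same range restriction
static inline __m128i pair_double_to_int64(__m128d x)
{
  const __m128d magic = _mm_set1_pd(0x0018000000000000);        // 1.5 * 2^52
  x = _mm_add_pd(x, magic);
  return _mm_sub_epi64(_mm_castpd_si128(x), _mm_castpd_si128(magic));
}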


EDIT: Just inspected further, you are actually asking for the intersections between:

Line0: 12500000000000 24999999999981 to 18750000000000 12499999999990
Line1: 12499999999990 25000000000000 to 18749999999991 12500000000008
......
Line0: 18749999999991 12500000000008 to 24999999999991 20
Line1: 18750000000000 12499999999990 to 25000000000000 1

and so on.

These are difficult cases; they remind me of your earlier small_value tests. Even the current "fancy" intersection tests return points that are just slightly out of the "bounding box" of the lines (the area isn't affected much, but it's still wrong!).

EDIT 2: Okay, it's not enough for the input to be within 2^52 range for the floating point math to be accurate. There's further rounding going on in "det" and so on. Argh!

alexisnaveros commented 1 year ago

As this saga still doesn't want to end, and as I was really tired of never knowing what the proper answers were, I wrote an intersection test using my 1024-bit math library (all written in amd64 assembly, yarr).

The BN1024 results are absolutely correct. Here are the two cases above:

Line0: 12500000000000 24999999999981 to 18750000000000 12499999999990
Line1: 12499999999990 25000000000000 to 18749999999991 12500000000008
 -=Intersection =-
We should get : 18749999999910 12500000000170
            A : 18716242253290 12481001853171 ~ error 38736482110
            B : 18749067298232 12501865403526 ~ error 2085584355
            C : 18747573756714 12504808554489 ~ error 5385986510
            D : 18743051132788 12513688432033 ~ error 15351218882
            E : 18745878208877 12508243582235 ~ error 9216604938
            F : 18745878208877 12508243582235 ~ error 9216604938
            G : 18743051132788 12513688432033 ~ error 15351218882
            H : 18737564891724 12524870216543 ~ error 27805747213
          PB3 : 18749999999909 12500000000170 ~ error 1
       BN1024 : 18749999999910 12500000000170 ~ error 0

Only PB3 gets it right. The difference of 1 is due to a lack of rounding in PB3(), easily fixed.

Line0: 18749999999991 12500000000008 to 24999999999991 20
Line1: 18750000000000 12499999999990 to 25000000000000 1
 -=Intersection =-
We should get : 18749999999892 12500000000206
            A : 18800852960221 12516376449324 ~ error 53424822508
            B : 18747327264085 12505345471819 ~ error 5976418950
            C : 18750018496234 12499721864080 ~ error 278750461
            D : 18749999999892 12500000000206 ~ error 0
            E : 18744313563339 12511372873313 ~ error 12715258684
            F : 18744313563339 12511372873313 ~ error 12715258684
            G : 18750075295280 12499795466152 ~ error 217953148
            H : 18750006649306 12499986701378 ~ error 14868542
          PB3 : -1 -1 ~ error 22534695471676
       BN1024 : 18749999999892 12500000000206 ~ error 0

The intersection point between the two vectors is actually outside of the line0 segment; the segments don't intersect. PB3 returned a failure because this check fails:

    if( alpha_num < 0 || beta_num < 0 || alpha_num > denom || beta_num > denom )
      return 0;

and that's quite correct: the hit is actually outside the line0 range (alpha_num is negative).

sergey-239 commented 1 year ago

Meanwhile, I have been trying to speed up PB3:

__attribute__ ((always_inline))
static inline 
__int128 imul128 (int64_t a, int64_t b)
{
    __int128 res;
    asm (
    "imulq %2        ;;\n\t"
    :"=A"(res): "a" (a), "rm" (b));
    return res;
}

__attribute__ ((always_inline))
static inline 
int64_t idiv128 (__int128 num, int64_t denom)
{
    int64_t res;
    asm (
    "idiv %1        ;;\n\t"
    :"=a"(res): "rm" (denom), "A" (num));
    return res;
}

static inline int mathVertexLineLineIntersection_PB3_idiv( int64_t *hitpt, int64_t l0p0x, int64_t l0p0y, int64_t l0p1x, int64_t l0p1y, int64_t l1p0x, int64_t l1p0y, int64_t l1p1x, int64_t l1p1y )
{
  __int128     denom = imul128(l0p1x-l0p0x, l1p1y-l1p0y) - imul128(l0p1y-l0p0y, l1p1x-l1p0x);
  __int128 alpha_num = imul128(l1p0x-l0p0x, l1p1y-l1p0y) - imul128(l1p0y-l0p0y,l1p1x-l1p0x);
  __int128  beta_num = imul128(l1p0x-l0p0x, l0p1y-l0p0y) - imul128(l1p0y-l0p0y, l0p1x-l0p0x);

  if(denom==0) {
    if(alpha_num || beta_num) return 0; // segments lay on parallel lines
    // special case with segments on the same line -- may be considered if necessary
    return 0;
  }
  else {

    if(denom<0) { alpha_num=-alpha_num; beta_num=-beta_num; denom=-denom; }
    if(alpha_num<0 || beta_num<0 || alpha_num>denom || beta_num>denom) return 0;
    int64_t* high = (int64_t*)(&denom)+1;
    if(*high) {
      int shift = 65-__builtin_clzll(*high);
      alpha_num>>=shift; 
      denom>>=shift;
    }
    hitpt[0] = l0p0x+idiv128(imul128(alpha_num, l0p1x-l0p0x), denom);
    hitpt[1] = l0p0y+idiv128(imul128(alpha_num, l0p1y-l0p0y), denom);
    return 1;
  }
}

Despite the fact that the original implementation uses true 128-bit multiplication and division, with the latter implemented via a routine call, this one is slower. It might be that the use of asm inlines somehow interferes with the optimiser, so it can't properly order instructions for pipelining.

@alexisnaveros, could you please give it a try in your accuracy test?

AngusJohnson commented 1 year ago

As this saga still doesn't want to end

It has indeed been frustratingly difficult, but I think we've made considerable progress, and hopefully we're not too far from fully resolving this.

alexisnaveros commented 1 year ago

@sergey-239 Nice. There's this very minor difference:

Line0: 12500000000000 24999999999981 to 18750000000000 12499999999990
Line1: 12499999999990 25000000000000 to 18749999999991 12500000000008
 -=Intersection =-
We should get : 18749999999910 12500000000170
          PB3 : 18749999999909 12500000000170 ~ error 1
          PB4 : 18749999999909 12500000000171 ~ error 2
       BN1024 : 18749999999910 12500000000170 ~ error 0

which apparently can't be explained by a lack of rounding.

That new version is actually faster by 7% on my box, dual-Epyc 7532, gcc 11.3 compiled with -O3 -ffast-math -march=native -mtune=native

By the way, you can optimize this:

    if( alpha_num < 0 || beta_num < 0 || alpha_num > denom || beta_num > denom )
      return 0;

as:

    if( (unsigned __int128)alpha_num > denom || (unsigned __int128)beta_num > denom )
      return 0;

I have looked at the assembly, comparing PB3 and PB4. A big difference is that PB4 is constantly using the single-operand imulq form, producing a 128-bit RDX:RAX result, while PB3 emits some (faster) two-operand imulq producing just a 64-bit result, and some add-with-carry (adcq) to accumulate into a 128-bit RDX:RAX ready for the division.

That stuff is messy with just little bits of inline assembly. I probably would just write the whole thing in inline assembly; it's only about 150 instructions...

sergey-239 commented 1 year ago

as:

    if( (unsigned __int128)alpha_num > denom || (unsigned __int128)beta_num > denom )
      return 0;

Better:

if(((unsigned __int128)alpha_num | (unsigned __int128)beta_num) > denom) return 0;
alexisnaveros commented 1 year ago

I'm not quite sure that's better though. :)

My version was: cmp, jcc, cmp, jcc. Your version is: cmp, setcc, cmp, setcc, or, jcc.

Yes, you have only one branch, but it's more instructions to get there, and the tests should pretty much always be false. Still, if each condition had a random 50% chance of being true, then your single branch solution would indeed be better (taken 75% of the time).

EDIT: Wait sorry, I misread your code as:

if( ( (unsigned __int128)alpha_num > denom ) | ( (unsigned __int128)beta_num > denom ) ) return 0;

Doing an OR of the numbers themselves can produce something above denom, while each one was below "denom" to begin with. For example, alpha_num = 1 and beta_num = 2 are each <= denom = 2, yet 1 | 2 = 3 > denom. It could be a first early test before doing the more robust one though.

sergey-239 commented 1 year ago

nope, it must be or; cmp; jb

sergey-239 commented 1 year ago

Doing an OR of the numbers themselves can produce something above denom, while each one was below "denom" to begin with. It could be a first early test before doing the more robust one though.

I am giving up! You are correct

alexisnaveros commented 1 year ago

Hey, I improved your version a little bit:

/* denom must be >= 0 for proper rounding behavior */
__attribute__ ((always_inline)) static inline int64_t idiv128_round( __int128 num, int64_t denom )
{
  int64_t res, rem;
  asm (
  "idivq %2\n\t"
  : "=a"(res), "=d"(rem) : "rm" (denom), "A" (num) );
  res += ( rem >= ( denom >> 1 ) );
  res -= ( rem <= -( denom >> 1 ) );
  return res;
}

__attribute__ ((always_inline)) static inline  __int128 imul128( int64_t a, int64_t b )
{
  __int128 res;
  asm (
  "imulq %2\n\t"
  : "=A"(res) : "a" (a), "rm" (b) );
  return res;
}

static inline int mathVertexLineLineIntersection_PB5( int64_t *hitpt, int64_t l0p0x, int64_t l0p0y, int64_t l0p1x, int64_t l0p1y, int64_t l1p0x, int64_t l1p0y, int64_t l1p1x, int64_t l1p1y )
{
  __int128 denom;
  __int128 alpha_num;
  denom = imul128( l0p1x - l0p0x, l1p1y - l1p0y ) - imul128( l0p1y - l0p0y, l1p1x - l1p0x );
  if( denom == 0 )
  {
    // segments lay on parallel lines
    // special case with segments on the same line -- may be considered if necessary
    return 0;
  }
  else
  {
    alpha_num = imul128( l1p0x - l0p0x, l1p1y - l1p0y ) - imul128( l1p0y - l0p0y, l1p1x - l1p0x );
    if( denom < 0 )
    {
      alpha_num = -alpha_num;
      denom = -denom;
    }
    int64_t *high = (int64_t *)(&denom)+1;
    if( *high )
    {
      int shift = 65 - __builtin_clzll( *high );
      alpha_num >>= shift; 
      denom >>= shift;
    }
    hitpt[0] = l0p0x + idiv128_round( imul128( alpha_num, l0p1x - l0p0x ), denom );
    hitpt[1] = l0p0y + idiv128_round( imul128( alpha_num, l0p1y - l0p0y ), denom );
  }
  return 1;
}

It's no longer checking if the intersection point is within the two edges; I suspect Clipper2 wants the intersection point between the two vectors, as it has already decided that they intersect when that function is called.

The extra rounding corrects these tiny errors:

Line0: 12500000000000 24999999999981 to 18750000000000 12499999999990
Line1: 12499999999990 25000000000000 to 18749999999991 12500000000008
 -=Intersection =-
We should get : 18749999999910 12500000000170
          PB3 : 18749999999909 12500000000170 ~ error 1
          PB4 : 18749999999909 12500000000171 ~ error 2
          PB5 : 18749999999910 12500000000170 ~ error 0
       BN1024 : 18749999999910 12500000000170 ~ error 0
Benchmark A     : 51 ms
Benchmark PB3   : 212 ms
Benchmark PB4   : 203 ms
Benchmark PB5   : 205 ms
Benchmark BN1024: 32458 ms

Hey, don't laugh at my 1024 bits solution! It's fast, I'm telling you. 😁

sergey-239 commented 1 year ago

Hey, don't laugh at my 1024 bits solution! It's fast, I'm telling you.

Why should I?

@sergey-239 Nice. There's this very minor difference:

Line0: 12500000000000 24999999999981 to 18750000000000 12499999999990
Line1: 12499999999990 25000000000000 to 18749999999991 12500000000008
 -=Intersection =-
We should get : 18749999999910 12500000000170
          PB3 : 18749999999909 12500000000170 ~ error 1
          PB4 : 18749999999909 12500000000171 ~ error 2
       BN1024 : 18749999999910 12500000000170 ~ error 0

which apparently can't be explained by a lack of rounding.

The source is a loss of precision: when we scale the numerator and denominator by 2^n we just throw out the least significant bits of both, so of course the result can differ (for example, 5/8 = 0.625 becomes 2/4 = 0.5 after shifting both right by one). We could gain another bit of precision by scaling both the numerator and denominator to their 64 most significant bits (now we keep at most 63) and doing all further calculations with unsigned versions of div and mul.

sergey-239 commented 1 year ago

It's no longer checking if the intersection point is within the two edges; I suspect Clipper2 wants the intersection point between the two vectors, as it has already decided that they intersect when that function is called.

I think it's better to keep the optimized check commented out for clarity and record.

AngusJohnson commented 1 year ago

It's no longer checking if the intersection point is within the two edges; I suspect Clipper2 wants the intersection point between the two vectors, as it has already decided that they intersect when that function is called.

Yep. I suspect the best way forward is still to use algorithm 'A' but improve the bounds checking of the returned intersection point in AddNewIntersectNode. And this is what I'm tweaking at the moment. (The intersection point should only be wrong when the edges are very close to collinear.)

sergey-239 commented 1 year ago

@AngusJohnson if I understand it right, they are close to collinear when the dx difference is small, am I correct? Why not fall back to _PB in its best variant in such cases? This algorithm is based on the pure line equation Ax+By+C=0 and is the most correct I have ever seen...

alexisnaveros commented 1 year ago

I think it's better to keep the optimized check commented out for clarity and record.

For the record, absolutely! Once that issue is "closed", I think we could preserve it as a discussion? I'm sure people google how to compute the exact intersection point between lines. I can even use all this right away in other contexts, instead of the 'A' solution that required me to do a bunch of slow error correction post-processing.

@AngusJohnson The output of 'A' is also quite wrong when intersecting a tiny edge far from the origin, even if perpendicular.

Later on, maybe just get the max/min of all input points, set a function pointer to 'A' if min/max is within a short safe range, otherwise set to BP5, BP17, or Z?

sergey-239 commented 1 year ago

@alexisnaveros There is another point to optimise: in most cases the coords of the endpoints are in-memory Point64 structures (or alike), so x and y are laid out adjacently. In this algo we are computing lengths of projections, i.e. we calculate p1.x - p0.x and p1.y - p0.y. This could be done using SSE instructions. However, we then need to store the results so that they are separately accessible just after the subtraction, so:

  1. I am not sure that this would make any performance improvement as there will be load from mem, subtract from mem and then possibly unpacking to register allocated variables so 4 instructions per pair of points.
  2. I don't know how to make an allocator with specific alignment requirement. Do you have an idea?

alexisnaveros commented 1 year ago
1. I am not sure that this would make any performance improvement as there will be load from mem, subtract from mem and then possibly unpacking to register allocated variables so 4 instructions per pair of points.

No, you need a lot more math using full length vectors (128, 256 bits) to be worth it. But it's not a serious problem if access is not perfectly vertical (I built "xy", but then I need "yx", "xx", "yy"); there are shuffle/swizzle/unpack instructions for that. My SSE versions of mathVertexLineLineIntersection_* were fully vectorized (2xdouble or 2xint64_t).

2. I don't know how to make an allocator with specific alignment requirement.
   Do you have an idea?

Sure, here's a minimalist one:

#include <stdint.h>  /* intptr_t */
#include <stdlib.h>  /* malloc, free, size_t */

typedef struct
{
  int padding;
} mmAlign;

#define ADDRESS(p,o) ((void *)(((char *)p)+(o)))
#define ADDRESSDIFF(a,b) (((char *)a)-((char *)b))

void *mmAlignAlloc( size_t bytes, intptr_t align )
{
  intptr_t i;
  void *v;
  mmAlign *malign;
  align--;
  if( !( v = malloc( bytes + align + sizeof(mmAlign) ) ) )
    return 0;
  i = ( (intptr_t)v + align + sizeof(mmAlign) ) & ~align;
  malign = ADDRESS( (void *)i, -(int)sizeof(mmAlign) );
  malign->padding = ADDRESSDIFF( i, v );
  return (void *)i;
}

void mmAlignFree( void *v )
{
  mmAlign *malign;
  if( v )
  {
    malign = ADDRESS( v, -(int)sizeof(mmAlign) );
    free( ADDRESS( v, -(int)malign->padding ) );
  }
  return;
}

Call with align=0x10 for SSE, align=0x20 for AVX, align=0x40 for AVX-512 or for cache line alignment, or even align=4096 for page alignment.

sergey-239 commented 1 year ago

No, you need a lot more math using full length vectors (128, 256 bits) to be worth it. But it's not a serious problem if access is not perfectly vertical (I built "xy", but then I need "yx", "xx", "yy"); there are shuffle/swizzle/unpack instructions for that. My SSE versions of mathVertexLineLineIntersection_* were fully vectorized (2xdouble or 2xint64_t).

I am speaking about further PB5 improvement. In that case this would give no performance improvement, as all four points are not necessarily allocated contiguously.

Sure, here's a minimalist one:

No, I meant a C++ STL-style allocator. It seems it's quite easy; I just found:

struct alignas(16) Point64 {
    int64_t x,y;
};
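If I understand it right (my reading of the standard, not something I have tested against Clipper2 itself), since C++17 the default allocator honours alignas even beyond 16 bytes, so a std::vector of such a struct should come out correctly aligned; a quick sanity check:

#include <cstdint>
#include <cstdio>
#include <vector>

struct alignas(32) PointPair {      // e.g. two Point64-style points packed for AVX
    int64_t x0, y0, x1, y1;
};

int main()
{
    std::vector<PointPair> v(4);
    bool ok = reinterpret_cast<std::uintptr_t>(v.data()) % alignof(PointPair) == 0;
    std::printf("aligned: %d\n", ok ? 1 : 0);
}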

Although I first read C++-related books at the end of the eighties, most of my life I was programming in C, so now I am catching up...

alexisnaveros commented 1 year ago

I am speaking about further PB5 improvement.

I can still see minor improvements. Like eliminating that if( denom < 0 ) { alpha_num = -alpha_num; denom = -denom; } branch through a propagation of the sign bit by >> 63 of the high register, xor'ing both halves with that, and += that sign bit & 1. But these int128_t types sitting in two registers is not very C-style, I'll see if I can coerce the compiler to emit the right code.

Despite of I first read the C++ related books in the end of eighties, the most time of my life I was programming in C, so now I am catching up...

Eh, I'm in high performance computing and it's all C. Most colleagues do write C-style C++ (or C with classes), where stl is off-limits (all memory management goes through NUMA-aware code, with huge page support, etc. so forget new/delete/malloc/free).

sergey-239 commented 1 year ago

I can still see minor improvements. Like eliminating that if( denom < 0 ) { alpha_num = -alpha_num; denom = -denom; } branch through a propagation of the sign bit by >> 63 of the high register, xor'ing both halves with that, and += that sign bit & 1. But these int128_t types sitting in two registers is not very C-style, I'll see if I can coerce the compiler to emit the right code.

This makes nearly no notable change in performance. It seems that the cost of a pipeline flush is about equal to the unconditional arithmetic evaluations:

    {
      /* branchless conditional negation */
      int64_t mask = (int64_t)(denom >> 64) >> 63;
      denom = (denom ^ mask) - mask;
      alpha_num = (alpha_num ^ mask) - mask;
    }

btw, my idea somewhere above about using unsigned mul and div is incorrect: l0p1x - l0p0x and l0p1y - l0p0y still may be negative

Edit: forgot to note that I did the test of the above branchless negation by alternating the endpoints of one of the segments on each iteration, so denom is negative every second pass.

Edit2: GCC's optimiser is smart enough to get that mask is either all ones or all zeroes (note the absence of mask sign propagation and the sequence of subq-sbbq):

# /home/rpmbuild/tmp/testcross/testinf.cpp:491:       int64_t mask = (int64_t)(denom >> 64) >> 63;
    movq    %r13, %rax  # denom, _7
    sarq    $63, %rax   #, mask
# /home/rpmbuild/tmp/testcross/testinf.cpp:492:       denom = (denom ^ mask) - mask;
    movq    %rax, %rcx  # mask, _9
    xorq    %r12, %rax  # denom, tmp190
    movq    %rax, %rsi  # tmp190, _10
    movq    %r13, %rax  # denom, tmp191
    xorq    %rcx, %rax  # _9, tmp191
    movq    %rax, %rdi  # tmp191, _10
    subq    %rcx, %rsi  # _9, _10
    sbbq    %rcx, %rdi  # _9, _10
    movq    %rsi, -24(%rsp) # denom, %sfp
    movq    %rdi, -16(%rsp) # denom, %sfp
alexisnaveros commented 1 year ago

Oh nice, GCC has become pretty smart. Although... what's up with all these extra "mov" everywhere? And we don't need to "sarx" the higher half of denom/alpha, we are bringing it down to 64 bits. Silly compiler.

The branchless version should be faster with an unpredictable if( denom < 0 ), which should be the case with real data...

I think I'm going to try something totally silly, give me 1-2 hours.

sergey-239 commented 1 year ago

Oh nice, GCC has become pretty smart. Although... what's up with all these extra "mov" everywhere? And we don't need to "sarx" the higher half of denom/alpha, we are bringing it down to 64 bits. Silly compiler.

The branchless version should be faster with an unpredictable if( denom < 0 ), which should be the case with real data...

I think I'm going to try something totally silly, give me 1-2 hours.

The sar of the upper half was to fill the mask with the sign of denom, so the compiler did it correctly. The problem with inline assembly is that you can't refer to an int128 argument other than as rax:rdx (unless you rewrite the whole function in assembly). I also realized that by using idiv with an int64 denom in the current implementation we introduced a bug when denom is in the range 2^63..2^64-1, i.e. it is a negative number when read as 64-bit signed. So we need to redesign it a bit.

alexisnaveros commented 1 year ago

unless you rewrite the whole function in assembly

Hum yeah... About that 😆

static inline int mathVertexLineLineIntersection_ASM1( int64_t *hitpt, int64_t l0p0x, int64_t l0p0y, int64_t l0p1x, int64_t l0p1y, int64_t l1p0x, int64_t l1p0y, int64_t l1p1x, int64_t l1p1y )
{
  int retval;
  int64_t hitptx, hitpty;

  __asm__ __volatile__(
    /* rax = l0p1x - l0p0x; (preserve to r11) */
    /* rbx = l1p1y - l1p0y; (preserve rbx) */
    "movq %5, %%rax\n"
    "movq %10, %%rbx\n"
    "subq %3, %%rax\n"
    "subq %8, %%rbx\n"
    "movq %%rax, %%r11\n"
    /* r10:rdi(denom) = ( l0p1x - l0p0x ) * ( l1p1y - l1p0y ); */
    "imulq %%rbx\n"
    "movq %%rax, %%rdi\n"
    "movq %%rdx, %%r10\n"
    /* rax = l0p1y - l0p0y; (preserve to r12) */
    /* r13 = l1p1x - l1p0x; (preserve r13) */
    "movq %6, %%rax\n"
    "movq %9, %%r13\n"
    "subq %4, %%rax\n"
    "subq %7, %%r13\n"
    "movq %%rax, %%r12\n"
    /* rdx:rax = ( l0p1y - l0p0y ) * ( l1p1x - l1p0x ); */
    "imulq %%r13\n"
    /* r10:rdi(denom) = imul() - imul(); */
    /* retval=0; if( denom == 0 ), jmp end; */
    /* rax = l1p0x - l0p0x; */
    "subq %%rax, %%rdi\n"
    "movq %%rdi, %%r8\n"
    "movq %7, %%rax\n"
    "sbbq %%rdx, %%r10\n"
    "subq %3, %%rax\n"
    "orq %%r10, %%r8\n"
    "jne 1f\n"
    /* eax(retval) = 0; (output register) */
    "xorl %%eax, %%eax\n"
    "jmp 3f\n"

    ".p2align 4\n"
    "1:\n"
    /* rsi(xormask) = r10(denom.high) >> 63; */
    /* rbx = l1p1y - l1p0y; (preserved) */
    /* r9:rbx(alpha) = ( l0p1x - l0p0x ) * ( l1p1y - l1p0y ); */
    "movq %%r10, %%rsi\n"
    "imulq %%rbx\n"
    "movq %%rax, %%rbx\n"
    "movq %%rdx, %%r9\n"
    "sarq $63, %%rsi\n"
    /* rax = l1p0y - l0p0y; */
    "movq %8, %%rax\n"
    "subq %4, %%rax\n"
    /* r13 = l1p1x - l1p0x; (preserved) */
    /* rdx:rax = ( l1p0y - l0p0y ) * ( l1p1x - l1p0x ); */
    "imulq %%r13\n"
    /* r10:rdi(denom) ^= xormask; */
    "xorq %%rsi, %%rdi\n"
    "xorq %%rsi, %%r10\n"
    /* r9:rbx(alpha) = imul() - imul(); */
    "subq %%rax, %%rbx\n"
    "sbbq %%rdx, %%r9\n"
    /* r10:rdi(denom) += xormask & 1; */
    "subq %%rsi, %%rdi\n"
    "sbbq %%rsi, %%r10\n"
    /* r9:rbx(alpha) ^= xormask; */
    "xorq %%rsi, %%rbx\n"
    "xorq %%rsi, %%r9\n"
    /* rax = r11 = l0p1x - l0p0x; (preserved) */
    "movq %%r11, %%rax\n"
    /* r9:rbx(alpha) += xormask & 1; */
    "subq %%rsi, %%rbx\n"
    "sbbq %%rsi, %%r9\n"
    /* if( denom.high == 0 ), jmp skip; */
    "testq %%r10, %%r10\n"
    "je 2f\n"
    /* ecx(shift) = 65 - __builtin_clzll( denom.high ); */
    "lzcntq %%r10, %%rsi\n"
    "movl $65, %%ecx\n"
    "subl %%esi, %%ecx\n"
    /* r9:rbx(alpha) >>= shift, feeding bits from r9(alpha.high) */
    "shrdq %%r9, %%rbx\n"
    /* r10:rdi(denom) >>= shift, feeding bits from r10(denom.high) */
    "shrdq %%r10, %%rdi\n"
    /* branch done ~ if( denom.high == 0 ) */
    "2:\n"

    /* rcx = denom >> 1; */
    /* rdx:rax(hitfx) = rbx(alpha.low) * ( l0p1x - l0p0x ); */
    "movq %%rdi, %%rcx\n"
    "imulq %%rbx\n"
    "sarq %%rcx\n"
    /* rsi = -denom >> 1; */
    /* rdx:rax(hitx) = hitfx / rdi(denom.low) */
    /* r8d = 0; r9d = 0; bits will be set for rounding */
    "movq %%rcx, %%rsi\n"
    "idivq %%rdi\n"
    "xorl %%r8d, %%r8d\n"
    "xorl %%r9d, %%r9d\n"
    "negq %%rsi\n"
    /* rax(hitx) += l0p0x; */
    "addq %3, %%rax\n"
    /* rax(hitx) += ( rdx(rem) >= ( denom >> 1 ) ); */
    /* rax(hitx) -= ( rdx(rem) <= -( denom >> 1 ) ); */
    "cmpq %%rcx, %%rdx\n"
    "setge %%r8b\n"
    "cmpq %%rsi, %%rdx\n"
    "setle %%r9b\n"
    "addq %%r8, %%rax\n"
    "subq %%r9, %%rax\n"
    /* %1 = hitptx; (output) */
    "movq %%rax, %1\n"
    /* r12 = l0p1y - l0p0y; (preserved) */
    "movq %%r12, %%rax\n"
    /* rdx:rax(hitfy) = rbx(alpha.low) * ( l0p1y - l0p0y ); */
    "imulq %%rbx\n"
    /* rdx:rax(hity) = hitfy / rdi(denom.low) */
    "idivq %%rdi\n"
    /* hity += l0p0y; */
    "addq %4, %%rax\n"
    /* rax(hity) += ( rdx(rem) >= ( denom >> 1 ) ); */
    /* rax(hity) -= ( rdx(rem) <= -( denom >> 1 ) ); */
    "cmpq %%rcx, %%rdx\n"
    "setge %%r8b\n"
    "cmpq %%rsi, %%rdx\n"
    "setle %%r9b\n"
    "addq %%r8, %%rax\n"
    "subq %%r9, %%rax\n"
    /* %2 = hitpty; (output) */
    "movq %%rax, %2\n"
    /* eax(retval) = 1; (output register) */
    "movl $1, %%eax\n"
    /* exit */
    "3:\n"

    : "=a"(retval), "=&g"(hitptx), "=g"(hitpty)
    : "m" (l0p0x), "m" (l0p0y), "m" (l0p1x), "m" (l0p1y), "m" (l1p0x), "m" (l1p0y), "m" (l1p1x), "m" (l1p1y)
    : "%rbx", "%rcx", "%rdx", "%rdi", "%rsi", "%r8", "%r9", "%r10", "%r11", "%r12", "%r13", "cc", "memory" );

  hitpt[0] = hitptx;
  hitpt[1] = hitpty;
  return retval;
}

Just a quick version for a laugh (hey, you are the one who started with inline assembly!).

The count of instructions went down by 30% compared to PB5, but it's only a couple percent faster (since modern CPUs are able to execute a ton of non-dependent superfluous instructions in parallel). I'm sure it can be tweaked, though I don't expect much gain.


EDIT: Tiny bug fix.

Also, quick extra note, 73% of the CPU time is spent waiting for the results of the two idiv r64 on my Zen2-based desktop, where idiv r64 has a latency of 44 cycles. But I realized idiv r64 has a latency of only 17 cycles (!!!) on Zen3-based chips.

I just checked on a Zen3-based Epyc, and the perfectly accurate integer solution is as fast as the various faulty double-based solutions (and divpd is 13 cycles everywhere, both AMD and Intel, since about forever).

AngusJohnson commented 1 year ago

@sergey-239 ...

@AngusJohnson if I understand it right they are close to collinear when the dx difference is small, am I correct? Why not fallback to _PB in its best variant in such cases? This algorithm is based on pure line equation Ax+By+C=0 and the most correct I have ever seen...

Yes, that would be ideal. However the real problem is determining "when the dx difference is small" because it's anything but linear. In fact it's arctan, and IMHO that's way too time costly to be useful.

Edit: Perhaps we could use a very small det in algorithm _A as the cue to branch to a more precise (but time costly) algorithm (for those who need that kind of precision).

AngusJohnson commented 1 year ago

Unless there are objections from current participants, I'll be closing this Issue soon, not because I think it's fully resolved, because of course it isn't 😁. But I'm hoping the integer overflow issue is resolved. And if I'm wrong then please start a new thread for that. But a large part of this Issue has devolved into how best to determine intersection points at the extremes of the integer range and when edges are very close to collinear. I suggest that further discussion on that topic should be done in a new Discussion topic.

alexisnaveros commented 1 year ago

Sure, I would also consider this problem resolved.

Ryzen 3 benchmark, pair of short lines intersecting far from origin:

Benchmark A     : 1312 ms ( error 53494 )
Benchmark B     : 2089 ms ( error 0 )
Benchmark C     : 2098 ms ( error 0 )
Benchmark D     : 2474 ms ( error 0 )
Benchmark E     : 2329 ms ( error 0 )
Benchmark F     : 2135 ms ( error 361 )
Benchmark G     : 2102 ms ( error 0 )
Benchmark H     : 3164 ms ( error 0 )
Benchmark PB3   : 3680 ms ( error 0 )
Benchmark PB4   : 1992 ms ( error 0 )
Benchmark PB5   : 1696 ms ( error 0 )
Benchmark PB6   : 1663 ms ( error 0 )
Benchmark ASM1  : 1559 ms ( error 0 )
Benchmark ASM2  : 1525 ms ( error 0 )

Ryzen 3 benchmark, pair of long lines intersecting far from origin:

Benchmark A     : 1337 ms ( error 6830365379301454 )
Benchmark B     : 2139 ms ( error 9231536556 )
Benchmark C     : 2139 ms ( error 2097591017180 )
Benchmark D     : 2500 ms ( error 9397748378 )
Benchmark E     : 2362 ms ( error 2088963303 )
Benchmark F     : 2171 ms ( error 2088963316 )
Benchmark G     : 2141 ms ( error 2812596175876 )
Benchmark H     : 3267 ms ( error 5218710608 )
Benchmark PB3   : 3643 ms ( error 1 )
Benchmark PB4   : 2891 ms ( error 2 )
Benchmark PB5   : 2874 ms ( error 0 )
Benchmark PB6   : 2825 ms ( error 0 )
Benchmark ASM1  : 2812 ms ( error 0 )
Benchmark ASM2  : 2796 ms ( error 0 )

(the performance of idiv depends on the operands, and older chips with a slow idiv can be 30% slower at the PB-ASM solutions)

People can pick the solution that best fits their needs. Thanks everyone!

AngusJohnson commented 1 year ago

I think it'd be very helpful if someone starts a Discussion on the topic of optimized intersection detection. (I won't be doing this myself as I don't understand the nuances sufficiently to do it justice.)

alexisnaveros commented 1 year ago

Edit: Perhaps we could use a very small det in algorithm _A as the cue to branch to a more precise (but time costly) algorithm (for those who need that kind of precision).

You could also consider just always using BP5 or BP6... It's only 30-60% (depends on CPU) slower than _A when the idiv r64 operands are small. And if the operands are large (that's when _A totally falls apart), it is going to run 100-200% slower than _A, but providing a correct result (even properly rounded).

I think it might just not be worth detecting inaccuracy to branch between fast and slow paths?

(these 30-60% and 100-200% ranges depend on the specific CPU's performance with idiv r64, unlike divpd, which has always had optimized hardware)

AngusJohnson commented 1 year ago

You could also consider just always using BP5 or BP6... It's only 30-60% (depends on CPU) slower than _A

I can't find either BP5 or BP6 in this massive thread. Some algorithms described here seem to require other libraries (eg for 128bit integers) or specific compilers, and these won't be considered as part of the core Clipper2 library. (Of course, they could be offered as customizations for those who need very high levels of precision.)

alexisnaveros commented 1 year ago

I had to look that up; it appears the Microsoft compiler doesn't support a 128-bit integer type. But it has the intrinsics _mul128(), _udiv128(), shiftright128(), _addcarry_u64(), _subborrow_u64(), lzcnt64(), which could be used to implement everything.

So it could be done... although with a #if for MSVC, another for gcc/clang, and a generic fallback.
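Something along these lines for the 64x64 -> 128 multiply, say (just a sketch of the shape; the MSVC branch relies on the _mul128 intrinsic from <intrin.h>):

#include <stdint.h>

typedef struct { int64_t lo; int64_t hi; } i128;   /* my own little carrier type */

#if defined(_MSC_VER) && defined(_M_X64)
 #include <intrin.h>
 static inline i128 mul_64x64(int64_t a, int64_t b)
 {
   i128 r;
   r.lo = _mul128(a, b, &r.hi);                    /* signed 64x64 -> 128 */
   return r;
 }
#else
 static inline i128 mul_64x64(int64_t a, int64_t b)
 {
   __int128 p = (__int128)a * b;                   /* gcc/clang native 128-bit type */
   i128 r = { (int64_t)p, (int64_t)(p >> 64) };
   return r;
 }
#endif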

Here's BP5 and BP6:

static inline int64_t idiv128( __int128 num, int64_t denom )
{
  int64_t res;
  asm (
  "idivq %1\n\t"
  : "=a"(res) : "rm" (denom), "A" (num) );
  return res;
}

/* denom must be >= 0 for proper rounding behavior */
static inline int64_t idiv128_round( __int128 num, int64_t denom )
{
  int64_t res, rem;
  asm (
  "idivq %2\n\t"
  : "=a"(res), "=d"(rem) : "rm" (denom), "A" (num) );
  res += ( rem >= ( denom >> 1 ) );
  res -= ( rem <= -( denom >> 1 ) );
  return res;
}

static inline  __int128 imul128( int64_t a, int64_t b )
{
  __int128 res;
  asm (
  "imulq %2\n\t"
  : "=A"(res) : "a" (a), "rm" (b) );
  return res;
}

/* GCC is smart enough to grab the higher half of the __int128 register pair, without any store/load/shift/whatever */
/* Can all compilers do that? Feel free to tweak as necessary for your compiler to emit good code */
#if 1
 #define MATH_INT128_GET_HIGH64(x) (int64_t)((x)>>64)
#else
 #define MATH_INT128_GET_HIGH64(x) (int64_t)(*((int64_t *)(&x)+1))
#endif

static inline int mathVertexLineLineIntersection_PB5( int64_t *hitpt, int64_t l0p0x, int64_t l0p0y, int64_t l0p1x, int64_t l0p1y, int64_t l1p0x, int64_t l1p0y, int64_t l1p1x, int64_t l1p1y )
{
  int64_t xormask, denomhigh;
  __int128 denom;
  __int128 alpha_num;
  denom = imul128( l0p1x - l0p0x, l1p1y - l1p0y ) - imul128( l0p1y - l0p0y, l1p1x - l1p0x );
  if( denom == 0 )
  {
    // segments lay on parallel lines
    // special case with segments on the same line -- may be considered if necessary
    return 0;
  }
  else
  {
    alpha_num = imul128( l1p0x - l0p0x, l1p1y - l1p0y ) - imul128( l1p0y - l0p0y, l1p1x - l1p0x );
    xormask = MATH_INT128_GET_HIGH64( denom ) >> 63;
    alpha_num = ( alpha_num ^ xormask ) - xormask;
    denom = ( denom ^ xormask ) - xormask;
#if ABORT_IF_HIT_IS_OUTSIDE_LINE_SEGMENTS
    __int128 beta_num;
    if( (unsigned __int128)alpha_num > denom )
      return 0;
    beta_num  = imul128( l1p0x - l0p0x, l0p1y - l0p0y ) - imul128( l1p0y - l0p0y, l0p1x - l0p0x );
    if( (unsigned __int128)beta_num > denom )
      return 0;
#endif
    denomhigh = MATH_INT128_GET_HIGH64( denom );
    if( denomhigh )
    {
      int shift = 65 - __builtin_clzll( denomhigh );
      alpha_num >>= shift;
      denom >>= shift;
    }
    hitpt[0] = l0p0x + idiv128_round( imul128( alpha_num, l0p1x - l0p0x ), denom );
    hitpt[1] = l0p0y + idiv128_round( imul128( alpha_num, l0p1y - l0p0y ), denom );
  }
  return 1;
}

static inline int mathVertexLineLineIntersection_PB6( int64_t *hitpt, int64_t l0p0x, int64_t l0p0y, int64_t l0p1x, int64_t l0p1y, int64_t l1p0x, int64_t l1p0y, int64_t l1p1x, int64_t l1p1y )
{
  int shift;
  int64_t xormask, denomhigh, denomlow;
  int64_t signbit, halfdenom;
  __int128 denom;
  __int128 alpha_num;
  __int128 hitx, hity;
  denom = imul128( l0p1x - l0p0x, l1p1y - l1p0y ) - imul128( l0p1y - l0p0y, l1p1x - l1p0x );
  if( denom == 0 )
  {
    // segments lay on parallel lines
    // special case with segments on the same line -- may be considered if necessary
    return 0;
  }
  else
  {
    alpha_num = imul128( l1p0x - l0p0x, l1p1y - l1p0y ) - imul128( l1p0y - l0p0y, l1p1x - l1p0x );
    xormask = MATH_INT128_GET_HIGH64( denom ) >> 63;
    alpha_num = ( alpha_num ^ xormask ) - xormask;
    denom = ( denom ^ xormask ) - xormask;
#if ABORT_IF_HIT_IS_OUTSIDE_LINE_SEGMENTS
    __int128 beta_num;
    if( (unsigned __int128)alpha_num > denom )
      return 0;
    beta_num  = imul128( l1p0x - l0p0x, l0p1y - l0p0y ) - imul128( l1p0y - l0p0y, l0p1x - l0p0x );
    if( (unsigned __int128)beta_num > denom )
      return 0;
#endif
    denomhigh = MATH_INT128_GET_HIGH64( denom );
    if( denomhigh )
    {
      shift = 65 - __builtin_clzll( denomhigh );
      alpha_num >>= shift;
      denom >>= shift;
    }
    denomlow = (int64_t)denom;
    halfdenom = denomlow >> 1;
    hitx = imul128( alpha_num, l0p1x - l0p0x );
    signbit = MATH_INT128_GET_HIGH64( hitx ) >> 63;
    hitx += ( halfdenom ^ signbit ) - signbit;
    hitpt[0] = l0p0x + idiv128( hitx, denomlow );
    hity = imul128( alpha_num, l0p1y - l0p0y );
    signbit = MATH_INT128_GET_HIGH64( hity ) >> 63;
    hity += ( halfdenom ^ signbit ) - signbit;
    hitpt[1] = l0p0y + idiv128( hity, denomlow );
  }
  return 1;
}

The difference between the two is just rounding occurring after the division (PB5) or before (PB6).
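A toy illustration of that difference on small numbers (not the PB code itself): rounding after the division inspects the remainder, while rounding before the division biases the numerator by half the denominator.

#include <cstdint>
#include <cstdio>

// round(num/denom) for denom > 0, computed both ways
static int64_t round_after(int64_t num, int64_t denom)
{
  int64_t q = num / denom, r = num % denom;     // C++ division truncates toward zero
  q += (r >= (denom >> 1));
  q -= (r <= -(denom >> 1));
  return q;
}
static int64_t round_before(int64_t num, int64_t denom)
{
  int64_t half = denom >> 1;
  return (num + (num >= 0 ? half : -half)) / denom;
}

int main()
{
  std::printf("%lld %lld\n", (long long)round_after( 7, 4), (long long)round_before( 7, 4));  //  2  2
  std::printf("%lld %lld\n", (long long)round_after(-7, 4), (long long)round_before(-7, 4));  // -2 -2
}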

sergey-239 commented 1 year ago

@AngusJohnson @alexisnaveros The PB5+ still has to be fixed: for the cases when denom or alpha_num are in the range 2^63..2^64-1 the result will be wrong! I have one idea; I will check it tomorrow (I am too sleepy now). @alexisnaveros, I tested your asm implementation and unfortunately it performs much worse than PB5 (2.5 times) when the point coordinates have to be read from independent memory locations (well, x and y of the same point are adjacent, but the points are not in the same cache lines). I remember that you did not expect it to be a killer; I just wanted to inform you.

alexisnaveros commented 1 year ago

@sergey-239 Oh, you mentioned that before and I finally see what you mean, sorry. A positive value between 2^63..2^64-1 will be interpreted as a negative for the imul/idiv. Yeah, we still need to right shift by one when the top 64 bits are all zeroes and the 64th bit is set.

The asm implementation isn't doing anything different with memory; cache misses should hit all paths equally... Unless you mean something happens when gcc tries to keep the 4 pointers to the 4 sets of XY in 4 registers (for the asm block's operands), and there are only 3 free registers because the asm block clobbers 12 of them. Interesting, now I'm curious. I didn't want to give pointers to the asm block because I figured some values could already be in registers (in the case of Clipper2 anyway).


EDIT: Quick fix for 2^63..2^64-1 issue, let me know if you see a better way.

__attribute__ ((always_inline)) static inline int64_t idiv128( __int128 num, int64_t denom )
{
  int64_t res;
  asm (
  "idivq %1\n\t"
  : "=a"(res) : "rm" (denom), "A" (num) );
  return res;
}

__attribute__ ((always_inline)) static inline  __int128 imul128( int64_t a, int64_t b )
{
  __int128 res;
  asm (
  "imulq %2\n\t"
  : "=A"(res) : "a" (a), "rm" (b) );
  return res;
}

/* GCC is smart enough to grab the higher half of the __int128 register pair, without any store/load/shift/whatever */
/* Can all compilers do that? Feel free to tweak as necessary for your compiler to emit good code */
#if 1
 #define MATH_INT128_GET_HIGH64(x) (int64_t)((x)>>64)
#else
 #define MATH_INT128_GET_HIGH64(x) (int64_t)(*((int64_t *)(&x)+1))
#endif

static inline int mathVertexLineLineIntersection_PB6( int64_t *hitpt, int64_t l0p0x, int64_t l0p0y, int64_t l0p1x, int64_t l0p1y, int64_t l1p0x, int64_t l1p0y, int64_t l1p1x, int64_t l1p1y )
{
  int shift;
  int64_t xormask, highmask, denomlow, alphalow;
  int64_t signbit, halfdenom;
  __int128 denom;
  __int128 alpha;
  __int128 hitx, hity;
  denom = imul128( l0p1x - l0p0x, l1p1y - l1p0y ) - imul128( l0p1y - l0p0y, l1p1x - l1p0x );
  if( denom == 0 )
  {
    // segments lay on parallel lines
    // special case with segments on the same line -- may be considered if necessary
    return 0;
  }
  else
  {
    alpha = imul128( l1p0x - l0p0x, l1p1y - l1p0y ) - imul128( l1p0y - l0p0y, l1p1x - l1p0x );
    xormask = MATH_INT128_GET_HIGH64( denom ) >> 63;
    alpha = ( alpha ^ xormask ) - xormask;
    denom = ( denom ^ xormask ) - xormask;
#if ABORT_IF_HIT_IS_OUTSIDE_LINE_SEGMENTS
    __int128 beta;
    if( (unsigned __int128)alpha > denom )
      return 0;
    beta  = imul128( l1p0x - l0p0x, l0p1y - l0p0y ) - imul128( l1p0y - l0p0y, l0p1x - l0p0x );
    beta = ( beta ^ xormask ) - xormask;
    if( (unsigned __int128)beta > denom )
      return 0;
#endif
    highmask = MATH_INT128_GET_HIGH64( denom );
    if( highmask )
    {
      shift = 65 - __builtin_clzll( highmask );
      denom >>= shift;
      denomlow = (int64_t)denom;
    }
    else
    {
      shift = (uint64_t)( (int64_t)denom | ( MATH_INT128_GET_HIGH64( alpha ) ^ (int64_t)alpha ) ) >> 63;
      denomlow = (int64_t)denom;
      if( !shift )
        goto shiftzero;
      denomlow = (int64_t)( (uint64_t)denomlow >> shift );
    }
    alpha >>= shift;
    shiftzero:
    alphalow = (int64_t)alpha;
    halfdenom = denomlow >> 1;
    hitx = imul128( alphalow, l0p1x - l0p0x );
    signbit = MATH_INT128_GET_HIGH64( hitx ) >> 63;
    hitx += ( halfdenom ^ signbit ) - signbit;
    hitpt[0] = l0p0x + idiv128( hitx, denomlow );
    hity = imul128( alphalow, l0p1y - l0p0y );
    signbit = MATH_INT128_GET_HIGH64( hity ) >> 63;
    hity += ( halfdenom ^ signbit ) - signbit;
    hitpt[1] = l0p0y + idiv128( hity, denomlow );
  }
  return 1;
}

static inline int mathVertexLineLineIntersection_ASM2( int64_t *hitpt, int64_t l0p0x, int64_t l0p0y, int64_t l0p1x, int64_t l0p1y, int64_t l1p0x, int64_t l1p0y, int64_t l1p1x, int64_t l1p1y )
{
  int retval;
  int64_t hitptx, hitpty;

  __asm__ __volatile__(
    /* rax = l0p1x - l0p0x; (preserve to r11) */
    /* rbx = l1p1y - l1p0y; (preserve rbx) */
    "movq %5, %%rax\n"
    "movq %10, %%rbx\n"
    "subq %3, %%rax\n"
    "subq %8, %%rbx\n"
    "movq %%rax, %%r11\n"
    /* r10:rdi(denom) = ( l0p1x - l0p0x ) * ( l1p1y - l1p0y ); */
    "imulq %%rbx\n"
    "movq %%rax, %%rdi\n"
    "movq %%rdx, %%r10\n"
    /* rax = l0p1y - l0p0y; (preserve to r12) */
    /* r13 = l1p1x - l1p0x; (preserve r13) */
    "movq %6, %%rax\n"
    "movq %9, %%r13\n"
    "subq %4, %%rax\n"
    "subq %7, %%r13\n"
    "movq %%rax, %%r12\n"
    /* rdx:rax = ( l0p1y - l0p0y ) * ( l1p1x - l1p0x ); */
    "imulq %%r13\n"
    /* r10:rdi(denom) = imul() - imul(); */
    /* retval=0; if( denom == 0 ), jmp end; */
    /* rax = l1p0x - l0p0x; */
    "subq %%rax, %%rdi\n"
    "movq %%rdi, %%rbp\n"
    "movq %7, %%rax\n"
    "sbbq %%rdx, %%r10\n"
    "subq %3, %%rax\n"
    "orq %%r10, %%rbp\n"
    "jne 1f\n"
    /* eax(retval) = 0; (output register) */
    "xorl %%eax, %%eax\n"
    "jmp 3f\n"

    ".p2align 4\n"
    "1:\n"
    /* rsi(xormask) = r10(denom.high) >> 63; */
    /* rbx = l1p1y - l1p0y; (preserved) */
    /* r9:rbx(alpha) = ( l0p1x - l0p0x ) * ( l1p1y - l1p0y ); */
    "movq %%r10, %%rsi\n"
    "imulq %%rbx\n"
    "movq %%rax, %%rbx\n"
    "movq %%rdx, %%r9\n"
    "sarq $63, %%rsi\n"
    /* rax = l1p0y - l0p0y; */
    "movq %8, %%rax\n"
    "subq %4, %%rax\n"
    /* r13 = l1p1x - l1p0x; (preserved) */
    /* rdx:rax = ( l1p0y - l0p0y ) * ( l1p1x - l1p0x ); */
    "imulq %%r13\n"
    /* r10:rdi(denom) ^= xormask; */
    "xorq %%rsi, %%rdi\n"
    "xorq %%rsi, %%r10\n"
    /* r9:rbx(alpha) = imul() - imul(); */
    "subq %%rax, %%rbx\n"
    "sbbq %%rdx, %%r9\n"
    /* r10:rdi(denom) += xormask & 1; */
    "subq %%rsi, %%rdi\n"
    "sbbq %%rsi, %%r10\n"
    /* r9:rbx(alpha) ^= xormask; */
    "xorq %%rsi, %%rbx\n"
    "xorq %%rsi, %%r9\n"
    /* rax = r11 = l0p1x - l0p0x; (preserved) */
    "movq %%r11, %%rax\n"
    /* r9:rbx(alpha) += xormask & 1; */
    "subq %%rsi, %%rbx\n"
    "sbbq %%rsi, %%r9\n"

    /* if( denom.high == 0 ), jmp skip; */
    "testq %%r10, %%r10\n"
    "je 2f\n"
    /* branch ( denom.high != 0 ) */
    /* ecx(shift) = 65 - __builtin_clzll( denom.high ); */
    "lzcntq %%r10, %%rsi\n"
    "movl $65, %%ecx\n"
    "subl %%esi, %%ecx\n"
    /* r10:rdi(denom) >>= shift, feeding bits from r10(denom.high) */
    "shrdq %%r10, %%rdi\n"
    "jmp 4f\n"

    ".p2align 4\n"
    "2:\n"
    /* branch ( denom.high == 0 ) */
    /* shift = (uint64_t)( (int64_t)denom | ( MATH_INT128_GET_HIGH64( alpha ) ^ (int64_t)alpha ) ) >> 63; */
    "movq %%r9, %%rcx\n"
    "xorq %%rbx, %%rcx\n"
    "orq %%rdi, %%rcx\n"
    "shrq $63, %%rcx\n"
    /* skip if %%cl==0 */
    "jz 5f\n"
    /* rdi(denom.low) >>= shift */
    "shrq %%cl, %%rdi\n"

    /* branch done ~ if( denom.high == 0 ) */
    "4:\n"
    /* r9:rbx(alpha) >>= shift, feeding bits from r9(alpha.high) */
    "shrdq %%r9, %%rbx\n"
    /* target, if skip %%cl==0 */
    "5:\n"
    /* rcx = denom >> 1; */
    /* rdx:rax(hitfx) = rbx(alpha.low) * ( l0p1x - l0p0x ); */
    "movq %%rdi, %%rcx\n"
    "imulq %%rbx\n"
    "sarq %%rcx\n"
    /* rcx = denom >> 1; (preserved) */
    /* signbit = rdx(hitx.high) >> 63; */
    /* rdx:rax(hitxf) += ( ( denom >> 1 ) ^ signbit ) - signbit; */
    "movq %%rdx, %%rbp\n"
    "movq %%rcx, %%rsi\n"
    "sarq $63, %%rbp\n"
    "xorq %%rbp, %%rsi\n"
    "subq %%rbp, %%rsi\n"
    "movq %%rsi, %%rbp\n"
    "sarq $63, %%rsi\n"
    "addq %%rbp, %%rax\n"
    "adcq %%rsi, %%rdx\n"
    /* rdx:rax(hitx) = hitfx / rdi(denom.low) */
    /* rsi = -denom >> 1; */
    "idivq %%rdi\n"
    /* rax(hitx) += l0p0x; */
    "addq %3, %%rax\n"
    /* %1 = hitptx; (output) */
    "movq %%rax, %1\n"
    /* r12 = l0p1y - l0p0y; (preserved) */
    "movq %%r12, %%rax\n"
    /* rdx:rax(hitfy) = rbx(alpha.low) * ( l0p1y - l0p0y ); */
    "imulq %%rbx\n"
    /* rcx = denom >> 1; (preserved, destroyed) */
    /* signbit = rdx(hitx.high) >> 63; */
    /* rdx:rax(hitfy) += ( ( denom >> 1 ) ^ signbit ) - signbit; */
    "movq %%rdx, %%rbp\n"
    "sarq $63, %%rbp\n"
    "xorq %%rbp, %%rcx\n"
    "subq %%rbp, %%rcx\n"
    "movq %%rcx, %%rbp\n"
    "sarq $63, %%rcx\n"
    "addq %%rbp, %%rax\n"
    "adcq %%rcx, %%rdx\n"
    /* rdx:rax(hity) = hitfy / rdi(denom.low) */
    "idivq %%rdi\n"
    /* hity += l0p0y; */
    "addq %4, %%rax\n"
    /* %2 = hitpty; (output) */
    "movq %%rax, %2\n"
    /* eax(retval) = 1; (output register) */
    "movl $1, %%eax\n"
    /* exit */
    "3:\n"

    : "=a"(retval), "=&g"(hitptx), "=g"(hitpty)
    : "m" (l0p0x), "m" (l0p0y), "m" (l0p1x), "m" (l0p1y), "m" (l1p0x), "m" (l1p0y), "m" (l1p1x), "m" (l1p1y)
    : "%rbx", "%rcx", "%rdx", "%rdi", "%rsi", "%rbp", "%r9", "%r10", "%r11", "%r12", "%r13", "cc" );

  hitpt[0] = hitptx;
  hitpt[1] = hitpty;
  return retval;
}

(assembly now using %rbp instead of %r8, so it can only compile with -fomit-frame-pointer which is implied by all -O levels (that's faster, instructions using %r8-%r15 need an extra prefix byte))

EDIT: Stupid bug fix, forgot alpha can be negative. EDIT X: Code edited a couple times.

sergey-239 commented 1 year ago

@alexisnaveros quick note:

- beta  = imul128( l1p0x - l0p0x, l0p1y - l0p0y ) - imul128( l1p0y - l0p0y, l0p1x - l0p0x );
+ beta  = (imul128( l1p0x - l0p0x, l0p1y - l0p0y ) - imul128( l1p0y - l0p0y, l0p1x - l0p0x ) ^ xormask) - xormask;

I will look more thoroughly a bit later. Example: denom == -10, beta == -5 (values before negation)

alexisnaveros commented 1 year ago

Right, sorry, version above was buggy. I think it's now okay though I'm not proud of:

    highmask = MATH_INT128_GET_HIGH64( denom );
    if( highmask )
    {
      shift = 65 - __builtin_clzll( highmask );
      denom >>= shift;
      denomlow = (int64_t)denom;
    }
    else
    {
      shift = (uint64_t)( (int64_t)denom | ( MATH_INT128_GET_HIGH64( alpha ) ^ (int64_t)alpha ) ) >> 63;
      denomlow = (int64_t)denom;
      if( !shift )
        goto shiftzero;
      denomlow = (int64_t)( (uint64_t)denomlow >> shift );
    }
    alpha >>= shift;
    shiftzero:
    alphalow = (int64_t)alpha;

in case you have a better idea.