AngusJohnson / Clipper2

Polygon Clipping and Offsetting - C++, C# and Delphi
Boost Software License 1.0
1.54k stars 279 forks

Unexpected collinearity test failure on M1 macOS #777

Closed kintel closed 7 months ago

kintel commented 9 months ago

I'm experiencing what looks like unexpected collinearity check failures (tested on main as well as the latest release). When unioning the two triangles below, the result is two triangles, but it should be a single triangle, as the vertices on the X axis are clearly collinear. I'm also not using the full 64-bit range (just approx. 31 bits), so this should be well within Clipper's operating range.

A failing C++ test case:

#include <gtest/gtest.h>
#include "clipper2/clipper.h"
using namespace Clipper2Lib;
TEST(Clipper2Tests, TestUnionIssue) {
    Paths64 subject;
    subject.push_back(MakePath({0, -453054451, 0, -433253797, -455550000, 0}));
    subject.push_back(MakePath({0, -433253797, 0, 0, -455550000, 0}));
    Clipper64 clipper;
    clipper.PreserveCollinear(false);
    clipper.AddSubject(subject);
    Paths64 solution;
    clipper.Execute(ClipType::Union, FillRule::NonZero, solution);
    ASSERT_EQ(solution.size(), 1);
    EXPECT_EQ(solution[0].size(), 3);
    EXPECT_EQ(IsPositive(subject[0]), IsPositive(solution[0]));
}

Note: This is a minimal failing example. What I'm actually trying to achieve is to project a 3D cone onto one axis and union all the individual triangles until I get a single triangular silhouette.

kintel commented 9 months ago

Update: I wrote the above late last night. With a clearer head, I realize that this isn't a collinearity check failure, it's a straight-up union failure. The result of unioning two neighboring triangles should be a single contour.

AngusJohnson commented 9 months ago

I can't duplicate this problem.

  Paths64 subject;
  subject.push_back(MakePath({0, -453054451, 0, -433253797, -455550000, 0}));
  subject.push_back(MakePath({0, -433253797, 0, 0, -455550000, 0}));
  Clipper64 clipper;
  clipper.PreserveCollinear(false);
  clipper.AddSubject(subject);
  Paths64 solution;
  clipper.Execute(ClipType::Union, FillRule::NonZero, solution);
  std::cout << solution << std::endl;

0,0, -455550000,0, 0,-453054451

kintel commented 9 months ago

Thanks for the feedback. I've only tested this on macOS aarch64. I'll try x86 Linux as well, in case this differs by arch, OS, or compiler.

kintel commented 9 months ago

Works on x86 Linux, fails on M1 macOS. std::cout << solution << std::endl; gives me:

0,-433253797,0, -455550000,0,0, 0,-453054451,0
0,0,0, -455550000,0,0, 0,-433253797,0

kintel commented 9 months ago

PS: I see that GitHub now has a macos-14 CI runner in beta, which runs on M1. I haven't tested it, but it could be a quick path to setting up a repro if you don't have access to a physical Mac. Let me know if I can assist with anything.

olologin commented 9 months ago

Just my 5 cents, maybe this can help you somehow, because I have some experience working with M1 Macs. Here are some potential differences that can cause trouble:

In your place, I would first try to find the trigonometric function that introduces the difference, though I'm not sure if Clipper uses any for simple clipping (I believe it uses sin and cos for offsetting, but we are not talking about offsetting). A second option is to add -ffp-contract=off and check whether the issue goes away. If it does, some multiply-add operation is producing a slightly different result, and that reveals the bug.

ADDED: Forgot to mention: most of the time, all issues except the first two are reproducible on any CPU and OS with a Clang + libc++ build. You could try reproducing it like this as well; maybe it will work.

kintel commented 9 months ago

Also, while discussing robustness across CPUs and compilers, it would be interesting to do a few things:

olologin commented 9 months ago

Also, another thing: on macOS, the FP settings of the parent thread (rounding mode, exception mask) are not inherited by a new std::thread; I believe this is also an implementation detail of libc++. @kintel Maybe you are calling Clipper from a new thread and your rounding mode is not properly configured?

kintel commented 9 months ago

This issue shows up in Clipper's unit tests when I add the new test case, so it shouldn't be related to threading issues.

olologin commented 9 months ago

Yep, -ffp-contract=off hides the issue. Just tested it on an M1 Mac at work, with the latest AppleClang :)

This fails: cmake ../ -GNinja -DCMAKE_BUILD_TYPE=Debug && ninja ClipperTests && ./ClipperTests --gtest_filter=Clipper2Tests.TestUnionIssue

This succeeds: cmake ../ -GNinja -DCMAKE_CXX_FLAGS="-ffp-contract=off" -DCMAKE_BUILD_TYPE=Debug && ninja ClipperTests && ./ClipperTests --gtest_filter=Clipper2Tests.TestUnionIssue

Maybe it is reproducible with stock Clang on an ARM CPU, but I could not reproduce it with Clang 16 on an x64 CPU.

olologin commented 9 months ago

So it is the fnmsub instruction in the M1 build that causes the difference, and it is the CrossProduct function that ends up containing this opcode. I added some debug output to it and ran the test:

  template <typename T>
  inline double CrossProduct(const Point<T>& pt1, const Point<T>& pt2, const Point<T>& pt3) {
    const double d = (static_cast<double>(pt2.x - pt1.x) * static_cast<double>(pt3.y -
      pt2.y) - static_cast<double>(pt2.y - pt1.y) * static_cast<double>(pt3.x - pt2.x));

    std::cout.precision(20);
    std::cout << "Inputs:" << std::endl;
    std::cout << pt1.x << ", " << pt1.y << std::endl;
    std::cout << pt2.x << ", " << pt2.y << std::endl;
    std::cout << pt3.x << ", " << pt3.y << std::endl;
    std::cout << "Cross product:" << std::endl;
    std::cout << d << std::endl;
    return d;
  }

And here is the difference between the M1 build (on the left) and the AMD64 build (on the right): https://www.diffchecker.com/kaHSyAnl/

I guess that to reproduce this problem on normal AMD64 hardware, Angus could try feeding in the results from the left side of the diff.

kintel commented 9 months ago

I was wondering why this was not defaulting to =off and found this: https://releases.llvm.org/14.0.0/tools/clang/docs/ReleaseNotes.html#floating-point-support-in-clang

olologin commented 9 months ago

A better way to reproduce the issue on a non-M1 Mac: somehow, with std::fma, different CPUs produce the same result, and it starts to fail on AMD64 the same way as on the M1 Mac. Magic.

  template <typename T>
  inline double CrossProduct(const Point<T>& pt1, const Point<T>& pt2, const Point<T>& pt3) {
    const auto a = static_cast<double>(pt2.x - pt1.x);
    const auto b = static_cast<double>(pt3.y - pt2.y);
    const auto c = static_cast<double>(pt2.y - pt1.y);
    const auto d = static_cast<double>(pt3.x - pt2.x);
    //const double e = (a * b - c * d);
    const double e = std::fma(a, b, -c * d);

    std::cout.precision(20);
    std::cout << "Inputs:" << std::endl;
    std::cout << pt1.x << ", " << pt1.y << std::endl;
    std::cout << pt2.x << ", " << pt2.y << std::endl;
    std::cout << pt3.x << ", " << pt3.y << std::endl;
    std::cout << "Cross product:" << std::endl;
    std::cout << e << std::endl;
    return e;
  }

olologin commented 9 months ago

I believe there are two separate questions:

  1. Clipper produces an incorrect result because of FP contractions, but could this issue show up on slightly different input even without FP contractions?
  2. Should Clipper try to produce the same results on different CPUs? In that case we have to either recommend that people turn off FP contractions for ARM CPUs (this is what we do at work), or manually go over all fmadd and fnmsub instructions (~20 in total currently) and force the corresponding expressions in code to use std::fma, so the optimizer cannot emit fmadd and fnmsub on ARM CPUs only.

kintel commented 9 months ago

In terms of 2: yes, that would be nice, especially with Clipper being used as part of a CAD kernel. Having to worry about which CPU a design is evaluated on would be suboptimal.

AngusJohnson commented 9 months ago

Thanks for all the suggestions. However, please note that

  1. the clipping engine doesn't use any trigonometric or logarithmic functions
  2. the library should be robust even when different compilers generate slightly different floating point values. The worst that should happen is that an intersection coordinate may shift by at most 1 unit in either or both of x and y.
  3. I've made no promises that touching segments will be joined, just that the library makes every reasonable effort to do so 😁.

So with that preamble, am I right in understanding that it's the CrossProduct function used to test for collinearity that is problematic on some Mac machines?

If this is so, I suggest that there's a weakness in the Mac float math routines, since Clipper's CrossProduct test of two identical segments should return 0. The math in the CrossProduct function is very simple (5 subtractions and 2 multiplications), though admittedly there are also the type conversions from int64 to double (which avoid potential integer overflow). Clipper could only avoid the need for this type conversion either by limiting coordinate ranges (x and y values between ±2,147,483,648) or by introducing 128-bit math, and the latter would significantly slow clipping.

kintel commented 9 months ago

In terms of 3: it would be immensely beneficial to have some sort of guarantee for certain, severely limited, coordinate ranges.

@olologin dug a bit deeper than me, but I want to highlight two things:

  1. In my failing example, the coordinate values were well within the mentioned range: [-450M, 0] in my case.
  2. Clipper1 works perfectly for this example, using a similar range of values.

AngusJohnson commented 9 months ago

  1. In my failing example, the coordinate values were well within the mentioned range: [-450M, 0] in my case.

It would significantly reduce performance if Clipper were to test coordinate ranges prior to clipping (in order to use different functions depending on those ranges). This isn't something I'm currently prepared to consider.

  2. Clipper1 works perfectly for this example, using a similar range of values.

Clipper1 resorts to 128-bit math, which does significantly impede performance.

kintel commented 9 months ago

It would significantly reduce performance if Clipper were to test coordinate ranges prior to clipping (in order to use different functions depending on those ranges). This isn't something I'm currently prepared to consider.

Understandable :) I had hoped that keeping my values below 32 bits would buy me enough headroom for temporary results to fit within int64.

Interestingly, if I shave off one more bit and limit my coordinates to a 30-bit space, the union operation works. In the event that this compiler/CPU issue doesn't get resolved, perhaps a workaround could be to give some recommendation on coordinate ranges. I see the current recommendation is to stay within ~10^15; staying within 10^9 seems to work in this scenario.

Anyway, I expect more investigation is needed before concluding anything.

olologin commented 9 months ago

@kintel I wonder if defining CLIPPER2_HI_PRECISION helps

kintel commented 9 months ago

Turning on CLIPPER2_HI_PRECISION yields the same failure, and also makes Clipper2Tests.TestMultiplePolygons fail.

Note: This option is technically implemented in the CMakeLists, but doesn't work: it doesn't correctly pass the boolean value on to the macro, and it also isn't propagated to clients linking against Clipper (like the tests). I tested by simply wrapping the code path in question in #if 1.

AngusJohnson commented 9 months ago

Perhaps one way around this issue might be the following untested code. Instead of directly calling CrossProduct() to test for collinearity, do this:

inline bool IsCollinear(const Point64& pt1, const Point64& sharedPt, const Point64& pt2)
{
#ifdef __APPLE__
  double cp = CrossProduct(pt1, sharedPt, pt2);
  return std::fabs(cp) < 0.00000001;
#else
  return CrossProduct(pt1, sharedPt, pt2) == 0;
#endif
}

kintel commented 8 months ago

One comment: We need more tests, but my understanding based on the above discussion is that this is a compiler issue (clang) for a specific instruction set (aarch64), so we should really check for that, not __APPLE__. Perhaps just __aarch64__ is a good start.

AngusJohnson commented 8 months ago

One comment: We need more tests, but my understanding based on the above discussion is that this is a compiler issue (clang) for a specific instruction set (aarch64), so we should really check for that, not __APPLE__. Perhaps just __aarch64__ is a good start.

I'm happy to change __APPLE__ to __aarch64__.

However, I'm having to rely on others to test this, since this isn't an issue with the Visual Studio compiler I'm using. Evidently, according to your OP above, your compiler (and perhaps others too) returns only a near-zero (float) value when assessing the collinearity of the following vertices: {0, -453054451}, {0, -433253797}, {0, 0} and/or {0, 0}, {-455550000, 0}, {-455550000, 0}

kintel commented 8 months ago

I'll try two things:

  1. Verify that my Mac-related issues are fixed in my environment
  2. See if I can get a test running with Linux, clang and aarch64. If I can reproduce it there, I'll reopen this.

Until then, let's consider this closed.

olologin commented 8 months ago

Just my 2 cents: I wouldn't say it's a compiler or hardware issue; I think it's a computational-stability problem. As I showed before, you can reproduce this on x86 if you use std::fma, which gives a slightly different result. Nothing prevents future versions of compilers from rearranging FP instructions in a way that produces results similar to std::fma.

Although, knowing how hard it is to estimate precision and write computationally stable code, I can't really blame Angus for closing this :) It might be easier to wait until someone hits it on a much more common configuration.

kintel commented 8 months ago

Oh, right, I forgot about that detail. Perhaps the ideal solution (if anyone has time to build and test it) would be to perform a compiler check before building and adjust the code accordingly?

AngusJohnson commented 7 months ago

This issue has reappeared in the latest commit, and I can't see why it has done so. Since I don't have any Apple devices I can't realistically debug it, so perhaps someone who does can shed light on this.

olologin commented 7 months ago

@AngusJohnson Did you try to reproduce it like this: https://github.com/AngusJohnson/Clipper2/issues/777#issuecomment-1954339748 ? Because after that change it was failing on x86 the same way as on macOS.

AngusJohnson commented 7 months ago

@AngusJohnson Did you try to reproduce it like this: [#777 (comment)]

No I didn't. Sorry. And looking at this problem again I can now see a very simple solution.

  inline bool IsCollinear(const Point64& pt1, 
    const Point64& sharedPt, const Point64& pt2) // #777
  {
    const auto a = static_cast<double>(sharedPt.x - pt1.x);
    const auto b = static_cast<double>(pt2.y - sharedPt.y);
    const auto c = static_cast<double>(sharedPt.y - pt1.y);
    const auto d = static_cast<double>(pt2.x - sharedPt.x);
    return a * b == c * d;
  }

kintel commented 7 months ago

Thanks! I'll resume my clipper2 migration soon and will report back.

Btw, it's possible to ssh into GitHub runners. As long as you're comfortable debugging on the cmd-line that's a possible option for accessing HW/SW you don't have physical access to.

AngusJohnson commented 7 months ago

Btw, it's possible to ssh into GitHub runners. As long as you're comfortable debugging on the cmd-line that's a possible option for accessing HW/SW you don't have physical access to.

Sadly this is beyond my current capabilities 😢.

reunanen commented 7 months ago

Regarding this:

const auto a = static_cast<double>(sharedPt.x - pt1.x);
const auto b = static_cast<double>(pt2.y - sharedPt.y);
const auto c = static_cast<double>(sharedPt.y - pt1.y);
const auto d = static_cast<double>(pt2.x - sharedPt.x);
return a * b == c * d;

While it looks like a great improvement, I was wondering why there's still a cast to double, as opposed to no casting at all (or explicit casts to int64_t).

I guess so that a * b or c * d would not overflow? That makes sense, but instead there may be rounding (when a number exceeds 2^53), so the calculation may not really work:

// an integer not representable by double
const int64_t i = 9007199254740993;

const double a = i;
const double b = i * 100;
const double c = i * 10;
const double d = i * 10;

printf(a * b == c * d ? "equal" : "NOT equal");

This prints NOT equal, which wouldn't happen if double could represent integers of this magnitude accurately.

I'm not sure what kinds of values IsCollinear typically gets. It's possible that casting to double is the better tradeoff here (rather than using int64_t), although my gut feeling suggests otherwise. So if it's a conscious design decision, I'm fine with it. But if not, I might propose this instead (because, at least currently, the input coordinates are always 64-bit integers):

const auto a = sharedPt.x - pt1.x;
const auto b = pt2.y - sharedPt.y;
const auto c = sharedPt.y - pt1.y;
const auto d = pt2.x - sharedPt.x;
return a * b == c * d;

AngusJohnson commented 7 months ago

I guess so that a * b or c * d would not overflow?

That is indeed the only reason.

That makes sense, but instead there may be rounding (when a number exceeds 2^53), so the calculation may not really work.

Isn't limiting coordinate ranges to ±2^53 (~10^16) much better than limiting them to ±std::sqrt(std::numeric_limits<int64_t>::max()) (~10^9), as you seem to be proposing?

reunanen commented 7 months ago

I'm not sure; that's what I was thinking as well. But in the above example (starting with const int64_t i = 9007199254740993;), for instance, using int64_t gives the right answer when double does not. I think integer overflow will never change a true statement (a * b == c * d) to false (because both sides will just overflow equally), but it might do the opposite (if a * b overflows and c * d does not, they might end up looking equal). Not sure which kind of error (true→false or false→true) is worse in this specific use case.

Anyway, double is 64-bit just like int64_t, so it's not as if it can represent more numbers faithfully. (In fact, I believe it can represent fewer integers accurately, because it must also represent non-integers.)

But anyway I believe that with doubles there can be rounding even when the result of the multiplication exceeds 2^53, meaning that the reliable input range (of the coordinates themselves) is much lower.

AngusJohnson commented 7 months ago

But anyway I believe that with doubles there can be rounding even when the result of the multiplication exceeds 2^53

Ah, that now makes sense.

Not sure which kind of error (true→false or false→true) is worse in this specific use case.

I suggest that potential errors when testing for collinearity with very large coordinate values aren't critical, though of course this should be documented. The only alternative AFAICS is to use int128 (which is what I did in Clipper1), but this would again significantly affect performance.

Anyhow, you have persuaded me against casting to double. However, I'm not sure how the various compilers will respond when there are overflow errors. In Delphi, I would enclose the function in {$OVERFLOWCHECKS OFF} ... {$OVERFLOWCHECKS ON} directives.

reunanen commented 7 months ago

However, I'm not sure how the various compilers will respond when there are overflow errors.

Yeah, that's a good point. I think at minimum, there should be a test case that gets executed as part of the GitHub Actions tests. I could try and add one for C++?

reunanen commented 7 months ago

As for C++, I have updated the PR to use unsigned integers, for which the result is well-defined: the values just wrap (regardless of compiler settings, it seems), which I believe is exactly what we want. I have also added a simple test case (which would fail for the current version that uses double).

reunanen commented 7 months ago

The only alternative AFAICS is to use int128 (which is what I've done in Clipper1), but this would again significantly affect performance.

I'm guessing it could be a separate code branch that gets invoked only when really needed. But even that would of course affect performance (at least a bit), not to mention make the code more complex.