[4.4.1] CuraEngine regression tests fail on 32bit i386 Debian: ArcusCommunicationPrivateTest.cpp:250: Failure

df7cb commented 4 years ago

Application Version 4.4.1

Platform 32 bit i386 Debian Linux

Steps to Reproduce Building 4.4.1 and master as of e904e260716 on Debian unstable fails on various architectures (only amd64, mipsel, sparc64, and x32 are fine)

On 32-bit i386 the problem is:

[100%] Built target PolygonTest
make[2]: Leaving directory '/srv/debian/3d/cura-engine/cura-engine.git/obj-i686-linux-gnu'
/usr/bin/cmake -E cmake_progress_start /srv/debian/3d/cura-engine/cura-engine.git/obj-i686-linux-gnu/CMakeFiles 0
make[1]: Leaving directory '/srv/debian/3d/cura-engine/cura-engine.git/obj-i686-linux-gnu'
   dh_auto_test -O--buildsystem=cmake
    cd obj-i686-linux-gnu && make -j4 test ARGS\+=-j4
make[1]: Entering directory '/srv/debian/3d/cura-engine/cura-engine.git/obj-i686-linux-gnu'
Running tests...
/usr/bin/ctest --force-new-ctest-process -j4
Test project /srv/debian/3d/cura-engine/cura-engine.git/obj-i686-linux-gnu
      Start  1: BuildTests
      Start  2: GCodeExportTest
      Start  3: InfillTest
      Start  4: LayerPlanTest
 1/21 Test  #2: GCodeExportTest ..................   Passed    0.01 sec
      Start  5: MergeInfillLinesTest
 2/21 Test  #5: MergeInfillLinesTest .............   Passed    0.01 sec
      Start  6: TimeEstimateCalculatorTest
 3/21 Test  #6: TimeEstimateCalculatorTest .......   Passed    0.01 sec
      Start  7: SlicePhaseTest
 4/21 Test  #7: SlicePhaseTest ...................   Passed    0.35 sec
      Start  8: SettingsTest
 5/21 Test  #8: SettingsTest .....................   Passed    0.01 sec
      Start  9: ArcusCommunicationTest
 6/21 Test  #9: ArcusCommunicationTest ...........   Passed    0.01 sec
      Start 10: ArcusCommunicationPrivateTest
 7/21 Test #10: ArcusCommunicationPrivateTest ....***Failed    0.01 sec
[==========] Running 4 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 4 tests from ArcusCommunicationPrivateTest
[ RUN      ] ArcusCommunicationPrivateTest.ReadGlobalSettingsMessage
[       OK ] ArcusCommunicationPrivateTest.ReadGlobalSettingsMessage (1 ms)
[ RUN      ] ArcusCommunicationPrivateTest.ReadSingleExtruderSettingsMessage
[       OK ] ArcusCommunicationPrivateTest.ReadSingleExtruderSettingsMessage (0 ms)
[ RUN      ] ArcusCommunicationPrivateTest.ReadMultiExtruderSettingsMessage
[       OK ] ArcusCommunicationPrivateTest.ReadMultiExtruderSettingsMessage (1 ms)
[ RUN      ] ArcusCommunicationPrivateTest.ReadMeshGroupMessage
/srv/debian/3d/cura-engine/cura-engine.git/tests/arcus/ArcusCommunicationPrivateTest.cpp:250: Failure
Expected equality of these values:
  max_coords[i] - min_coords[i]
    Which is: 9900
  raw_max_coords[i] - raw_min_coords[i]
    Which is: 9898
/srv/debian/3d/cura-engine/cura-engine.git/tests/arcus/ArcusCommunicationPrivateTest.cpp:250: Failure
Expected equality of these values:
  max_coords[i] - min_coords[i]
    Which is: 9900
  raw_max_coords[i] - raw_min_coords[i]
    Which is: 9898
/srv/debian/3d/cura-engine/cura-engine.git/tests/arcus/ArcusCommunicationPrivateTest.cpp:250: Failure
Expected equality of these values:
  max_coords[i] - min_coords[i]
    Which is: 9900
  raw_max_coords[i] - raw_min_coords[i]
    Which is: 9899
[  FAILED  ] ArcusCommunicationPrivateTest.ReadMeshGroupMessage (1 ms)
[----------] 4 tests from ArcusCommunicationPrivateTest (3 ms total)

[----------] Global test environment tear-down
[==========] 4 tests from 1 test suite ran. (3 ms total)
[  PASSED  ] 3 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] ArcusCommunicationPrivateTest.ReadMeshGroupMessage

 1 FAILED TEST

      Start 11: AABBTest
 8/21 Test #11: AABBTest .........................   Passed    0.00 sec
      Start 12: AABB3DTest
 9/21 Test #12: AABB3DTest .......................   Passed    0.01 sec
      Start 13: IntPointTest
10/21 Test #13: IntPointTest .....................   Passed    0.00 sec
      Start 14: LinearAlg2DTest
11/21 Test #14: LinearAlg2DTest ..................   Passed    0.01 sec
      Start 15: MinimumSpanningTreeTest
12/21 Test #15: MinimumSpanningTreeTest ..........   Passed    0.01 sec
      Start 16: PolygonConnectorTest
13/21 Test #16: PolygonConnectorTest .............   Passed    0.02 sec
      Start 17: PolygonTest
14/21 Test #17: PolygonTest ......................   Passed    0.01 sec
      Start 18: PolygonUtilsTest
15/21 Test #18: PolygonUtilsTest .................   Passed    0.01 sec
      Start 19: SparseGridTest
16/21 Test #19: SparseGridTest ...................   Passed    0.01 sec
      Start 20: StringTest
17/21 Test #20: StringTest .......................   Passed    0.00 sec
      Start 21: UnionFindTest
18/21 Test #21: UnionFindTest ....................   Passed    0.00 sec
19/21 Test  #1: BuildTests .......................   Passed    1.06 sec
20/21 Test  #3: InfillTest .......................   Passed    6.58 sec
21/21 Test  #4: LayerPlanTest ....................   Passed   12.62 sec

95% tests passed, 1 tests failed out of 21

Total Test time (real) =  12.62 sec

The following tests FAILED:
     10 - ArcusCommunicationPrivateTest (Failed)
Errors while running CTest
make[1]: *** [Makefile:110: test] Error 8
make[1]: Leaving directory '/srv/debian/3d/cura-engine/cura-engine.git/obj-i686-linux-gnu'

Full build log at https://buildd.debian.org/status/fetch.php?pkg=cura-engine&arch=i386&ver=1%3A4.4.1-1&stamp=1580249085&raw=0

Other architectures are listed here: https://buildd.debian.org/status/logs.php?pkg=cura-engine&ver=1%3A4.4.1-1

Ghostkeeper commented 4 years ago

Yeah, we've heard this before. We target only x86-64 and only with GCC so it makes sense that some tests would be failing. It's most likely got to do with the precision of some of the default types like int, size_t or double. In theory we should be robust against that but in practice we do hit the limits sometimes, especially when it's got to do with something quadratic like the areas in those infill tests. Or in the highlighted case in this report, the float being used to store vertex coordinates.

We don't maintain other architectures or compilers because we don't have the resources for it, neither in manpower nor in hardware. We do accept fixes for it though, such as these:

onitake commented 4 years ago

The Debian i386 port targets Pentium CPUs and upwards, which means that it doesn't make use of MMX or SSE and generates x87 FPU instructions instead. Since the x86 FPU uses 80-bit floating point numbers internally, the math precision will probably be better than with SSE in many cases.

I don't think it's good practice to assume a certain precision (or lack thereof!) in unit tests. If an inaccuracy is expected, the results should be compared with a variable error margin. Or, integers should be used exclusively.

See here: https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/

That being said, I don't think these inaccuracies are serious, unless they add up in some way. I think it would be ok to make them acceptable in the unit tests. What do you think, @Ghostkeeper?
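The error-margin comparison suggested above could look roughly like the following. This is a minimal sketch, not code from CuraEngine or GoogleTest: the helper name `nearly_equal` and both tolerance parameters are hypothetical, and suitable tolerance values depend on the computation being tested.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Hypothetical helper (not part of CuraEngine or GoogleTest).
// Returns true if a and b differ by at most abs_tol, or by at most
// rel_tol times the larger of the two magnitudes.
bool nearly_equal(double a, double b, double abs_tol, double rel_tol)
{
    assert(abs_tol >= 0.0 && rel_tol >= 0.0);
    const double diff = std::fabs(a - b);
    if (diff <= abs_tol)
    {
        return true; // Covers values close to zero, where a relative check is meaningless.
    }
    const double largest = std::max(std::fabs(a), std::fabs(b));
    return diff <= rel_tol * largest;
}
```

With tolerances of `1e-12` (absolute) and `1e-9` (relative), `nearly_equal(0.1 + 0.2, 0.3, ...)` holds, while the extents 9900 and 9898 from the failing test are still reported as different. Whether a two-unit deviation should be tolerated there is a separate question.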

Ghostkeeper commented 4 years ago

Yeah the floating point rounding errors are usually handled by using GoogleTest's EXPECT_FLOAT_EQ instead of just the normal EXPECT_EQ.

However in this case we're comparing the integer coordinates that result from a computation that started with the floating point vertex coordinates from the cube_vertices.txt file. Apparently those floating point rounding errors became significant enough to be several microns off, so perhaps it is adding up somehow, here.

df7cb commented 4 years ago

I worked around this in Debian with this patch for now:

--- a/tests/arcus/ArcusCommunicationPrivateTest.cpp
+++ b/tests/arcus/ArcusCommunicationPrivateTest.cpp
@@ -247,7 +247,11 @@ TEST_F(ArcusCommunicationPrivateTest, Re
     // - Then, just compare:
     for (int i = 0; i < 3; ++i)
     {
+#ifdef __i386__
+        EXPECT_LE(abs((max_coords[i] - min_coords[i]) - (raw_max_coords[i] - raw_min_coords[i])), 2);
+#else
         EXPECT_EQ(max_coords[i] - min_coords[i], raw_max_coords[i] - raw_min_coords[i]);
+#endif
     }
 }

Ghostkeeper commented 4 years ago

Not sure if that's really a solution though. This seems to correctly give an error: Floating point rounding errors shouldn't have the order of magnitude of a micron.

To measure this, look at this calculator online: https://www.h-schmidt.net/FloatConverter/IEEE754.html Assuming that most printers are below 500mm wide, I fill in the value of 499.90001 and it rounds it up to 499.9000244. If I fill in a value of 499.900009 it rounds it down to 499.8999939. So around the 500mm position, a 32-bit floating point would have up to about 0.000015mm rounding errors, or 0.015 micrometres. Why would it be a whole micrometre off?
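The same estimate can be reproduced without the online calculator. The sketch below assumes IEEE 754 single precision for `float`; the function names are made up for illustration.

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable float above it (one ULP at x).
float ulp_above(float x)
{
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// Worst-case round-to-nearest error, in micrometres, when a coordinate
// given in millimetres is stored in a float.
double max_rounding_error_um(float x_mm)
{
    const double ulp_mm = static_cast<double>(ulp_above(x_mm));
    return 0.5 * ulp_mm * 1000.0; // Half an ULP, converted from mm to micrometres.
}
```

Around 499.9 mm the spacing between adjacent floats is 2^-15 mm, so `max_rounding_error_um(499.9f)` comes out at roughly 0.015 micrometres, matching the figure above.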

I think the test is correct, but there's just a bug in the software there.

onitake commented 1 year ago

So, this is still an issue for Cura 5.x, but we're getting the errors in different unit tests now. Namely, in PolygonConnectorTest when it tests getDist2BetweenLineSegments, and in IntPointTest.TestRotationMatrix.

21: [ RUN      ] PolygonConnectorTest.getBridgeNestedSquares
21: ./tests/utils/PolygonConnectorTest.cpp:71: Failure
21: Expected equality of these values:
21:   LinearAlg2D::getDist2BetweenLineSegments(bridge->a.from_point, bridge->a.to_point, bridge->b.from_point, bridge->b.to_point)
21:     Which is: 9801
21:   100 * 100
21:     Which is: 10000
21: The bridges should be spaced 1 line width (100 units) apart.
...
18: [ RUN      ] IntPointTest.TestRotationMatrix
18: ./tests/utils/IntPointTest.cpp:24: Failure
18: Expected equality of these values:
18:   rotated_in_place
18:     Which is: (11,20)
18:   rotated_in_place_2
18:     Which is: (10,20)
18: Matrix composition with translate and rotate failed.

I haven't found the source of the problem in getDist2BetweenLineSegments yet, but for IntPointTest, it is in PointMatrix.apply. In this function, we have several multiply+add operations with double and long long operands, followed by truncation to long long.

I initially thought it had something to do with rounding modes that are slightly different between the FPU (baseline for i686) and SSE2 (baseline for amd64). But after some investigation, it seems that implicit truncation is used in both cases. However, it is very likely that the multiplication with the rotation matrix produces values that are either slightly below or slightly above the target value. When the result is truncated (basically, rounded towards zero), this can lead to off-by-one errors if the result is slightly below the target (for positive values; the opposite holds for negative values). If the result is slightly above, truncation yields the correct integer result.
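To make the difference concrete, here is a small sketch comparing truncation with rounding to nearest. The two function names are hypothetical. `std::llround` rounds halfway cases away from zero, whereas `nearbyint` follows the current rounding mode, but for values that are only a few ULPs away from an integer both give the same answer.

```cpp
#include <cmath>

// What an implicit double -> long long conversion does: round towards zero.
long long truncate_to_int(double v)
{
    return static_cast<long long>(v);
}

// Explicit rounding to the nearest integer.
long long round_to_int(double v)
{
    return std::llround(v);
}
```

For a value just below 11, such as `10.999999999999998`, `truncate_to_int` returns 10 while `round_to_int` returns 11. For `-10.999999999999998` the results are -10 and -11. For a value just above 11, both return 11.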

Now, I was able to verify this easily: When I rounded the results explicitly with a call to nearbyint(), the unit test no longer failed on i686:

#include <cmath> //For nearbyint.

    Point3 apply(const Point3 p) const
    {
        return Point3(nearbyint(p.x * matrix[0] + p.y * matrix[1] + p.z * matrix[2])
                    , nearbyint(p.x * matrix[3] + p.y * matrix[4] + p.z * matrix[5])
                    , nearbyint(p.x * matrix[6] + p.y * matrix[7] + p.z * matrix[8]));
    }

What really baffles me, though, is that the compiler doesn't choke on the code in this function. The components of the affine matrix are doubles, while the operand and the return value contain long long ints. Shouldn't this trigger a type mismatch? class Point only has a constructor that accepts long longs, so I don't see how this obvious loss of precision can possibly be valid C++ code.

Unfortunately, the above doesn't fix the error in getDist2BetweenLineSegments, so there are probably other rounding issues lurking elsewhere.

jellespijker commented 1 year ago

I wouldn't spend too much effort here. Arcus is likely to be deprecated in the near future, since we're implementing a CuraEngine plugin system based on gRPC. See https://github.com/Ultimaker/CuraEngine/pull/1878

Those gRPC services will probably replace the functionality that Arcus provides.

It is unlikely that we will put in the work to fix anything other than a blocking bug in libArcus.

onitake commented 1 year ago

@jellespijker Ok, but is this really an issue with Arcus? It seems like a fundamental issue within CuraEngine, in that integer math is used in most places while floating point calculations are still done for some things (like rotation and length calculation). Are you going to deprecate this functionality in CuraEngine?

onitake commented 1 year ago

@Ghostkeeper I don't know if you're still working on this project, but perhaps you could still have a look and provide some wisdom?

By the way, I think the more significant deviations we're seeing in getDist2BetweenLineSegments are simply an accumulation of off-by-one truncation errors in each integer result being added together, but I'll have to verify this.
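As a plain arithmetic note on the numbers reported above: 9801 is 99 squared, so a length that is one unit short of 100 already explains a gap of 199 in the squared distance. The sketch below only illustrates that arithmetic. It does not show where the error arises in CuraEngine, and both function names are hypothetical.

```cpp
#include <cassert>

long long squared(long long v)
{
    return v * v;
}

// Error in a squared distance when the underlying length is short by length_error.
long long dist2_error(long long length, long long length_error)
{
    return squared(length) - squared(length - length_error);
}

// Checks a squared distance against an expected length with a tolerance
// expressed in length units rather than squared units.
bool dist2_within(long long dist2, long long expected_length, long long tolerance)
{
    assert(tolerance >= 0 && expected_length >= tolerance);
    return dist2 >= squared(expected_length - tolerance)
        && dist2 <= squared(expected_length + tolerance);
}
```

Here `dist2_error(100, 1)` is 199, and `dist2_within(9801, 100, 1)` holds, which is why a tolerance on squared quantities has to be much larger than the tolerance on the lengths themselves.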

jellespijker commented 1 year ago

@Ghostkeeper is no longer working for UltiMaker.

The point I was trying to convey is that this library has been "stable" for quite some time. Its only function is to transfer the mesh and settings to the engine and the resulting gcode back to the front-end.

If that functionality doesn't break we won't put in the work in this library.

Now, with respect to float and integer type usage in CuraEngine: yes, there are mixed types and therefore rounding errors. Some of this happens due to our inconsistent style, some because dependencies such as boost geometry need to work with floating point types when calculating the Voronoi diagrams.

We are working on unifying that, slowly but steadily, with modern types, concepts and static analysis.

Once we switch to the gRPC services for communication between Cura and CuraEngine, these message types will be under scrutiny once again. This is something that will probably happen in Cura 5.5 or 5.6.

Hence my suggestion to spend time wisely, and maybe not on unit tests in these libraries for architectures that UltiMaker no longer supports. We simply won't prioritize that over all the other work that is still on our backlog.

jellespijker commented 1 year ago

@onitake I only just noticed that this issue is in CuraEngine. I read your comments on my phone, where the title was cut off to "Arcus...", so I was under the impression that this discussion happened in the libArcus repository over a failing unit test.

My apologies.

That being said, most of my previous statements still stand:

We don't have the resources to put in the work for older architectures.
We're aware of the mix of floating point and integer types.
We're modernizing our fundamental types, preferring concepts and auto, such as std::integral auto and std::floating_point auto, over concrete types such as double, float and coord_t, and letting the compiler figure it out for us.
We have a couple of tickets on the backlog to change the polygons and points to standard container types, and to ensure that these types are recognized by dependencies such as boost geometry, so that we don't have to switch between floating point and integer types.

onitake commented 1 year ago

Thanks @jellespijker , that sounds indeed promising.

In the meantime, we'll probably "fix" the rounding issues on i686 with a Debian-only workaround, so we can get it off the ground again. I still wonder why the compiler doesn't complain about the silent truncation from double to long long, though. Maybe there is something else amiss here. I'll keep investigating.

And it's sad to hear that @Ghostkeeper is no longer working on Cura... I believe a large part of the project was contributed by them?

onitake commented 1 year ago

So, it turns out that implicit conversion from double to long long is actually standard behavior in C++ (and C). For the definition, see https://en.cppreference.com/w/cpp/language/implicit_conversion (section Floating–integral conversions)

Here's an example that illustrates this:

#include <cstdio>
#include <cmath>

long long mul(long long a, double b) {
    return a * b;
}

long long rmul(long long a, double b) {
    return std::llrint(a * b);
}

volatile long long a = 10;
volatile double b = 0.499999999999999972;
volatile double c = 0.499999999999999973;

int main(int argc, char **argv) {
    long long d = mul(a, b);
    long long e = mul(a, c);
    long long f = rmul(a, b);
    long long g = rmul(a, c);
    printf("a=%lld b=%f c=%f d=%lld e=%lld f=%lld g=%lld\n", a, b, c, d, e, f, g);
    return 0;
}

The values of b and c are represented by 0x3fdfffffffffffff and 0x3fe0000000000000, respectively. As you can see, b is one ULP below 0.5, while c is converted to 0.5. This results in d=4 (truncated) and e=5 (exact). No error is raised, despite the loss of precision. With additional rounding, both results will be 5.
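Since the conversion is legal, the compiler only reports it when asked to. GCC's `-Wconversion` and `-Wfloat-conversion` warnings flag it, and brace initialization such as `long long x{some_double};` is rejected as a narrowing conversion. Another option is to reject floating point arguments in the type itself. The following is only a sketch of that idea: `LoosePoint` and `StrictPoint` are hypothetical stand-ins, not CuraEngine's actual Point class.

```cpp
#include <type_traits>

// Mirrors the situation described above: only an integer constructor exists,
// yet double arguments are silently accepted and truncated.
struct LoosePoint
{
    long long x, y;
    LoosePoint(long long x_, long long y_) : x(x_), y(y_) {}
};

// Hypothetical stricter variant: any call that passes a floating point
// argument selects the deleted overload and fails to compile.
struct StrictPoint
{
    long long x, y;
    StrictPoint(long long x_, long long y_) : x(x_), y(y_) {}

    template<typename A, typename B,
             typename std::enable_if<std::is_floating_point<A>::value
                                  || std::is_floating_point<B>::value, int>::type = 0>
    StrictPoint(A, B) = delete;
};

static_assert(std::is_constructible<LoosePoint, double, double>::value,
              "implicit double -> long long conversion is accepted");
static_assert(!std::is_constructible<StrictPoint, double, double>::value,
              "floating point arguments are rejected");
static_assert(std::is_constructible<StrictPoint, int, long long>::value,
              "integer arguments still work");
```

With the strict variant, callers have to choose the rounding explicitly, for example `StrictPoint(std::llrint(a), std::llrint(b))`, which makes every truncation-versus-rounding decision visible at the call site.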

onitake commented 1 year ago

I pushed a PR with the rounding changes I made to fix the unit tests. This will basically be the patch we're going to use in Debian for now, and perhaps it will help someone else facing similar issues or serve as future reference for other rounding/type conversion changes.

You can close it if you think it's not appropriate, I just wanted to put it here for completeness' sake.

Ultimaker / CuraEngine

[4.4.1] CuraEngine regression tests fail on 32bit i386 Debian: ArcusCommunicationPrivateTest.cpp:250: Failure #1192