Closed df7cb closed 1 year ago
Yeah, we've heard this before. We target only x86-64 and only with GCC so it makes sense that some tests would be failing. It's most likely got to do with the precision of some of the default types like int
, size_t
or double
. In theory we should be robust against that but in practice we do hit the limits sometimes, especially when it's got to do with something quadratic like the areas in those infill tests. Or in the highlighted case in this report, the float
being used to store vertex coordinates.
We don't maintain other architectures or compilers because we don't have the resources for it, neither in manpower nor in hardware. We do accept fixes for it though, such as these:
The Debian i386 targets Pentium CPUs and upwards, which means that it doesn't make use of MMX or SSE technology, instead generating FPU instructions. Since the x86 FPU uses 80-bit floating point numbers internally, the math precision will be probably better than for SSE in many cases.
I don't think it's good practice to assume a certain precision (or lack thereof!) in unit tests. If a inaccuracy is expected, the results should be compared with a variable error margin. Or, integers should be used exclusively.
See here: https://randomascii.wordpress.com/2012/02/25/comparing-floating-point-numbers-2012-edition/
That being said, I don't think these inaccuracies are serious, unless they add up in some way. I think it would be ok to make them acceptable them in the unit tests. What do you think, @Ghostkeeper ?
Yeah the floating point rounding errors are usually handled by using GoogleTest's EXPECT_FLOAT_EQ
instead of just the normal EXPECT_EQ
.
However in this case we're comparing the integer coordinates that result from a computation that started with the floating point vertex coordinates from the cube_vertices.txt
file. Apparently those floating point rounding errors became significant enough to be several microns off, so perhaps it is adding up somehow, here.
I worked around this in Debian with this patch for now:
--- a/tests/arcus/ArcusCommunicationPrivateTest.cpp
+++ b/tests/arcus/ArcusCommunicationPrivateTest.cpp
@@ -247,7 +247,11 @@ TEST_F(ArcusCommunicationPrivateTest, Re
// - Then, just compare:
for (int i = 0; i < 3; ++i)
{
+#ifdef __i386__
+ EXPECT_LE(abs((max_coords[i] - min_coords[i]) - (raw_max_coords[i] - raw_min_coords[i])), 2);
+#else
EXPECT_EQ(max_coords[i] - min_coords[i], raw_max_coords[i] - raw_min_coords[i]);
+#endif
}
}
Not sure if that's really a solution though. This seems to correctly give an error: Floating point rounding errors shouldn't have the order of magnitude of a micron.
To measure this, look at this calculator online: https://www.h-schmidt.net/FloatConverter/IEEE754.html Assuming that most printers are below 500mm wide, I fill in the value of 499.90001 and it rounds it up to 499.9000244. If I fill in a value of 499.900009 it rounds it down to 499.8999939. So around the 500mm position, a 32-bit floating point would have up to about 0.000015mm rounding errors, or 0.015 micrometres. Why would it be a whole micrometre off?
I think the test is correct, but there's just a bug in the software there.
So, this is still an issue for Cura 5.x, but we're getting the errors in different unit tests now.
Namely, in PolygonConnectorTest
when it tests getDist2BetweenLineSegments
, and in IntPointTest.TestRotationMatrix
.
21: [ RUN ] PolygonConnectorTest.getBridgeNestedSquares
21: ./tests/utils/PolygonConnectorTest.cpp:71: Failure
21: Expected equality of these values:
21: LinearAlg2D::getDist2BetweenLineSegments(bridge->a.from_point, bridge->a.to_point, bridge->b.from_point, bridge->b.to_point)
21: Which is: 9801
21: 100 * 100
21: Which is: 10000
21: The bridges should be spaced 1 line width (100 units) apart.
...
18: [ RUN ] IntPointTest.TestRotationMatrix
18: ./tests/utils/IntPointTest.cpp:24: Failure
18: Expected equality of these values:
18: rotated_in_place
18: Which is: (11,20)
18: rotated_in_place_2
18: Which is: (10,20)
18: Matrix composition with translate and rotate failed.
I haven't found the source of the problem in getDist2BetweenLineSegments
yet, but for IntPointTest
, it is in PointMatrix.apply
. In this function, we have several multiply+add operations with double and long long operands, followed by truncation to long long.
I initially thought it had something to do with rounding modes that are slightly different between the FPU (baseline for i686) and SSE2 (baseline for amd64). But, after some investigation, it seems like implicit truncation is used in both cases. However, it is very likely that the result of the multiplication with the rotation matrix results in values that either slightly below or above the target value. When truncating the result (basically, round towards zero), this can lead to off-by-one errors if the result is slightly below the target (for positive values, opposite for negative). If the result is slightly above, it leads to the correct integer result.
Now, I was able to verify this easily: When I rounded the results explicitly with a call to nearbyint(), the unit test no longer failed on i686:
#include <cfenv> //For nearbyint.
Point3 apply(const Point3 p) const
{
return Point3(nearbyint(p.x * matrix[0] + p.y * matrix[1] + p.z * matrix[2])
, nearbyint(p.x * matrix[3] + p.y * matrix[4] + p.z * matrix[5])
, nearbyint(p.x * matrix[6] + p.y * matrix[7] + p.z * matrix[8]));
}
What really baffles me, though, is that the compiler doesn't choke on the code in this function. The components of the affine matrix are double
s, while the operand and the return value contain long long int
s.
Shouldn't this trigger a type mismatch? class Point
only has a constructor that accepts long long
s, so I don't see how this obvious loss of precision can possibly be valid C++ code.
Unfortunately, the above doesn't fix the error in getDist2BetweenLineSegments
, so there's probably other rounding issues lurking elsewhere.
I wouldn't spend to much effort here. Arcus is likely to be deprecated in the near future, since we're implementing a CuraEngine plugin system based on gRPC. See https://github.com/Ultimaker/CuraEngine/pull/1878
Those gRPC services will probably replace the functionality that Arcus provides.
It is unlikely that we will put in the work to fix anything other then a blocking bug in libArcus.
@jellespijker Ok, but is this really an issue with Arcus? It seems like a fundamental issue within CuraEngine, in that integer math was used in most places, but while still doing floating point calculations for some things (like rotation and length calculation). Are you going to deprecate this functionality in CuraEngine?
@Ghostkeeper I don't know if you're still working on this project, but perhaps you could still have a look and provide some wisdom?
By the way, I think the more signification deviations we're seeing in getDist2BetweenLineSegments
are simply an accumulation of off-by-one truncation errors in each integer result being added together, but I'll have to verify this.
@Ghostkeeper is no longer working for UltiMaker.
The point I was trying to convey is that this library has been "stable" for quite some time the only function it has is to transfer the mesh and settings to the engine and the resulting gcode back tonthe front-end.
If that functionality doesn't break we won't put in the work in this library.
Now with respects to floats and integer type usage in CuraEngine; yes there are mixed types and therefor rounding error. Some of this happens due to our inconsistent style, some because dependencies such as boost geometry needs to work with floating point type when calculate the Voronoi diagrams.
We are working on unifying that. Slowly but steadily. With modern types, concepts and statical analysis.
Once we switch to the gRPC services for communication between Cura and these message types will be under scrutiny once again. This is something that will probably happen in Cura 5.5 or 5.6.
Hence my suggestion to spend time wisely, and maybe not on Unit Test on this libraries for architectures that UltiMaker doesn't support any longer. We simply won't prioritize that over all other work that is still on our backlog
@onitake I just now noticed that this issue is in CuraEngine, I read your comments on my phone, where the title was cutoff to Arcus..., I was therefore under the impression that this discussion happend in the libArcus repository over a failing Unit Test.
My apologies.
That being said most of my previously made statement still stand:
std::integral auto
and std::floating_point auto
over concrete types such as double
, float
, coord_t
let the compiler figure it out for us.Thanks @jellespijker , that sounds indeed promising.
In the meantime, we'll probably "fix" the rounding issues on i686 with a Debian-only workaround, so we can hit it off the ground again. I still wonder why the compiler doesn't complain about the silent truncation from double to long long, though. Maybe there is something else amiss here. I'll keep investigating.
And it's sad to hear that @Ghostkeeper is no longer working on Cura... I believe a large part of the project was contributed by them?
So, it turns out that implicit conversion from double to long long is actually standard behavior in C++ (and C). For the definition, see https://en.cppreference.com/w/cpp/language/implicit_conversion (section Floating–integral conversions)
Here's an example that illustrates this:
#include <cstdio>
#include <cmath>
long long mul(long long a, double b) {
return a * b;
}
long long rmul(long long a, double b) {
return std::llrint(a * b);
}
volatile long long a = 10;
volatile double b = 0.499999999999999972;
volatile double c = 0.499999999999999973;
int main(int argc, char **argv) {
long long d = mul(a, b);
long long e = mul(a, c);
long long f = rmul(a, b);
long long g = rmul(a, c);
printf("a=%lld b=%f c=%f d=%lld e=%lld f=%lld g=%lld\n", a, b, c, d, e, f, g);
return 0;
}
The values of b and c are represented by 0x3fdfffffffffffff and 0x3fe0000000000000, respectively. As you can see, b is one ULP below 0.5, while c is converted to 0.5. This results in d=4 (truncated) and e=5 (exact). No error is raised, despite the loss of precision. With additional rounding, both results will be 5.
I pushed a PR with the rounding changes I made to fix the unit tests. This will basically be the patch we're going to use in Debian for now, and perhaps it will help someone else facing similar issues or serve as future reference for other rounding/type conversion changes.
You can close it if you think it's not appropriate, I just wanted to put it here for completeness sake.
Application Version 4.4.1
Platform 32 bit i386 Debian Linux
Steps to Reproduce Building 4.4.1 and master as of e904e260716 on Debian unstable fails on various architectures (only amd64, mipsel, sparc64, and x32 are fine)
On 32-bit i386 the problem is:
Full build log at https://buildd.debian.org/status/fetch.php?pkg=cura-engine&arch=i386&ver=1%3A4.4.1-1&stamp=1580249085&raw=0
Other architectures are listed here: https://buildd.debian.org/status/logs.php?pkg=cura-engine&ver=1%3A4.4.1-1