MRtrix3 / mrtrix3

MRtrix3 provides a set of tools to perform various advanced diffusion MRI analyses, including constrained spherical deconvolution (CSD), probabilistic tractography, track-density imaging, and apparent fibre density.
http://www.mrtrix.org
Mozilla Public License 2.0

CI: Fixes for MacOSX ARM64 runner #2890

Closed Lestropie closed 5 months ago

Lestropie commented 5 months ago

Addresses #2885. Working off the linked CI errors there.

Targeting master here; will need to either merge from master back to dev or re-implement there.

daljit46 commented 5 months ago

Still puzzled by the vectorstats failure: the corresponding test very clearly shows use of -frac 1e-6 for that call, but the error message claims an absolute tolerance of zero. Will see what gets generated from this PR.

I think you're referencing the wrong test. The test that's failing is on line 4. Indeed, appending -frac 1e-6 to testing_diff_matrix tmpouttvalue_t2.csv vectorstats/1/outtvalue_t2.csv makes the test pass (tested on my MacBook).

Lestropie commented 5 months ago

Line 4, not the fourth non-commented test; got it :sweat_smile:

Lestropie commented 5 months ago

Two tests remain failing even with fairly relaxed tolerances: mrcolour and tckresample.

Would be thankful if someone could generate these on an ARM64 machine and forward the data so that I can compare them to what mine generates: would prefer to see the differences on these rather than naively dialing back the tolerances further.

bjeurissen commented 5 months ago

Would be thankful if someone could generate these on an ARM64 machine and forward the data so that I can compare them to what mine generates: would prefer to see the differences on these rather than naively dialing back the tolerances further.

m1_outputs_of_failed_tests.zip

daljit46 commented 5 months ago

OK, so I have spent a frustratingly long time trying to understand why these precision issues arise in the first place. After playing around with the code in src/colourmap.cpp:35, I realised that, in rare cases, the result of 2.7213f * amplitude - 1.0f was being evaluated differently from const auto v = 2.7213f * amplitude; v - 1.0f;. This suggested that the compiler was fusing those two operations. Looking around, I found that GCC and Clang provide the -ffp-contract flag, and it seems that this flag has been enabled by default from Clang 14 onwards. From the Xcode 14.3 release notes:

By default, -ffp-contract=ON is set. This option enables shorter and faster floating-point code by fusing floating-point operations like multiplies and adds, but it may impact the accuracy of numerically sensitive applications. (105573336)

Other projects also seem to be affected by this problem, e.g. see here. Now, technically speaking, FMA operations should be more accurate than non-FMA operations, since they are less subject to intermediate rounding errors. I don't know why the flag gives different results on Apple's M1 chips (a possible reason might be that on x86 the flag doesn't actually force FMA operations but only enables them under certain conditions), but turning it off (-DCMAKE_CXX_FLAGS="-ffp-contract=off" with CMake, or CFLAGS="-ffp-contract=off" ./configure for the old build script) makes the code behave like x86 CPUs in my testing (indeed, this PR is unnecessary if the flag is turned off).
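As a minimal sketch of the effect (this is not MRtrix3 source; the input value is arbitrary), std::fma() from <cmath> reproduces explicitly the single-rounding behaviour that the compiler may silently emit for 2.7213f * amplitude - 1.0f when contraction is enabled:

```cpp
// Minimal sketch of floating-point contraction (not MRtrix3 code).
// Compile with -ffp-contract=off so that the "unfused" expression below is
// not itself contracted into a fused multiply-add by the compiler.
#include <cmath>
#include <cstdio>

int main() {
  const float amplitude = 0.3f; // arbitrary illustrative input
  const float unfused = 2.7213f * amplitude - 1.0f;        // product rounded, then subtraction rounded
  const float fused = std::fma(2.7213f, amplitude, -1.0f); // a*b + c with a single rounding
  // The two results can differ in the last bit, which is enough to trip a
  // regression test whose tolerance sits at or below that precision.
  std::printf("unfused: %.9g\nfused:   %.9g\ndiff:    %g\n",
              unfused, fused, unfused - fused);
  return 0;
}
```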

@MRtrix3/mrtrix3-devs should we just turn off FMA operations for ARM64 Mac chips?

Lestropie commented 5 months ago

Thanks @bjeurissen; very helpful.

The mrcolour failures were due to me using the wrong test criterion. If a colour map assigns a tiny fractional value to a colour channel, we should not then test precision to within some fraction of that value: that results in a threshold smaller than 32-bit floating-point precision.
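As a toy illustration of that mistake (hypothetical values; this is not the testing_diff_* implementation), a fractional tolerance on a near-zero reference value yields an absolute threshold far below the rounding noise of 32-bit arithmetic on order-one operands:

```cpp
// Toy illustration (not the testing_diff_* implementation): a -frac style
// tolerance applied to a tiny colour-channel value produces an absolute
// threshold far smaller than the rounding noise of 32-bit arithmetic.
#include <cmath>
#include <cstdio>
#include <limits>

int main() {
  const float reference = 1e-6f;             // tiny colour-channel value (hypothetical)
  const float threshold = 1e-6f * reference; // fractional tolerance -> 1e-12 absolute
  // Rounding noise from arithmetic on O(1) operands (the colour map
  // evaluation itself) is on the order of FLT_EPSILON ~ 1.2e-7:
  const float computed = reference + std::numeric_limits<float>::epsilon();
  const float diff = std::fabs(computed - reference);
  std::printf("difference %g vs. allowed %g -> %s\n",
              diff, threshold, diff <= threshold ? "pass" : "FAIL");
  return 0;
}
```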

tckresample was essentially as I had predicted. As per the code here, resampling to a fixed step size has a limited precision, and differences in floating-point operations can accumulate along the length of a streamline to the point that, for a given vertex, the two platforms yield samples that are on opposite sides of that precision window. The tolerance on the test was smaller than the width of that window. I've increased the tolerance on master here, but on dev I'd additionally advocate for tightening the tolerance on that window; it will be more computationally expensive, but I think it makes sense that if a fixed step size is requested, the inter-vertex distances should be pretty precise with respect to that requested step size. For completeness, here are the inter-vertex distances for the most problematic streamline in the test set (check the second-to-last step):

0.497605 0.900021 0.899807 0.899928 0.900478 0.900303 0.899573 0.900134 0.899189 0.21678
0.497599 0.900006 0.899828 0.899928 0.900481 0.900300 0.899574 0.900133 0.900815 0.215153
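
(As a toy illustration of that accumulation effect, and not the MRtrix3 resampler itself: repeatedly advancing by a fixed step in 32-bit precision drifts away from a higher-precision reference, much as per-vertex rounding differences between platforms can compound along a streamline.)

```cpp
// Toy illustration of accumulated rounding (not the MRtrix3 resampler):
// the float accumulator drifts away from the double reference as the walk
// proceeds, just as platform-dependent rounding differences can compound
// along a streamline until a vertex crosses the precision window.
#include <cstdio>

int main() {
  const float step = 0.9f; // requested fixed step size
  float arc_f = 0.0f;      // arc length accumulated in float
  double arc_d = 0.0;      // same accumulation carried in double
  for (int vertex = 0; vertex < 10000; ++vertex) {
    arc_f += step;
    arc_d += static_cast<double>(step);
  }
  std::printf("float: %.9g  double: %.12g  drift: %g\n",
              arc_f, arc_d, arc_d - static_cast<double>(arc_f));
  return 0;
}
```
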
Lestropie commented 5 months ago

Ooofffffff, race condition :zany_face:

Personally I'd say keep this PR. The tests should ideally pass on any system where the code has been compiled, which means across both different hardware and different compiler flags, so some amount of tolerance to account for such is reasonable. We're not dealing with differences of a magnitude that is consequential for the respective operations to be performed on user data. If it were a bigger problem, we could discuss generating test data with FFP on, and either using that as the reference with tolerance for the errors with FFP off, or storing outputs from both configurations and comparing generated data to both references; but I don't think it's worth the trouble here.

What are the respective timelines on compiler support for FFP contractions vs. hardware support for the corresponding opcodes? If it's both potentially faster and more precise, there's an argument for enabling such across the board, including for precompiled binaries, rather than disabling it to reduce inter-system variance. But only if these are long-standing x86 operations that compilers are only now starting to utilise.

daljit46 commented 5 months ago

Personally I'd say keep this PR.

If we believe the new tolerances to be more sensible, I agree. My only concern here is that the differences in results between ARM64 and x86 chips could propagate into arbitrarily large differences (e.g. in a for loop that involves floating-point operations). If this is the case, then our commands can give significantly different results depending on the platform. I guess we can deal with it when we encounter an issue of this nature.

What's the respective timelines on compiler support for FFP contractions vs. hardware support for such opcodes?

Any modern x86 CPU (e.g. any CPU released in the last 10 years by Intel or AMD) should support FMA operations (C++ even provides std::fma in the standard library). Both GCC and Clang now enable these optimisations by default.
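For reference, a small sketch of the standard-library support mentioned above: std::fma and the FP_FAST_FMA feature macro both come from <cmath>.

```cpp
// Sketch: <cmath> exposes fused multiply-add directly, and defines
// FP_FAST_FMA when std::fma maps to a single hardware instruction rather
// than a slower software fallback.
#include <cmath>
#include <cstdio>

int main() {
  const double r = std::fma(2.7213, 0.3, -1.0); // a*b + c, rounded once
  std::printf("fma(2.7213, 0.3, -1.0) = %.17g\n", r);
#ifdef FP_FAST_FMA
  std::puts("std::fma compiles to a hardware FMA on this target");
#else
  std::puts("std::fma may fall back to software emulation on this target");
#endif
  return 0;
}
```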