AcademySoftwareFoundation / OpenColorIO

A color management framework for visual effects and animation.
https://opencolorio.org
BSD 3-Clause "New" or "Revised" License
1.79k stars 455 forks source link

Failing tests when FMA instruction is used #1784

Open kxxt opened 1 year ago

kxxt commented 1 year ago

Hi,

Some tests are failing on riscv64 architecture because the results on riscv64 is more precise than other architectures.

For example, test GammaOpCPU, apply_moncurve_mirror_style_fwd in tests/cpu/ops/gamma/GammaOpCPU_tests.cpp fails:

Detailed output from failed test ```c++ rgba[0] = 0.94786727428436279 0.83333331346511841 0.71428573131561279 0.625 rgba[1] = 0.052132699638605118 0.1666666716337204 0.28571429848670959 0.375 rgba[2] = 2.4000000953674316 2.2000000476837158 2 1.7999999523162842 rgba[3] = 0.039285715669393539 0.1666666716337204 0.40000000596046448 0.75 rgba[4] = 0.077380158007144928 0.44192609190940857 0.8163265585899353 0.98202729225158691 Iteration 0 sign = 1 1 1 1 pixel = 0.00050000002374872565 0.004999999888241291 0.05000000074505806 0.75 repro: pixel[2]=1028443341, blu[0] = 1060559726, blu[1] = 1049774373 intermediate = 0.052606634795665741 0.17083333432674408 0.32142859697341919 0.84375 data = 0.00085211289115250111 0.02049555629491806 0.10331634432077408 0.73652046918869019 out = 3.8690079236403108e-05 0.0022096303291618824 0.040816329419612885 0.73652046918869019 Iteration 1 sign = -1 -1 -1 -1 pixel = 0.00050000002374872565 0.004999999888241291 0.05000000074505806 0.75 repro: pixel[2]=1028443341, blu[0] = 1060559726, blu[1] = 1049774373 intermediate = 0.052606634795665741 0.17083333432674408 0.32142859697341919 0.84375 data = 0.00085211289115250111 0.02049555629491806 0.10331634432077408 0.73652046918869019 out = -3.8690079236403108e-05 -0.0022096303291618824 -0.040816329419612885 -0.73652046918869019 Iteration 2 sign = 1 1 1 1 pixel = 0.25 0.5 0.75 1 repro: pixel[2]=1061158912, blu[0] = 1060559726, blu[1] = 1049774373 intermediate = 0.28909951448440552 0.58333331346511841 0.82142859697341919 1 data = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1 out = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1 Iteration 3 sign = -1 -1 -1 -1 pixel = 0.25 0.5 0.75 1 repro: pixel[2]=1061158912, blu[0] = 1060559726, blu[1] = 1049774373 intermediate = 0.28909951448440552 0.58333331346511841 0.82142859697341919 1 data = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1 out = -0.050876077264547348 -0.30550399422645569 -0.67474496364593506 -1 Iteration 4 sign = 1 1 1 1 pixel = 0.80000001192092896 0.94999998807907104 1 1.5 repro: pixel[2]=1065353216, blu[0] = 1060559726, blu[1] = 1049774373 intermediate = 0.81042653322219849 0.95833331346511841 1 1.3125 data = 0.60382729768753052 0.91061854362487793 1 1.6314687728881836 out = 0.60382729768753052 0.91061854362487793 1 1.6314687728881836 Iteration 5 sign = -1 -1 -1 -1 pixel = 0.80000001192092896 0.94999998807907104 1 1.5 repro: pixel[2]=1065353216, blu[0] = 1060559726, blu[1] = 1049774373 intermediate = 0.81042653322219849 0.95833331346511841 1 1.3125 data = 0.60382729768753052 0.91061854362487793 1 1.6314687728881836 out = -0.60382729768753052 -0.91061854362487793 -1 -1.6314687728881836 Iteration 6 sign = 1 1 1 1 pixel = 1.0049999952316284 1.0499999523162842 1.5 1 repro: pixel[2]=1069547520, blu[0] = 1060559726, blu[1] = 1049774373 intermediate = 1.0047392845153809 1.0416666269302368 1.3571429252624512 1 data = 1.0114120244979858 1.0939645767211914 1.8418369293212891 1 out = 1.0114120244979858 1.0939645767211914 1.8418369293212891 1 Iteration 7 sign = -1 -1 -1 -1 pixel = 1.0049999952316284 1.0499999523162842 1.5 1 repro: pixel[2]=1069547520, blu[0] = 1060559726, blu[1] = 1049774373 intermediate = 1.0047392845153809 1.0416666269302368 1.3571429252624512 1 data = 1.0114120244979858 1.0939645767211914 1.8418369293212891 1 out = -1.0114120244979858 -1.0939645767211914 -1.8418369293212891 -1 Iteration 8 sign = -1 1 1 1 pixel = inf inf nan 0 repro: pixel[2]=2143289344, blu[0] = 1060559726, blu[1] = 1049774373 intermediate = inf inf nan 0.375 data = inf inf nan 0.17110247910022736 out = -inf inf nan 0 Result[0] = 3.8689999200869352e-05 Image [0] = 3.8690079236403108e-05 Result[1] = 0.0022096300963312387 Image [1] = 0.0022096303291618824 Result[2] = 0.040816318243741989 Image [2] = 0.040816329419612885 Result[3] = 0.73652046918869019 Image [3] = 0.73652046918869019 Result[4] = -3.8689999200869352e-05 Image [4] = -3.8690079236403108e-05 Result[5] = -0.0022096300963312387 Image [5] = -0.0022096303291618824 Result[6] = -0.040816318243741989 Image [6] = -0.040816329419612885 Result[7] = -0.73652046918869019 Image [7] = -0.73652046918869019 Result[8] = 0.050876069813966751 Image [8] = 0.050876077264547348 Result[9] = 0.30550399422645569 Image [9] = 0.30550399422645569 Result[10] = 0.67474484443664551 Image [10] = 0.67474496364593506 /usr/src/debug/opencolorio/OpenColorIO-2.2.1/tests/cpu/ops/gamma/GammaOpCPU_tests.cpp:710: FAILED: Index: 10 - Values: 0.674744964 and: 0.674744844 - Threshold: 1.00000001e-07 Result[11] = 1 Image [11] = 1 Result[12] = -0.050876069813966751 Image [12] = -0.050876077264547348 Result[13] = -0.30550399422645569 Image [13] = -0.30550399422645569 Result[14] = -0.67474484443664551 Image [14] = -0.67474496364593506 /usr/src/debug/opencolorio/OpenColorIO-2.2.1/tests/cpu/ops/gamma/GammaOpCPU_tests.cpp:710: FAILED: Index: 14 - Values: -0.674744964 and: -0.674744844 - Threshold: 1.00000001e-07 Result[15] = -1 Image [15] = -1 Result[16] = 0.60382729768753052 Image [16] = 0.60382729768753052 Result[17] = 0.91061854362487793 Image [17] = 0.91061854362487793 Result[18] = 1 Image [18] = 1 Result[19] = 1.6314687728881836 Image [19] = 1.6314687728881836 Result[20] = -0.60382729768753052 Image [20] = -0.60382729768753052 Result[21] = -0.91061854362487793 Image [21] = -0.91061854362487793 Result[22] = -1 Image [22] = -1 Result[23] = -1.6314687728881836 Image [23] = -1.6314687728881836 Result[24] = 1.0114120244979858 Image [24] = 1.0114120244979858 Result[25] = 1.0939645767211914 Image [25] = 1.0939645767211914 Result[26] = 1.8418365716934204 Image [26] = 1.8418369293212891 /usr/src/debug/opencolorio/OpenColorIO-2.2.1/tests/cpu/ops/gamma/GammaOpCPU_tests.cpp:710: FAILED: Index: 26 - Values: 1.84183693 and: 1.84183657 - Threshold: 1.00000001e-07 Result[27] = 1 Image [27] = 1 Result[28] = -1.0114120244979858 Image [28] = -1.0114120244979858 Result[29] = -1.0939645767211914 Image [29] = -1.0939645767211914 Result[30] = -1.8418365716934204 Image [30] = -1.8418369293212891 /usr/src/debug/opencolorio/OpenColorIO-2.2.1/tests/cpu/ops/gamma/GammaOpCPU_tests.cpp:710: FAILED: Index: 30 - Values: -1.84183693 and: -1.84183657 - Threshold: 1.00000001e-07 Result[31] = -1 Image [31] = -1 Result[32] = -inf Image [32] = -inf Result[33] = inf Image [33] = inf Result[34] = nan Image [34] = nan Result[35] = 0 Image [35] = 0 ```

Note that I modified src/OpenColorIO/ops/gamma/GammaOpCPU.cpp and tests/cpu/ops/gamma/GammaOpCPU.cpp to get more outputs.

Modified source ```c++ void GammaMoncurveMirrorOpCPUFwd::apply(const void* inImg, void* outImg, long numPixels) const { const float* in = (const float*)inImg; float* out = (float*)outImg; const float red[5] = { m_red.scale, m_red.offset, m_red.gamma, m_red.breakPnt, m_red.slope }; const float grn[5] = { m_green.scale, m_green.offset, m_green.gamma, m_green.breakPnt, m_green.slope }; const float blu[5] = { m_blue.scale, m_blue.offset, m_blue.gamma, m_blue.breakPnt, m_blue.slope }; const float alp[5] = { m_alpha.scale, m_alpha.offset, m_alpha.gamma, m_alpha.breakPnt, m_alpha.slope }; std::cout.precision(17); for (int i = 0; i < 5; i++) { std::cout << "rgba[" << i << "] = " << red[i] << " " << grn[i] << " " << blu[i] << " " << alp[i] << std::endl; } for (long idx = 0; idx < numPixels; ++idx) { std::cout << "Iteration " << idx << std::endl; const float sign[4] = { std::copysign(1.0f, in[0]), std::copysign(1.0f, in[1]), std::copysign(1.0f, in[2]), std::copysign(1.0f, in[3]) }; std::cout << "sign = " << sign[0] << " " << sign[1] << " " << sign[2] << " " << sign[3] << std::endl; const float pixel[4] = { std::fabs(in[0]), std::fabs(in[1]), std::fabs(in[2]), std::fabs(in[3]) }; std::cout << "pixel = " << pixel[0] << " " << pixel[1] << " " << pixel[2] << " " << pixel[3] << std::endl; std::cout << "repro: pixel[2]=" << std::hex << *(int*)(&pixel[2]) << ", blu[0] = " << *(int*)(&blu[0]) << ", blu[1] = " << *(int*)(&blu[1]) << std::endl; const float intermediate[4] = { pixel[0] * red[0] + red[1], pixel[1] * grn[0] + grn[1], pixel[2] * blu[0] + blu[1], pixel[3] * alp[0] + alp[1] }; std::cout << "intermediate = " << intermediate[0] << " " << intermediate[1] << " " << intermediate[2] << " " << intermediate[3] << std::endl; const float data[4] = { std::pow(intermediate[0], red[2]), std::pow(intermediate[1], grn[2]), std::pow(intermediate[2], blu[2]), std::pow(intermediate[3], alp[2]) }; std::cout << "data = " << data[0] << " " << data[1] << " " << data[2] << " " << data[3] << std::endl; out[0] = sign[0] * (pixel[0] <= red[3] ? pixel[0] * red[4] : data[0]); out[1] = sign[1] * (pixel[1] <= grn[3] ? pixel[1] * grn[4] : data[1]); out[2] = sign[2] * (pixel[2] <= blu[3] ? pixel[2] * blu[4] : data[2]); out[3] = sign[3] * (pixel[3] <= alp[3] ? pixel[3] * alp[4] : data[3]); std::cout << "out = " << out[0] << " " << out[1] << " " << out[2] << " " << out[3] << std::endl; in += 4; out += 4; } ``` ```c++ void ApplyGamma(const OCIO::OpRcPtr & op, float * image, const float * result, long numPixels, unsigned line, float errorThreshold) { std::cout.precision(10); const auto cpu = op->getCPUOp(true); OCIO_CHECK_NO_THROW_FROM(cpu->apply(image, image, numPixels), line); for(long idx=0; idx<(numPixels*4); ++idx) { std::cout << "Result[" << idx << "] = " << std::setw(16) << result[idx] << std::endl; std::cout << "Image [" << idx << "] = " << std::setw(16) << image[idx] << std::endl; if (OCIO::IsNan(result[idx])) { OCIO_CHECK_ASSERT_FROM(OCIO::IsNan(image[idx]), line); continue; } // Using rel error with a large minExpected value of 1 will transition // from absolute error for expected values < 1 and // relative error for values > 1. const bool equalRel = OCIO::EqualWithSafeRelError(image[idx], result[idx], errorThreshold, 1.0f); if (!equalRel) { // As most of the error thresholds are 1e-7f, the output // value precision should then be bigger than 7 digits // to highlight small differences. std::ostringstream message; message.precision(9); message << "Index: " << idx << " - Values: " << image[idx] << " and: " << result[idx] << " - Threshold: " << errorThreshold; OCIO_CHECK_ASSERT_MESSAGE_FROM(0, message.str(), line); } } } ```

Let's take a detailed look at the Iteration 2 from the log:

Iteration 2
sign = 1 1 1 1
pixel = 0.25 0.5 0.75 1
repro: pixel[2]=1061158912, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.28909951448440552 0.58333331346511841 0.82142859697341919 1
data = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1
out = 0.050876077264547348 0.30550399422645569 0.67474496364593506 1

On x86_64 with SSE2 disabled, it looks like:

Iteration 2
sign = 1 1 1 1
pixel = 0.25 0.5 0.75 1
repro: pixel[2]=1061158912, blu[0] = 1060559726, blu[1] = 1049774373
intermediate = 0.28909951448440552 0.58333331346511841 0.82142853736877441 1
data = 0.050876077264547348 0.30550399422645569 0.67474484443664551 1
out = 0.050876077264547348 0.30550399422645569 0.67474484443664551 1

The difference of the 3rd element(at index 2) causes a test failure later.

x86_64 : intermediate[2] = 0.82142853736877441
riscv64: intermediate[2] = 0.82142859697341919

After some investigation, I found that the difference is caused by FMA(Fused Multiply-Add).

The following expression

pixel[2] * blu[0] + blu[1]

can be computed with two steps:

fmul.s <a>, <a>, <b>
fadd.s <result>,<a> , <c>

It is equivalent to round(round(a * b) + c), which is the behavior on x86_64.

However, with the current build config in this repo, gcc uses fmadd.s to compute the expr.

fmadd.s <result_register> <a> <b> <c> computes a * b + c as round(a * b + c), which is of higher precision than round(round(a * b) + c). And this difference finally caused the test failure.

There are several ways to fix this and I would like to ask for your opinions.

  1. Adjust the expected values and error threshold in the test . In my opinion, this is the most correct solution but it will involve a lot of work.
  2. Add the compiler flag: -ffp-contract=off to prevent the compiler from generating FMA instructions. Apparently this is not recommended because we will lose the benefits of the higher precision we get for free from FMA.
kxxt commented 1 year ago

This is also reproducible on x86_64 platforms with the following steps

CXXFLAGS='-mfma' cmake -GNinja  -Bbuild  -DOCIO_USE_SSE=OFF
ninja -C build
cd build/tests/cpu
./test_cpu_exec

Failed tests: https://pastebin.pl/view/fd57c16e

I can send a PR to fix the failing tests when FMA is used.

doug-walker commented 1 year ago

Thank you for investigating this @kxxt. I think we would welcome a PR that fixes the test as long as the tests for our usual compilers don't get significantly worse. So there should be some mechanism for deciding when to use the looser tests. Do you have a suggestion for how to do that?

Thanks for posting the output from the failing tests. It looks like the first one will already be fixed in this pending PR: https://github.com/AcademySoftwareFoundation/OpenColorIO/pull/1687

The following ones in CPUProcessor_tests.cpp look like they could be fixed by allowing a tolerance of +/- 1. However, we should do this in a way that does not remove the exact test for our usual compilers, some sort of "loose mode".

The file format tests could be skipped when in this mode.

The GammaOpCPU_tests.cpp tolerance could be loosened to 2e-7 when in this mode.

kxxt commented 1 year ago

Hi @doug-walker ,

So there should be some mechanism for deciding when to use the looser tests.

I don't think there is any reliable way to detect FMA other than compiling a sample program and compare its output value. For gcc, -ffp-contract is set to fast by default. The bug doesn't occur on x86_64 because gcc by default targets generic x86_64 that doesn't have FMA instructions.

as long as the tests for our usual compilers don't get significantly worse.

Technically speaking, the tests are incorrect and looser tests are correct. The language standard makes no guarantee that we will get the hard-coded results in the tests. Whether and how expressions are contracted is implementation-defined. [^1]

A floating expression may be contracted, that is, evaluated as though it were an atomic operation, thereby omitting rounding errors implied by the source code and the expression evaluation method. The FP_CONTRACT pragma in <math.h> provides a way to disallow contracted expressions. Otherwise, whether and how expressions are contracted is implementation-defined.

[^1]: C99 standard, page 80, https://www.dii.uchile.cl/~daespino/files/Iso_C_1999_definition.pdf

markreidvfx commented 6 months ago

1950 might have resolved this issue. It fixes a few unit test failures related to FMA on apple silicon.