Improving CPU performance via math-related compiler optimization flags

cedrik-fuoco-adsk commented 1 year ago

As is well known, some compiler flags allow the optimizer to generate faster CPU code. One of these is -ffast-math. However, that option has problems that make it unsuitable for use with OCIO. For example, it can sometimes change the floating-point behavior of applications that link to a library built with it. It also includes options such as -ffinite-math-only, which interfere with the NaN handling mechanisms that OCIO uses.

However, -ffast-math turns on a subset of options that is worth exploring, grouped under an option called -funsafe-math-optimizations, which itself enables several sub-optimizations. We found that turning on just three of these gives the same speed-up as turning on -ffast-math (or -funsafe-math-optimizations).

These flags are: -fno-signed-zeros, -freciprocal-math, and -fassociative-math. All three need to be enabled together: turning off any one of them prevents the others from being effective, and the performance gain disappears.
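To make concrete what two of these flags permit, here is a minimal C++ sketch. The function names are illustrative and not taken from OCIO, and the code only spells out by hand the kind of rewrite the compiler is allowed to make; whether it actually makes it depends on the compiler and target.

```cpp
#include <cmath>

// Division exactly as written in the source: one correctly rounded operation.
float divideExact(float x, float y)
{
    return x / y;
}

// The rewrite that -freciprocal-math allows: multiply by the reciprocal.
// This rounds twice (once for 1/y, once for the product), so the result
// can differ from divideExact in the last bit.
float divideViaReciprocal(float x, float y)
{
    const float r = 1.0f / y;
    return x * r;
}

// With signed zeros honored, x + 0.0f cannot be simplified to x, because
// (-0.0f) + 0.0f is +0.0f under round-to-nearest. -fno-signed-zeros tells
// the compiler it may ignore that distinction.
bool signBitDiffersAfterAddingZero()
{
    volatile float negZero = -0.0f; // volatile keeps the add from being folded
    const float sum = negZero + 0.0f;
    return std::signbit(negZero) != std::signbit(sum);
}
```

Built with default flags on an IEEE 754 target, divideExact(5.0f, 3.0f) and divideViaReciprocal(5.0f, 3.0f) differ by one unit in the last place, which is the scale of difference these flags can introduce per operation.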

Using ocioperf with a custom CLF file that is heavy on calculation, using most of the OCIO transforms that include SIMD instructions, I get the following results:

The result was about a 25% speed-up for the default processing path, which uses SIMD intrinsics. The pure C++ path saw minimal speed-ups, but that path is not typically used. These tests were done on an Apple MacBook Pro with an M1 processor, running in native ARM mode and using sse2neon to leverage Neon SIMD instructions.

We suspect that for the type of calculations done in OCIO, -fno-signed-zeros and -freciprocal-math should be harmless. The -fassociative-math flag allows the compiler to re-order arithmetic operations; we suspect it is mostly harmless, but it needs more investigation.
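As an illustration of why re-ordering deserves a closer look, the following C++ sketch (illustrative names, not OCIO code) evaluates the same three-term sum with two different groupings. Floating-point addition is not associative, so the groupings can give different answers, and -fassociative-math lets the compiler choose either one.

```cpp
// Sum grouped as (a + b) + c.
float sumLeftToRight(float a, float b, float c)
{
    return (a + b) + c;
}

// Sum grouped as a + (b + c). Mathematically identical, but the
// intermediate b + c is rounded to single precision before a is added.
float sumRightToLeft(float a, float b, float c)
{
    return a + (b + c);
}
```

With a = 1e8f, b = -1e8f, and c = 1.0f, the first grouping returns 1 while the second returns 0, because -1e8f + 1.0f rounds back to -1e8f in single precision. This is a deliberately extreme input; values in a typical pixel-processing range would be expected to show much smaller differences.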

Enabling the three flags causes a fair number of CPU unit tests to fail. Initial investigation seems to indicate these are all due to rounding differences. For example, a test comparing integer pixel values may be expecting exactly 32565. Without the flags, the floating-point result would be 32565.4982, which gets rounded down to 32565. But with the flags, you'd get 32565.51, which gets rounded up to 32566, causing the test to fail.

This is arguably more a problem with the OCIO tests, which do exact comparisons of integer values rather than allowing for the very slight variations that can cause values near 0.5 to round one way or the other.
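A tolerance-based comparison along these lines could look like the sketch below. The helper names and the default tolerance of one code value are assumptions for illustration; they are not how the OCIO unit tests are currently written.

```cpp
#include <cmath>
#include <cstdlib>

// Round a floating-point pixel value to the nearest integer code value.
long quantize(double value)
{
    return std::lround(value);
}

// Hypothetical test helper: accept a result that is within `tolerance`
// integer code values of the expected one, instead of requiring equality.
bool codeValuesMatch(long expected, long actual, long tolerance = 1)
{
    return std::labs(expected - actual) <= tolerance;
}
```

Using the numbers from the example above, quantize(32565.4982) gives 32565 and quantize(32565.51) gives 32566. An exact comparison fails, while codeValuesMatch(32565, 32566) passes and still rejects a result that is several code values off.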

Note that these options seem to have much more benefit on the Mac/ARM than on Intel, where we saw less than a 10% speed-up (and in that test -fno-trapping-math was used too).

We're logging this issue to collect feedback from the community as to whether you would like to see these options enabled in the OCIO build.

I've attached the CLF file that I was using as well as the logs from the CPU unit tests. heavy_transform.zip

no-signed-zeros - associative-math - reciprocal-math.zip

doug-walker commented 1 year ago

UPDATE: As mentioned above, these three math flags have a bigger impact on Mac/ARM than on Linux/Intel. We took an action item from the TSC discussion to try running the Mac/ARM tests with GCC rather than Clang. Unfortunately, Homebrew does not seem to have the newer versions of GCC that have ARM support, so we did not have an easy way to try this. We did, however, use Clang on Linux/Intel. Clang itself gave a 15% speed-up over GCC, but the math flags discussed above did not add any further speed-up. Our Linux/Intel tests were done using WSL on Windows, so ideally we would like to re-run them on a pure Linux machine at some point.

remia commented 1 year ago

I have no particular experience playing with those flags, but the description of -fassociative-math at https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html definitely sounds a bit scary. Like @doug-walker, I'd like to see more performance numbers on GCC/Clang/MSVC (if similar flags exist) to see the overall impact (this could also confirm whether it's really only beneficial for ARM, which seems likely from the above).

If we go ahead with such flags enabled by default (globally or only for ARM), I think we also need a better understanding of where they make a difference, and to isolate the specific areas in the code. Profiling and playing with the CLF transforms might allow us to better pinpoint their effect rather than treating the library as a whole.

heshpdx commented 1 year ago

@cedrik-fuoco-adsk Could you share the command line you used for the measurements, and where you put heavy_transform.clf before you ran?

cedrik-fuoco-adsk commented 1 year ago

Hi @heshpdx,

I am using ocioperf (in the bin directory where OCIO is installed) which is provided by OCIO. As long as you can access the executable in your terminal environment, you can call ocioperf and provide the path to the CLF file.

ocioperf --transform /path/to/heavy_transform.clf

If you want to test with the different compiler flags, OCIO needs to be built with those flags. Example:

cmake -E env CXXFLAGS="-fno-signed-zeros -freciprocal-math -fassociative-math" cmake [...]

cedrik-fuoco-adsk commented 1 year ago

Here are some data points from an Ubuntu machine with an Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz. I built OCIO without any changes and with some combinations of the compiler options mentioned above (using the same heavy_transform.clf as in the first post).

heshpdx commented 1 year ago

Thank you @cedrik-fuoco-adsk and @doug-walker for heavy_transform.clf. We've been playing with this and verifying on some different arch/compiler combos with the imprecise math you are showing above (for the sake of benchmarking, see #1768). There are some cases where the parameters evaluate as NaN which leads to problems down the line.

OCIO ERROR: The specified transform file 'heavy_transform.clf' could not be loaded.
All formats have been tried. (Enable debug log for errors from all formats.) The format for the file's extension gave the error:

    'Academy/ASC Common LUT Format' failed with: At line 14: Log is not valid: 'Log: Invalid base value 'nan', base cannot be 1.'.

Looking at src/OpenColorIO/ops/log/LogOpData.cpp line 214, the validate() method gets called and at this point m_base has come in as NaN. It looks like it should get the value of 2.0 but for some reason it doesn't (maybe due to the file parser assigning it to overwrite the initial value). I looked at what parameters were not being set in the file and then assigned them explicitly based on what the defaults should be. Here is the diff which made it better:

--- a/heavy_transform.clf
+++ b/heavy_transform.clf
@@ -19,7 +19,7 @@

     <Log inBitDepth="32f" outBitDepth="32f" style="cameraLinToLog">
         <Description>Linear to ACEScct</Description>
-        <LogParams logSideSlope="0.05707762557" logSideOffset="0.5547945205" linSideBreak="0.0078125" />
+        <LogParams logSideSlope="0.05707762557" logSideOffset="0.5547945205" linSideSlope="1" linSideOffset="0" linSideBreak="0.0078125" base="2" />
     </Log>

     <ASC_CDL id="cc01234" inBitDepth="32f" outBitDepth="32f" style="Fwd">

Incidentally, I see the same LogParams stanza used in the following files, which are in the OCIO repository:

tests/data/files/clf/aces_to_video_with_look.clf
tests/data/files/clf/multiple_ops.clf
tests/data/files/clf/log_all_styles.clf

If there is a push to productize imprecise math, we may want to explicitly set all the parameters in those files, or figure out why the parser is overwriting m_base and other variables with NaN.

AcademySoftwareFoundation / OpenColorIO

Improving CPU performance via math-related compiler optimization flags #1774