Cantera / enhancements

Repository for proposed and ongoing enhancements to Cantera

Make Cantera compatible with -ffast-math #125

Open g3bk47 opened 2 years ago

g3bk47 commented 2 years ago

Abstract

After the recent discussion about compiling Cantera with -ffast-math (which is basically the default for the Intel compilers), I set up a benchmark suite to test the accuracy and computational performance of Cantera when using different optimization flags and compilers. The relevant discussions can be found here: https://github.com/Cantera/cantera/issues/1155 https://github.com/Cantera/cantera/issues/1150 https://github.com/Cantera/cantera/commit/9daebd9c39a1c891dbd6bc8e50a332c80d23ee5a

Motivation/Results

I ran different sample programs (evaluation of reaction rates, 0D reactor and 1D flame) with 16 different compilers/versions and 8 different optimization settings. The findings can be summarized as follows:

From my tests above, the current defaults of Cantera seem to be the optimal compromise between performance and safety:

Since `-ffast-math` without `-fno-finite-math-only` can improve the performance of g++ for simple cases like the evaluation of reaction rates by 15 %, it would be nice for Cantera to be compatible with this option, e.g. for users coupling Cantera to other CFD codes. However, this means that the internal use of NaNs and Infs would have to be removed.

Let me know if you have any other interesting code snippets that should be benchmarked to aid the discussion.

References

For all details of my benchmark suite, please see: https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/README.md

ischoegl commented 2 years ago

Hi @g3bk47 ... thanks again for setting up these tests. Could you clarify your comments on the convergence issues that you observed?

@speth commented with separate testing in Cantera/cantera#1155 (for ignition delay) that ...

> The differences in the calculations lead to the number of time steps needed for an individual simulation to vary unpredictably, with differences in individual simulations of up to 10%, with the average around 2.5%, so it's hard to say that these optimizations necessarily lead to higher performance on a wall-time basis.

Is this consistent with your conclusions?

g3bk47 commented 2 years ago

Hi, in my work, I am using Cantera to calculate reaction rates and mixture-averaged diffusion properties, which does not include any type of iteration. So I would fully profit from the ~15% performance increase with gcc.

In my tests, there was only one case where more aggressive optimization settings affected performance negatively, i.e. the 1D flame with very tight tolerances (while the 1D flame with more relaxed tolerances showed some speedup). However, the negative effect in the case of tight tolerances is quite extreme. Just to pick a few data points from Table 13 (https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/README.md, https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/oneD.cpp): Running the 1D flame requires the following wall clock time:

So in the three test cases I have looked at (reaction rates, 0D, 1D), the impact of more aggressive compiler optimizations on performance was quite binary: either using fastmath led to some speedup of <= 15%, or to a massive slowdown of 3x to 40x. I did not look into the solver output to see what the actual reason for that is, or whether compiling all external libraries with O3 and only Cantera with O3 and fastmath might fix the problem. With the data so far, my conclusion would be that enabling fastmath only makes sense if Cantera is used for problems that do not involve iterative solutions, which arguably excludes most use cases of Cantera.

Of course, my tests so far were limited to just a handful of sample programs. Feel free to suggest any other test programs I could throw at the test suite, or let me know if I should look more into the cases with massive slowdown.

ischoegl commented 2 years ago

Thanks, @g3bk47! ... for my own part, I am planning to revise some of the instances where I recently introduced NaNs as sentinel values. Speedups of 15% are very intriguing, despite the troubling slowdowns observed in other instances. One thing that I can say about convergence is that even without fastmath, the solvers sometimes produced quirky failures for me that usually took 'playing' with tolerances to resolve.

speth commented 2 years ago

Thanks for the extensive set of tests, @g3bk47. I tried replicating some of these results using your reactionRates.cpp test on some of my machines, and interestingly, the results are substantially different. I didn't run quite as many cases, but here are the results for what I did:

| compiler | optimization flags | median runtime (s) | std. deviation (s) |
|---|---|---|---|
| GCC 9.3 | `-O3` | 16.303 | 0.202 |
| GCC 9.3 | `-O3 -ffast-math` | 15.653 | 0.192 |
| GCC 9.3 | `-O3 -ffast-math -fno-finite-math-only` | 15.757 | 0.236 |
| ICPX 2021.3.0 | `-O3 -fp-model precise` | 15.018 | 0.211 |
| ICPX 2021.3.0 | `-O3 -fp-model fast -ffast-math` | 15.010 | 0.203 |

This is using your grep-based patch to the Cantera source, and 20 runs of the test program in each case. Two results stand out here in comparison to your runs, and I'm curious about both of them.

I ran these tests on a system with Xeon E5-2650 v4 (2.20GHz) CPUs, which are a bit older (2016 vintage, rather than your 2021 processors). That may provide some explanation of what's happening with the Intel compiler -- it may be able to generate code that uses some processor features that GCC hasn't been updated to use yet.

g3bk47 commented 2 years ago

Thanks for the interesting results, @speth. I agree that the difference between fastmath and fastmath+nofinitemath is generally small, but at least I had one case where the difference was 10%. As far as I know, even if the Intel compilers are not told what the target CPU is, they can generate different code paths for different CPU types. Maybe this plays a role here. The only way to find out would probably be to look at the generated assembly.

Just a few additional thoughts on what might cause the differences in our results (apart from the different CPUs):

I will run my test suite again on another cluster. There, 15 different compilers/versions and two different compute nodes are available:

Maybe I can reproduce your results there or at least provide additional data points.

speth commented 2 years ago

Yes, you're correct that the Intel compiler (and maybe others) can generate multiple code paths and select different ones at run time based on the specific processor. The most infamous use of this has been to use less-optimal code paths when running on AMD processors. In this case, that behavior may be why there's so little impact of telling it to emit code for your specific processor rather than the more backwards-compatible default, if it's able to use the more optimized path opportunistically.

I did not recompile any other libraries that Cantera links to. However, for the code in question, there isn't much happening outside calls to the C++ standard library. The rate evaluations don't even use Eigen.

For my system, libm comes from glibc version 2.31, which is what is used in Ubuntu 20.04. I see that the binaries built with icpx also link to the system libm, although I don't know if it ends up calling the implementation of exp from that library or something provided by one of the Intel-specific libraries.

g3bk47 commented 2 years ago

The libc on all clusters I have access to is actually older than yours (version 2.28). Another wild guess why our results differ might be that your CPU clocks down when it gets too hot, so that there is some kind of lower bound for performance?

I mentioned Eigen because it appears here https://github.com/Cantera/cantera/blob/main/include/cantera/kinetics/StoichManager.h#L617-L618. But I am not entirely sure if this is used in the sample program.

I ran my test suite again on the two other systems. One of the systems uses an Intel Xeon E5-2660 v4 CPU, which sounds close to your setup. However, I again got pretty much the opposite of your results: using fastmath still gives a ~10% performance gain over fastmath+nofinitemath and the Intel compiler is significantly faster than gcc. For all details, please see the new results at https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/NewSystems.md.

This time, I measured the code performance with a profiler and looked into the generated assembly (again, see https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/NewSystems.md#2-profiling for more details). To briefly sum up my first findings:

With these preliminary findings, my performance measurements sound plausible to me.

speth commented 2 years ago

Ah, those profiling results are very interesting. I'm confused as to what's happening on my system -- when compiling with GCC and -O3 -ffast-math, I still see the calls to __GI___exp wrapping the calls to __ieee754_exp_fma, which would pretty easily explain why I'm not seeing any performance improvement, but I don't understand why it's not skipping this, if that is indeed an error handling wrapper as you've suggested. I'll have to try again on a couple of different machines to see if this is indeed associated with the version of glibc.

ischoegl commented 2 years ago

PR Cantera/cantera#1330 ~proposes to remove~ (now merged) removes NaN sentinel values from ReactionRate code.

ischoegl commented 1 year ago

Reposting here as it has come up elsewhere: someones-been-messing-with-my-subnormals (a discussion of the pitfalls of `-ffast-math`; it is about Python, but the underlying issues appear relevant).