Cantera / enhancements

Repository for proposed and ongoing enhancements to Cantera

Make Cantera compatible with -ffast-math #125

Open g3bk47 opened 2 years ago

g3bk47 commented 2 years ago

Abstract

After the recent discussion about compiling Cantera with -ffast-math (which is basically the default for the Intel compilers), I set up a benchmark suite to test the accuracy and computational performance of Cantera when using different optimization flags and compilers. The relevant discussions can be found here: https://github.com/Cantera/cantera/issues/1155 https://github.com/Cantera/cantera/issues/1150 https://github.com/Cantera/cantera/commit/9daebd9c39a1c891dbd6bc8e50a332c80d23ee5a

Motivation/Results

I ran different sample programs (evaluation of reaction rates, 0D reactor and 1D flame) with 16 different compilers/versions and 8 different optimization settings. The findings can be summarized as follows:

From my tests above, the current defaults of Cantera seem to be the optimal compromise between performance and safety:

Since `-ffast-math` without `-fno-finite-math-only` can improve the performance of g++ for simple cases like the evaluation of reaction rates by 15 %, it would be nice for Cantera to be compatible with this option, e.g. for users coupling Cantera to other CFD codes. However, this means that the internal use of NaNs and Infs would have to be removed.

Let me know if you have any other interesting code snippets that should be benchmarked to aid the discussion.

References

For all details of my benchmark suite, please see: https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/README.md

ischoegl commented 2 years ago

Hi @g3bk47 ... thanks again for setting up these tests. Could you clarify your comments on the convergence issues that you observed?

@speth commented with separate testing in Cantera/cantera#1155 (for ignition delay) that ...

> The differences in the calculations lead to the number of time steps needed for an individual simulation to vary unpredictably, with differences in individual simulations of up to 10%, with the average around 2.5%, so it's hard to say that these optimizations necessarily lead to higher performance on a wall-time basis.

Is this consistent with your conclusions?

g3bk47 commented 2 years ago

Hi, in my work, I am using Cantera to calculate reaction rates and mixture-averaged diffusion properties, which does not include any type of iteration. So I would fully profit from the ~15% performance increase with gcc.

In my tests, there was only one case where more aggressive optimization settings affected performance negatively, i.e. the 1D flame with very tight tolerances (while the 1D flame with more relaxed tolerances showed some speedup). However, the negative effect in the case of tight tolerances is quite extreme. Just to pick a few data points from Table 13 (https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/README.md, https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/oneD.cpp): Running the 1D flame requires the following wall clock time:

So in the three test cases I have looked at (reaction rates, 0D, 1D), the impact of more aggressive compiler optimizations on performance was quite binary: either using fastmath led to some speedup of <= 15%, or to a massive slowdown of 3x to 40x. I did not look into the solver output to see what the actual reason for that is, or whether compiling all external libraries with O3 and only Cantera with O3 and fastmath might fix the problem. With the data so far, my conclusion would be that enabling fastmath only makes sense if Cantera is used for problems that do not involve iterative solutions, which arguably excludes most use cases of Cantera.

Of course, my tests so far were limited to just a handful of sample programs. Feel free to suggest any other test programs I could throw at the test suite, or let me know if I should look more into the cases with massive slowdown.

ischoegl commented 2 years ago

Thanks, @g3bk47! ... for my own part, I am planning to revise some of the instances where I recently introduced NaNs as sentinel values. Speedups of 15% are very intriguing, despite the troubling slowdowns observed in other instances. One thing that I can say about convergence is that even without fastmath, the solvers sometimes produced quirky failures for me that usually took 'playing' with tolerances to resolve.

speth commented 2 years ago

Thanks for the extensive set of tests, @g3bk47. I tried replicating some of these results using your reactionRates.cpp test on some of my machines, and interestingly, the results are substantially different. I didn't run quite as many cases, but here are the results for what I did:

| compiler | optimization flags | median runtime (s) | std. deviation (s) |
|---|---|---|---|
| GCC 9.3 | `-O3` | 16.303 | 0.202 |
| GCC 9.3 | `-O3 -ffast-math` | 15.653 | 0.192 |
| GCC 9.3 | `-O3 -ffast-math -fno-finite-math-only` | 15.757 | 0.236 |
| ICPX 2021.3.0 | `-O3 -fp-model precise` | 15.018 | 0.211 |
| ICPX 2021.3.0 | `-O3 -fp-model fast -ffast-math` | 15.010 | 0.203 |

This is using your grep-based patch to the Cantera source, and 20 runs of the test program in each case. Two results stand out here in comparison to your runs, and I'm curious about both of them.

I ran these tests on a system with Xeon E5-2650 v4 (2.20GHz) CPUs, which are a bit older (2016 vintage, rather than your 2021 processors). That may provide some explanation of what's happening with the Intel compiler -- it may be able to generate code that uses some processor features that GCC hasn't been updated to use yet.

g3bk47 commented 2 years ago

Thanks for the interesting results, @speth. I agree that the difference between fastmath and fastmath+nofinitemath is generally small, but at least I had one case where the difference was 10%. As far as I know, even if the Intel compilers are not told what the target CPU is, they can generate different code paths for different CPU types. Maybe this plays a role here. The only way to find out would probably be to look at the generated assembly.

Just a few additional thoughts on what might cause the differences in our results (apart from the different CPUs):

I will run my test suite again on another cluster. There, 15 different compilers/versions and two different compute nodes are available:

Maybe I can reproduce your results there or at least provide additional data points.

speth commented 2 years ago

Yes, you're correct that the Intel compiler (and maybe others) can generate multiple code paths and select different ones at run time based on the specific processor. The most infamous use of this has been to use less-optimal code paths when running on AMD processors. In this case, that behavior may be why there's so little impact of telling it to emit code for your specific processor rather than the more backwards-compatible default, if it's able to use the more optimized path opportunistically.

I did not recompile any other libraries that Cantera links to. However, for the code in question, there isn't much happening outside calls to the C++ standard library. The rate evaluations don't even use Eigen.

For my system, libm comes from glibc version 2.31, which is what is used in Ubuntu 20.04. I see that the binaries built with icpx also link to the system libm, although I don't know if it ends up calling the implementation of exp from that library or something provided by one of the Intel-specific libraries.

g3bk47 commented 2 years ago

The libc on all clusters I have access to is actually older than yours (version 2.28). Another wild guess why our results differ might be that your CPU clocks down when it gets too hot, so that there is some kind of lower bound for performance?

I mentioned Eigen because it appears here https://github.com/Cantera/cantera/blob/main/include/cantera/kinetics/StoichManager.h#L617-L618. But I am not entirely sure if this is used in the sample program.

I ran my test suite again on the two other systems. One of the systems uses an Intel Xeon E5-2660 v4 CPU, which sounds close to your setup. However, I again got pretty much the opposite of your results: using fastmath still gives a ~10% performance gain over fastmath+nofinitemath and the Intel compiler is significantly faster than gcc. For all details, please see the new results at https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/NewSystems.md.

This time, I measured the code performance with a profiler and looked into the generated assembly (again, see https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/NewSystems.md#2-profiling for more details). To briefly sum up my first findings:

With these preliminary findings, my performance measurements sound plausible to me.

speth commented 2 years ago

Ah, those profiling results are very interesting. I'm confused as to what's happening on my system -- when compiling with GCC and -O3 -ffast-math, I still see the calls to __GI___exp wrapping the calls to __ieee754_exp_fma, which would pretty easily explain why I'm not seeing any performance improvement, but I don't understand why it's not skipping this, if that is indeed an error handling wrapper as you've suggested. I'll have to try again on a couple of different machines to see if this is indeed associated with the version of glibc.

ischoegl commented 2 years ago

PR Cantera/cantera#1330 ~proposes to remove~ (now merged) removes NaN sentinel values from ReactionRate code.

ischoegl commented 1 year ago

Reposting here as it has come up elsewhere: someones-been-messing-with-my-subnormals (a discussion of the pitfalls of `-ffast-math`; it is about Python, but the underlying issues appear relevant).