g3bk47 opened this issue 2 years ago (status: Open)
Hi @g3bk47 ... thanks again for setting up these tests. Could you clarify your comments on the convergence issues that you observed?
@speth commented with separate testing in Cantera/cantera#1155 (for ignition delay) that ...
The differences in the calculations lead to the number of time steps needed for an individual simulation to vary unpredictably, with differences in individual simulations of up to 10%, with the average around 2.5%, so it's hard to say that these optimizations necessarily lead to higher performance on a wall-time basis.
Is this consistent with your conclusions?
Hi, in my work, I am using Cantera to calculate reaction rates and mixture-averaged diffusion properties, which does not involve any kind of iteration. So I would profit fully from the ~15% performance increase with gcc.
In my tests, there was only one case where more aggressive optimization settings affected performance negatively: the 1D flame with very tight tolerances (the 1D flame with more relaxed tolerances showed some speedup). However, the negative effect in the tight-tolerance case is quite extreme. Just to pick a few data points from Table 13 (https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/README.md, https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/oneD.cpp): running the 1D flame requires the following wall clock time:
So in the three test cases I have looked at (reaction rates, 0D, 1D), the impact of more aggressive compiler optimizations on performance was quite binary: either using `fastmath` led to some speedup of <= 15%, or it caused a massive slowdown of 3x to 40x. I did not look into the solver output to see what the actual reason for that is, or whether compiling all external libraries with `O3` and only Cantera with `O3` and `fastmath` might fix the problem. With the data so far, my conclusion would be that enabling `fastmath` only makes sense if Cantera is used solely for problems that do not involve iterative solutions, which arguably excludes most use cases of Cantera.

Of course, my tests so far were limited to just a handful of sample programs. Feel free to suggest any other test programs I could throw at the test suite, or let me know if I should look more into the cases with massive slowdown.
Thanks, @g3bk47! ... for my own part, I am planning to revise some of the instances where I recently introduced `NaN`s as sentinel values. Speedups of 15% are very intriguing, despite the troubling slowdowns observed in other instances. One thing that I can say about convergence is that even without `fastmath`, the solvers sometimes produced quirky failures for me that usually took 'playing' with tolerances to resolve.
Thanks for the extensive set of tests, @g3bk47. I tried replicating some of these results using your `reactionRates.cpp` test on some of my machines, and interestingly, the results are substantially different. I didn't run quite as many cases, but here are the results for what I did:
compiler | optimization flags | median runtime (s) | std. deviation (s) |
---|---|---|---|
GCC 9.3 | -O3 | 16.303 | 0.202 |
GCC 9.3 | -O3 -ffast-math | 15.653 | 0.192 |
GCC 9.3 | -O3 -ffast-math -fno-finite-math-only | 15.757 | 0.236 |
ICPX 2021.3.0 | -O3 -fp-model precise | 15.018 | 0.211 |
ICPX 2021.3.0 | -O3 -fp-model fast -ffast-math | 15.010 | 0.203 |
This is using your `grep`-based patch to the Cantera source, and 20 runs of the test program in each case. Two results stand out here in comparison to your runs, and I'm curious about both of them.
The `NaN`-related `-ffinite-math-only` behavior is only netting a 0.6% speedup, which doesn't seem worth pushing for.

I ran these tests on a system with Xeon E5-2650 v4 (2.20GHz) CPUs, which are a bit older (2016 vintage, rather than your 2021 processors). That may provide some explanation of what's happening with the Intel compiler -- it may be able to generate code that uses some processor features that GCC hasn't been updated to use yet.
Thanks for the interesting results, @speth. I agree that the difference between `fastmath` and `fastmath+nofinitemath` is generally small, but at least I had one case where the difference was 10%.
As far as I know, even if the Intel compilers are not told what the target CPU is, they can generate different code paths for different CPU types. Maybe this plays a role here. The only way to find out would probably be to look at the generated assembly.
Just a few additional thoughts on what might cause the differences in our results (apart from the different CPUs):
I will run my test suite again on another cluster. There, 15 different compilers/versions and two different compute nodes are available:
Maybe I can reproduce your results there or at least provide additional data points.
Yes, you're correct that the Intel compiler (and maybe others) can generate multiple code paths and select different ones at run time based on the specific processor. The most infamous use of this has been to use less-optimal code paths when running on AMD processors. In this case, that behavior may be why there's so little impact of telling it to emit code for your specific processor rather than the more backwards-compatible default, if it's able to use the more optimized path opportunistically.
I did not recompile any other libraries that Cantera links to. However, for the code in question, there isn't much happening outside calls to the C++ standard library. The rate evaluations don't even use Eigen.
For my system, `libm` comes from `glibc` version 2.31, which is what is used in Ubuntu 20.04. I see that the binaries built with `icpx` also link to the system `libm`, although I don't know if it ends up calling the implementation of `exp` from that library or something provided by one of the Intel-specific libraries.
The libc on all clusters I have access to is actually older than yours (version 2.28). Another wild guess as to why our results differ: maybe your CPU clocks down when it gets too hot, so that there is some kind of lower bound on performance?
I mentioned Eigen because it appears here https://github.com/Cantera/cantera/blob/main/include/cantera/kinetics/StoichManager.h#L617-L618. But I am not entirely sure if this is used in the sample program.
I ran my test suite again on the two other systems. One of the systems uses an Intel Xeon E5-2660 v4 CPU, which sounds close to your setup. However, I again got pretty much the opposite of your results: using `fastmath` still gives a ~10% performance gain over `fastmath+nofinitemath`, and the Intel compiler is significantly faster than gcc. For all details, please see the new results at https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/NewSystems.md.
This time, I measured the code performance with a profiler and looked into the generated assembly (again, see https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/NewSystems.md#2-profiling for more details). To briefly sum up my first findings: `fastmath+nofinitemath` creates a 10% overhead on my systems due to the use of a different version of `exp` that includes additional error handling around the `exp` calls; see the link above for details. With these preliminary findings, my performance measurements sound plausible to me.
Ah, those profiling results are very interesting. I'm confused as to what's happening on my system -- when compiling with GCC and `-O3 -ffast-math`, I still see the calls to `__GI___exp` wrapping the calls to `__ieee754_exp_fma`, which would pretty easily explain why I'm not seeing any performance improvement, but I don't understand why it's not skipping this, if that is indeed an error handling wrapper as you've suggested. I'll have to try again on a couple of different machines to see if this is indeed associated with the version of `glibc`.
PR Cantera/cantera#1330 ~~proposes to remove~~ (now merged) removes `NaN` sentinel values from `ReactionRate` code.
Reposting here as it has come up elsewhere: someones-been-messing-with-my-subnormals (a discussion about the pitfalls of `-ffast-math`; it is about Python, but the underlying issues appear relevant).
Abstract
After the recent discussion about compiling Cantera with `-ffast-math` (which is basically the default for the Intel compilers), I set up a benchmark suite to test the accuracy and computational performance of Cantera when using different optimization flags and compilers. The relevant discussions can be found here:
https://github.com/Cantera/cantera/issues/1155
https://github.com/Cantera/cantera/issues/1150
https://github.com/Cantera/cantera/commit/9daebd9c39a1c891dbd6bc8e50a332c80d23ee5a

Motivation/Results
I ran different sample programs (evaluation of reaction rates, 0D reactor and 1D flame) with 16 different compilers/versions and 8 different optimization settings. The findings can be summarized as follows:

- `g++/clang++` do not yield the same results (bitwise) in general.
- For `g++`, `O2` generates slightly slower code compared to `O3`, but without affecting the results.
- `fastmath` increases performance by 10 % to 15 % for `g++`. Using `fastmath` together with `no-finite-math-only` increases performance by only 5 %. However, both options can drastically deteriorate convergence behavior and should therefore not be the default.
- For the Intel compilers, `fp-model strict` is slightly slower than `fp-model precise`, but the accuracy is the same in all test cases. `fastmath` together with `no-finite-math-only` produces slightly faster code and can be used together with Cantera; however, convergence might again deteriorate drastically. In general, the different optimization settings have much less effect for the Intel compilers than for `g++/clang++`.

From my tests above, the current defaults of Cantera seem to be the optimal compromise between performance and safety:

- `O3` for `g++/clang++`
- `O3 -fp-model precise` for the Intel compilers

Since `fastmath` without `no-finite-math-only` can improve the performance of `g++` for simple cases like the evaluation of reaction rates by 15 %, it would be nice for Cantera to be compatible with this option, e.g. for users coupling Cantera to other CFD codes. However, this means that the internal use of NaNs and Infs would have to be removed.

Let me know if you have any other interesting code snippets that should be benchmarked to aid the discussion.
References
For all details of my benchmark suite, please see: https://github.com/g3bk47/CanteraCompilerPerformance/blob/main/README.md