Optimization with 'fast math' produces incorrect results and segfaults

ischoegl commented 2 years ago

Problem description

As reported in #1150, compilation with default optimization options for the Intel compiler suite icx/icpx results in incorrect results. The behavior can be reproduced for gcc with the -ffinite-math-only option (which is one of the -ffast-math flags), e.g. for a vanilla gcc toolchain on Ubuntu 20.04:

$ scons build optimize_flags="-O3 -ffinite-math-only"

The option breaks strict IEEE compliance, as it does "Allow optimizations for floating-point arithmetic that assume that arguments and results are not NaNs or +-Infs." There are some instances where Cantera uses NaN internally which presumably breaks 'fast math' optimizations.

Steps to reproduce

Compile as indicated above.
An attempt to run the test suite results in numerous segfaults.
Output for gcc with fast math (same as for icx in #1150) ...

In [3]: gas2 = ct.Solution('gri30.yaml')

In [4]: gas2.partial_molar_cp
Out[4]: 
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0.])

In [5]: gas2.partial_molar_entropies
Out[5]: 
array([-1.84618157e-12,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06,  5.74342730e+06,  5.74342730e+06,  5.74342730e+06,
        5.74342730e+06])

Behavior

System information

Cantera version: 2.5.1, 2.6.0a3
OS: Ubuntu 20.04 (and presumably others)
Python/MATLAB/other software versions: gcc / icx

Additional context

See initial issue report #1150; a new issue is opened as the behavior is not compiler-dependent.
In case the issue is worth pursuing, some workarounds need to be found for {fmt} output, which currently produces numerous warnings related to NaN compliance.
An alternative check for NaN can be implemented, see SO

bryanwweber commented 2 years ago

@ischoegl and @tpg2114, thanks for digging in to this! Does the -ffast-math actually improve performance for standard workloads that we run? If it does, then this is probably worth digging in to. Otherwise, it might be best to leave the code as-is.

ischoegl commented 2 years ago

Does the -ffast-math actually improve performance for standard workloads that we run?

Based on my understanding, yes. But as IEEE 754 is not enforced (which is where performance gains come from), this may make for some 'fun' refactoring. I mainly created this issue to make sure that users are aware of the limitation. It's somewhat puzzling that icx enables -fp-model fast by default. FWIW, here's a blog post that I found informative and a SO answer, but I didn't dig very deep.

bryanwweber commented 2 years ago

Based on my understanding, yes.

I'm not asking you to do this, but if anyone decides to pursue this issue, it'd be nice to have concrete benchmarks that adding these flags does improve real-world performance that justifies the (presumably) increased code complexity to handle this case.

ischoegl commented 2 years ago

I'm not asking you to do this …

no worries. I’ll pass on this one 😜

speth commented 2 years ago

I investigated this a bit, using Clang++ on Ubuntu 20.04. The only flag that's part of -ffast-math that is a significant problem is -ffinite-math-only. With the rest of the flags enabled by specifying optimize_flags=-O3 -ffast-math -fno-finite-math-only`, I get 4 test failures:

python: test_kinetics.TestReaction.test_Blowers_Masel_change_enthalpy
python: test_reaction.TestBlowersMaselRate.test_from_parts
VCS-LiSi-verbose
cxx-kinetics1

Where the first two at least are just an issue of assertEquals being used where assertNear would be a better choice, and I think the latter two can be resolved without too much difficulty.

Enabling -ffinite-math-only requires eliminating any check that relies on isnan working or even the comparison nan != <some number> returning false. I made some really crude changes to do this in this commit on my fork: https://github.com/speth/cantera/commit/5abdaa8e673827453e5a5d7de6ce4b26f202481f, and did some benchmarking of an ignition delay problem using a couple different mechanisms. For these tests, I used Eigen for the linear algebra and the vendored copy of Sundials, so this is about the maximum impact that these flags can have, since there is very little outside code. I tested both GRI 3.0 and a larger mechanism with ~400 species. What I found was:

Compared to the baseline (-O3, using -O3 -ffast-math -fno-finite-math-only) is about 2% faster in terms of time steps per second.
Compared to the baseline, -O3 -ffast-math is about 4% faster in terms of time steps per second
The differences in the calculations lead to the number of time steps needed for an individual simulation to vary unpredictably, with differences in individual simulations of up to 10%, with the average around 2.5%, so it's hard to say that these optimizations necessarily lead to higher performance on a wall-time basis.

Given the unsatisfactory nature of the changes required to support using -ffinite-math-only and the relatively small performance gains, my recommendation is that we add a configuration-time check for whether isnan works correctly and if not, abort compilation with an error message stating that Cantera doesn't work with this flag.

ischoegl commented 2 years ago

Thanks for looking into this further, @speth! The two non-python tests look familiar, and I agree that assertEquals is easy to fix (should be done regardless).

As an aside, one thing I noticed in my own tests was a plethora of warnings coming out of fmt via AnyMap. In addition, there were also several 'legacy' fmt use cases that caused annoying warnings. Fixing what causes these fmt warnings may be the largest issue on hand, as I don't really like the nuclear option that disables these warnings.

Overall, I agree that it probably doesn't make sense to make this a priority. At the same time, it probably also makes sense to avoid using NaN as a sentinel value (which I have recently used), and to avoid exact / bit-wise equality checks in new code. It should be relatively simple to replace the NaN checks in the new reaction rate evaluators, which probably should happen prior to 2.6. In other words, what I'm arguing for is not to make it a priority to fix, but also not to make it harder to fix going forward.

speth commented 2 years ago

I don't see any warnings related to fmt with either GCC or Clang, so I guess that's specific to the Intel compiler.

I would argue that most of the ways that we use NaN in Cantera are very reasonable, and that the alternatives are often worse. For instance, in the modification I made to the CachedValue class to run with -ffinite-math, the initial cache check value now has to be some arbitrary but finite number. And while it's unlikely that the a cached value would be checked against this initial value and return an erroneous result, it's not impossible.

ischoegl commented 2 years ago

I don't see any warnings related to fmt with either GCC or Clang, so I guess that's specific to the Intel compiler.

That sounds plausible.

I would argue that most of the ways that we use NaN in Cantera are very reasonable, and that the alternatives are often worse.

I tend to agree: using NaN as a sentinel is efficient from a coding perspective. But apparently we're forcing the code to check for this possibility, which is less ideal from a computational perspective (although the penalty appears to be small). I also don't think that we need to avoid NaN as an output value, it's just that there probably need to be internal booleans that replace the checks. Overall, I don't think that there's any urgency to 'fix' this issue ... although it probably needs to remain open.

g3bk47 commented 2 years ago

@speth @bryanwweber @ischoegl Although the issue has been closed, I am reporting some additional data: I ran a small test program benchmarking the computation of reaction rates with 16 different compilers/versions, each with and without fast-math. Quick summary: using fast-math, g++ becomes about 15 % faster and the Intel compilers less than 5 % faster. The relative accuracy in my test is within 10^-10 %. However, even without fast-math, final results for reaction rates between g++/clang++ and icpc/icpx are slightly different, which I cannot explain at the moment. For all details, see here: https://github.com/g3bk47/CanteraCompilerPerformance Let me know if I should benchmark any other code snippets with this setup.

ischoegl commented 2 years ago

@g3bk47 … thanks for those results! Your comparison suite is impressive. From my perspective, a performance gain of up to 15% would be worth pursuing. While segfaults and erroneous results are fixed here (I.e. those issues are addressed), I believe this warrants an enhancement request?

Cantera / cantera

Optimization with 'fast math' produces incorrect results and segfaults #1155