eagles-project / haero

A toolbox for constructing performance portable aerosol packages

Autotesting environment and Jeff's workstation appear to have issues with EKAT packs in Kohler solve test #196

Closed jeff-cohere closed 3 years ago

jeff-cohere commented 3 years ago

A failing test in #195 alerted Pete and me to a potential issue with the SIMD packs we use from EKAT. There we see NaNs generated in the solve process for the Kohler equation. This issue appears on Release builds with a pack size of 4. The issue doesn't appear on builds using older versions of GCC, and it also doesn't appear in Clang builds (though this is because SIMD ops aren't enabled for Clang).
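For anyone trying to narrow this down, a lane-by-lane NaN check on the solver output can show whether every lane of a pack goes bad or only some of them. The following is a minimal sketch that uses a plain `std::array` as a stand-in for a pack; `count_nan_lanes` is a hypothetical helper and not part of EKAT or Haero.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

// Hypothetical helper: count how many lanes of a fixed-size "pack" hold NaN.
// std::array stands in for a SIMD pack here (an assumption made for
// illustration; the real pack type is not used).
template <std::size_t N>
std::size_t count_nan_lanes(const std::array<double, N>& lanes) {
  std::size_t n = 0;
  for (double x : lanes) {
    if (std::isnan(x)) {
      ++n;
    }
  }
  return n;
}
```

Calling this on a 4-lane result right after the solve, and comparing Debug and Release builds, would indicate whether the NaNs depend on the optimization level.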

To proceed, we either have to

  1. try to understand the issue and form an opinion on whether this is an EKAT bug or a GCC 10 bug
  2. back off to an older version of GCC in our autotesting environment

jeff-cohere commented 3 years ago

For the second approach, I think it might be easiest for me to rebuild the auto-tester Docker containers (which I have to do to add clang-format anyway) and add older versions of GCC. Then the testing workflow can define OMPI_CC, OMPI_CXX, and OMPI_FC to set the appropriate compilers. I'll try to do this today. @pbosler

jeff-cohere commented 3 years ago

Update: I've reproduced this failure on my workstation using GCC 7.5, 8.4, 9.3, and 10.3 by overriding the compiler version with the environment variables above. I've verified that CMake recognizes these compiler versions. So it appears that something else might be happening.
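Independently of what CMake reports, one way to confirm which compiler the MPI wrapper actually invoked is to build a tiny translation unit that reports the compiler's predefined version macros. This is only a sketch: `compiler_id` is a hypothetical helper, not something in Haero or EKAT.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: describe the compiler that built this translation
// unit, using the predefined version macros of GCC and Clang.
std::string compiler_id() {
  std::ostringstream os;
#if defined(__clang__)
  // Clang also defines __GNUC__, so it has to be tested first.
  os << "clang " << __clang_major__ << "." << __clang_minor__ << "."
     << __clang_patchlevel__;
#elif defined(__GNUC__)
  os << "gcc " << __GNUC__ << "." << __GNUC_MINOR__ << "."
     << __GNUC_PATCHLEVEL__;
#else
  os << "unknown compiler";
#endif
#if defined(__OPTIMIZE__)
  // GCC and Clang define __OPTIMIZE__ whenever optimization is enabled.
  os << " (optimized)";
#else
  os << " (unoptimized)";
#endif
  return os.str();
}
```

Printing the returned string from a small test executable built through the same wrapper would show whether the override took effect for that build.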

@pbosler is going to try to reproduce the error on his machine using a newer compiler, since he can't see it using GCC 8.3. It would be strange if this particular version of GCC were different from all the other versions in this very specific way.

jeff-cohere commented 3 years ago

Pete has reproduced the error for GCC 9.2.

jeff-cohere commented 3 years ago

We've fixed the NaN issue by setting the optimization level for Release builds to O2 instead of O3. With this setting, there's still a failure in one of the tests, but I'm not sure we can pin it on a compiler issue at this point.