easybuilders / easybuild-easyconfigs

A collection of easyconfig files that describe which software to build using which build options with EasyBuild.
https://easybuild.io
GNU General Public License v2.0

incorrect results for OpenFOAM v10 + v11 when built with GCC 11.3.0 or 12.3.0 and `-ftree-vectorize` #20927

Open boegel opened 3 months ago

boegel commented 3 months ago

We got a report that installations of OpenFOAM (OpenFOAM-10-foss-2022a.eb, OpenFOAM-10-foss-2023a.eb, OpenFOAM-11-foss-2023a.eb) produce incorrect results, see also https://bugs.openfoam.org/view.php?id=4076 .

After a lot of digging, we figured out that this problem is caused by a compiler bug: if these easyconfigs are changed to avoid the use of the -ftree-vectorize compiler option, the problem is resolved:

toolchainopts = {'vectorize': False}

nicolasdlss commented 3 months ago

The incorrect results when using -ftree-vectorize are noticeable when using strict solver tolerances. On the one hand, with -ftree-vectorize the results are (slightly) different from those without it, and also different from those of older versions (8 and 9). On the other hand, and more importantly, when using -ftree-vectorize the results depend on the partitioning. For some partitionings a large error remained after convergence, several orders of magnitude bigger than the solver tolerances used. See also https://bugs.openfoam.org/view.php?id=4076.
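The partitioning dependence described above can be checked mechanically. The following is a minimal sketch, not the method used in the bug report: it assumes you have already extracted one scalar result per partitioning (for example a probed field value), and the function names and the `safety_factor` parameter are hypothetical.

```python
def relative_spread(values):
    """Return (max - min) / max(|v|) over results, one value per partitioning."""
    if not values:
        raise ValueError("need at least one value")
    scale = max(abs(v) for v in values)
    if scale == 0.0:
        # All results are exactly zero, so there is no spread to report.
        return 0.0
    return (max(values) - min(values)) / scale


def partitioning_consistent(values, solver_tolerance, safety_factor=10.0):
    """True if the spread across partitionings stays within a small
    multiple (safety_factor, an assumed margin) of the solver tolerance."""
    return relative_spread(values) <= safety_factor * solver_tolerance
```

A spread that exceeds the solver tolerance by orders of magnitude, as reported here, would make `partitioning_consistent` return `False`.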

bartoldeman commented 3 months ago

Is the issue specific to 11.3.0 and 12.3.0 only or does it also happen with earlier/later versions of GCC?

joris13 commented 3 months ago

When testing on a lid-driven cavity case with icoFoam, building with or without -ftree-vectorize makes no significant performance difference, so this option can be avoided without a heavy penalty.

[Image: Comparison]

boegel commented 2 months ago

> Is the issue specific to 11.3.0 and 12.3.0 only or does it also happen with earlier/later versions of GCC?

The problem doesn't appear with OpenFOAM easyconfigs that use a toolchain older than 2022a, so it seems like we're hitting something that was introduced in the auto-vectorizer of GCC 11.x and is still present in more recent versions.

Micket commented 2 months ago

Related, same(?) or at least very similar tree-vectorizer issue: https://github.com/easybuilders/easybuild-easyconfigs/pull/15495 In that case, -O3 also "solved" the issue.

The actual issue here is that OpenFOAM doesn't have the robust test suite that it desperately needs. I would also really like to test this with ASan enabled (-fsanitize=address). If someone could post a jobscript for running the reproducing test case I would appreciate it.

boegel commented 2 months ago

@nicolasdlss Can you provide some clear instructions on how to reproduce the problem, and what to look out for?

@Micket Since we already disable -ftree-vectorize for the other variant of OpenFOAM (ever since #15495), I think it makes sense to also do that for the openfoam.org variant, as proposed in #20958, especially since the performance impact seems very minimal...

nicolasdlss commented 2 months ago

This is a fast-running test (on the order of 1 minute) with a numerical outcome that can easily be used for comparison. Instructions on how to use the test and interpret the results are given in the README. GitHub_reproducer.tar.gz

boegel commented 2 months ago

Re-opening this: I would really like to have some kind of sanity check in place for this, so we don't re-introduce this problem in the future.

@nicolasdlss How feasible would it be to leverage a tutorial case that is included with OpenFOAM for this?

We could also look into integrating your minimal test case as a sanity check of course, but that's a bit messier since it involves external input files, etc.

klust commented 2 months ago

After hearing about the issue in the EasyBuild conference call of July 17 2024 I checked the setup on LUMI to see if we are also affected.

It turns out that by default OpenFOAM with the GNU compilers will use -O3 (but not -ftree-vectorize). So this was likely also what the developers used when trying to reproduce the bug reported by Nicolas and is likely the configuration they use when testing OpenFOAM.

It might be a good idea (and for more packages actually) to try to check which compiler options developers use, and if they have a reasonable level of optimization, simply configure the toolchainopts to try to reproduce them as closely as possible to avoid such problems.

Micket commented 2 months ago

-O3 enables -ftree-vectorize. In fact, starting with GCC 12, -O2 enables the tree vectorizer as well (though less aggressively). The fact that observable changes in the computations happen when changing any standard compilation flag is a sign that something is wrong with the code: some undefined behavior, a race condition, or poorly written manual SIMD code. The fact that this happens when we actually lower the optimization flags is just extra worrying. We aren't talking about fast-math or anything fragile like that.
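Which optimization level turns on the vectorizer for a given compiler can be checked directly rather than assumed. Below is a minimal sketch that parses the output of `gcc -Q --help=optimizers <level>`. It assumes each flag is listed on its own line followed by `[enabled]` or `[disabled]`, and it queries `-ftree-loop-vectorize` and `-ftree-slp-vectorize`, the two passes that `-ftree-vectorize` switches on, because the umbrella flag itself may be listed without a status. The function names are hypothetical.

```python
import re
import subprocess


def parse_optimizer_flag(help_output, flag):
    """Return True/False if `flag` is reported as enabled/disabled,
    or None if it is not listed with a status."""
    pattern = re.compile(
        r"^[ \t]*" + re.escape(flag) + r"[ \t]+\[(enabled|disabled)\][ \t]*$",
        re.MULTILINE,
    )
    match = pattern.search(help_output)
    if match is None:
        return None
    return match.group(1) == "enabled"


def query_gcc(opt_level, flag, gcc="gcc"):
    """Ask the given gcc binary whether `flag` is active at `opt_level`."""
    result = subprocess.run(
        [gcc, "-Q", "--help=optimizers", opt_level],
        capture_output=True, text=True, check=True,
    )
    return parse_optimizer_flag(result.stdout, flag)


# Example usage (requires gcc on the PATH):
#   for level in ("-O2", "-O3"):
#       print(level, query_gcc(level, "-ftree-loop-vectorize"),
#             query_gcc(level, "-ftree-slp-vectorize"))
```

Running the usage lines with each GCC module loaded shows, per compiler version, whether a given level already implies vectorization.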

Trying to mimic their defaults is just praying that it happens to work without testing. This will almost certainly just randomly occur whenever someone decides to try a different compiler, version, or CPU architecture. Given that this has been known to occur across several GCC versions (that have not been known to have any other issues in anything we build), I still believe this is just broken OpenFOAM code and it really just needs to be fixed. A good start would be some sanitizers, in particular -fsanitize=address and/or -fsanitize=undefined (though there are also others that could be interesting).