GR_monopole test fails CI occasionally #398

Open jmstone opened 1 year ago

In GitLab by @jmstone216 on Aug 16, 2023, 13:54

When merging the bump to Kokkos 4.1.0, we noticed that the gr_monopole test in CI fails infrequently and in unpredictable ways. Is this a race condition? Something else? This needs to be explored further, but it does not seem to be a fatal issue at the moment.

A possibly related issue is that more CI tests are needed. We should add regression tests for bitwise compatibility, and for all the new physics and features that have been added lately.

In GitLab by @jfields7 on Nov 1, 2023, 15:23

Are any of the errors invalid memory accesses? If so, I wonder if this is related to the scratch memory issues that crept into Kokkos during the 4.1.0 release.

In GitLab by @jmstone216 on Nov 3, 2023, 09:18

The code runs without crashing; it is just that the L1 error in the test is too large, so the test fails by the criteria we have set. Currently I have no idea why the error is not completely deterministic, and that is certainly worrying.
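For readers unfamiliar with this kind of pass/fail criterion, here is a minimal sketch of an L1-error check. It is an illustration only and makes several assumptions: the function names `L1Error` and `PassesTolerance` are hypothetical, the error is taken to be the mean absolute difference against a reference solution, and the tolerance is whatever the caller supplies. The actual gr_monopole test may define its norm and threshold differently.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not AthenaK's implementation): mean absolute
// difference between a computed field and a reference field.
double L1Error(const std::vector<double>& computed,
               const std::vector<double>& reference) {
  // Precondition: both fields cover the same, non-empty set of cells.
  assert(computed.size() == reference.size() && !computed.empty());
  double sum = 0.0;
  for (std::size_t i = 0; i < computed.size(); ++i) {
    sum += std::fabs(computed[i] - reference[i]);
  }
  return sum / static_cast<double>(computed.size());
}

// Hypothetical pass/fail criterion: the test passes when the error does
// not exceed the tolerance. A NaN error compares false and so fails.
bool PassesTolerance(double l1_error, double tolerance) {
  return l1_error <= tolerance;
}
```

With a criterion of this shape, a run that completes without crashing can still fail if the error lands just above the threshold, which is the situation described here.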

In GitLab by @jfields7 on Jun 13, 2024, 17:51

I ran into this issue while preparing !166 and spent some time looking at it. Here's a summary of what I've found so far:

1. Errors are not consistent between CPU and GPU runs, though they do seem to be deterministic on my personal machine.
2. Errors are not consistent between master and z4c-matter-rebase, though changes in the latter are mostly independent of the standard AthenaK GRMHD solver.
3. CPU errors seem to be independent of the number of OpenMP threads. Along with point 1, this may suggest the issue is not a race condition or a problem with a non-deterministic reduction operation.
4. Running on CPU with -fsanitize=address and -O2g instead of -O3 produces different results. Since -O3 shouldn't enable unsafe mathematical operations, the issue is probably not directly related to optimization flags. This may suggest a memory issue, but it's probably indirect because neither Kokkos's optional bounds checking nor AddressSanitizer catches it.
5. Compiling with -fPIC and -O3 provides results consistent with -O3 alone.
6. Comparing the VTK data between CPU and GPU confirms that the initial conditions are the same to single precision. However, the evolutions do show small differences.
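To illustrate why independence from the thread count is informative, the sketch below shows how the order of a floating-point reduction can change its result. It is a standalone example under stated assumptions, not Kokkos or AthenaK code: `SerialSum` and `ChunkedSum` are hypothetical names, and the chunk size stands in for how a parallel reduction might split work among threads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Left-to-right sum, as a single thread would compute it.
double SerialSum(const std::vector<double>& v) {
  double sum = 0.0;
  for (double x : v) sum += x;
  return sum;
}

// Sum each chunk separately, then combine the partial sums in order.
// This mimics a reduction whose work is split into fixed-size pieces;
// changing `chunk` changes the order of the additions.
double ChunkedSum(const std::vector<double>& v, std::size_t chunk) {
  assert(chunk > 0);
  double total = 0.0;
  for (std::size_t begin = 0; begin < v.size(); begin += chunk) {
    const std::size_t end = std::min(begin + chunk, v.size());
    double partial = 0.0;
    for (std::size_t i = begin; i < end; ++i) partial += v[i];
    total += partial;
  }
  return total;
}
```

For the input `{1e100, 1.0, -1e100, 1.0}`, `SerialSum` returns 1 while `ChunkedSum` with a chunk size of 2 returns 0, because each 1.0 is absorbed when added to a value of magnitude 1e100. If a reduction of this kind were responsible, the error would be expected to vary with the number of threads, which is not what was observed on CPU.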

My best guess at this point is that we're looking for a memory issue somewhere, but it's something very subtle.
