Bitwise reproducibility - [merged] #535

Closed jmstone closed 2 months ago

In GitLab by @jmstone216 on Jun 27, 2023, 16:06

Merges bitwise -> master

This MR is part of a long saga in trying to ensure results computed with AthenaK are bitwise reproducible on GPUs. That is, if the same executable and input file are run twice on the same GPU, are the resulting outputs bitwise identical?

A number of changes to the boundary and FOFC functions that fix race conditions that resulted in non-bitwise-reproducible results are implemented. In particular, the buffer unpack function for face-centered variables now has an explicit for-loop over boundary buffers for each mesh block inside the kernel to guarantee that buffers are always unpacked in the same order. Some buffers contain overlapping data for face-centered variables in the ghost zones, and thus when the kernel is also parallelized over buffers there is a race condition with the value stored in the ghost zone being from the buffer that finishes last. This is not wrong (to machine precision), but it is not bitwise compatible. Note that the buffer pack/unpack functions for cell-centered variables do not have this problem, as different buffers never touch the same ghost cells.

Similarly, the kernel which computes first-order fluxes in MHD in the FOFC function also had a race condition when two neighboring cells were both flagged for correction, and the flux at the face between them would then be overwritten twice, potentially leading to a race condition. For reasons I don't fully understand, the only fix that worked in this case was breaking the kernel into two, one for computing the flux at the L-face of cells, and one for computing the flux at the R-face of flagged cells.

I also used bitwise compatibility to test if we need all the team_barrier() calls in the buffer pack and unpack functions for boundary conditions. The short answer is for cell-centered variables no, but for face-centered variables yes. Moreover, removing the barriers made absolutely no difference to the performance of the code when run on apollo (A100). So I decided to leave them all in for safety sake.

These changes were tested by running a GRMHD torus accretion simulation with 9 levels of static mesh refinement and both horizon excision and FOFC enabled to 5000M (more than 100K cycles). I found that simply running to say ~100M was not always good enough.

Finally, there are still aspects of bitwise reproducibility I do not understand. For example I found (by accident) that if I run the same executable but change the number of outputs (e.g. add a new output block) in the input file, then the date written by the outputs common to both runs are no longer bitwise identical. Maybe this is a bug in the way the output data is being written? In any case, I have spent way too much time on this already, and I think the changes in this MR are useful and improve the code, but I am not willing to spend any more time at the moment to track down if the code is bitwise reproducible in all usage cases. This is an issue for the future.

In GitLab by @jmstone216 on Jun 27, 2023, 16:06

requested review from @pdmullen

In GitLab by @jmstone216 on Jun 27, 2023, 16:10

added 94 commits

066431e2...ce31adbd - 93 commits from branch master
696882f8 - Merge branch 'master' into 'bitwise'

Compare with previous version

In GitLab by @jmstone216 on Jun 27, 2023, 16:15

added 2 commits

c8e271bb - Style changes to fix linting errors.
70749f8a - Merge branch 'bitwise' of...

Compare with previous version

In GitLab by @jmstone216 on Jun 27, 2023, 18:59

added 1 commit

850ba48a - Fixed compile bug in HLLE solver that crept in during merge.

Compare with previous version

In GitLab by @pdmullen on Jun 27, 2023, 22:54

Commented on src/mhd/mhd_fofc.cpp line 46

if this change actually matters, then we should consider this elsewhere. This usage is widespread across the codebase.

In GitLab by @pdmullen on Jun 27, 2023, 22:54

approved this merge request

In GitLab by @pdmullen on Jun 27, 2023, 23:01

added 1 commit

3deb7ca6 - Adjust tolerances for gr_monopole problem

Compare with previous version

In GitLab by @pdmullen on Jun 27, 2023, 23:10

If any test problem were to fail due to the changes in this MR, I would have guessed it was the gr_monopole problem.

This problem relies on FOFC and straddles the edge of what I think our GRMHD algorithm is capable of integrating.

I suspect that fortuitous round-off error differences in implementations allowed for master to fall within the very strict tolerances I emplaced whereas bitwise fell short. I have adjusted the tolerances in an additional commit to this MR and suspect that the regression suite will now pass.

To give you an idea of how master and bitwise differed on this problem, see the below:

monopole_diff

The regression suite (1) checks that the average $\Omega$ is within some error threshold of the expected value ($\Omega / \Omega_H \sim 0.5$) and (2) checks that the standard deviation falls within some tolerance. I adjusted the latter such that this MR should pass CI.

Please feel free to revert this commit if you disagree with my claim that this MR is not actually introducing a breaking change.

In GitLab by @jmstone216 on Jun 28, 2023, 10:32

Commented on src/mhd/mhd_fofc.cpp line 46

No, this change does not matter, but is a remnant of my trying many things. I forgot to switch it back to references, but I can do that now.

In GitLab by @jmstone216 on Jun 28, 2023, 10:32

resolved all threads

In GitLab by @jmstone216 on Jun 28, 2023, 10:37

It is disappointing that bitwise increases the error and standard deviation in the grmonopole. But I agree, it doesn't seem to break the results, just increase the error. Should we be worried about the point at \theta{KS} ~ 1.2? Both bitwise and master show an anomalous value there. MeshBlock boundary?

In GitLab by @pdmullen on Jun 28, 2023, 11:47

In my experience, the standard deviation decreases as one lowers the ceiling on the normal frame Lorentz factor or as one implements more robust fallback strategies (actually, the lowest standard deviation I have obtained came about from a strategy where one averages neighbors in the event of C2P failure). Hence, I suspect that outliers in the plot above are just locations where C2P gave trash values.

It is curious, however, that an outlier seems to consistently appear at $\theta\mathrm{KS} \sim 1.2$. The problem is not invoking SMR, so it is not prolongation of cons vs prims. The mesh decomposition invokes two MeshBlocks with a logical boundary at $x = 0$. $\theta\mathrm{KS} \sim 1.2$ will mostly be sampled at $\phi_\mathrm{KS}$ locations where there is no intersection with the MeshBlock boundary, however, data near the MeshBlock boundary may enter some samples. I don't think I can say more without digging deeper.

In GitLab by @jmstone216 on Jun 29, 2023, 09:54

mentioned in commit 5b517b0feab0992078e4aab588a05428aa9de55e

IAS-Astrophysics / athenak

Bitwise reproducibility - [merged] #535