ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.
https://ecp-warpx.github.io

RZ CI Tests: Discrepancies Between Local v. Azure Benchmarks #2132

Open EZoni opened 3 years ago

EZoni commented 3 years ago

A number of developers have observed occasional discrepancies between the CI benchmarks generated on local machines and the ones generated on Azure, for some of the RZ automated tests.

Cc: @dpgrote @MaxThevenet @RevathiJambunathan @oshapoval @RemiLehe @ax3l Please feel free to edit this description by adding the PRs and tests where you encountered this issue (or do so in the future):

ax3l commented 2 years ago

On DGX, I saw segfaults in FFTW: https://github.com/ECP-WarpX/WarpX/blob/3a6650e45ef4cf9fed32e62228f4f45f4e74303a/Source/FieldSolver/SpectralSolver/SpectralFieldData.cpp#L165-L168

This could also be a compiler/MPI mismatch, but it's worth double-checking our usage of the FFTW API contract.

Update: probably just an incompatible FFTW module that I loaded quickly on DGX

ax3l commented 2 years ago

Proposed fix for multi_J_rz_psatd and galilean_rz_psatd stability in #2404

ax3l commented 2 years ago

@EZoni still sees a discrepancy locally. Let's see if #2302 gets better now for multi_J_rz_psatd.

ax3l commented 2 years ago

multi_J_rz_psatd will be temporarily disabled in #2411 until we find the origin of the fluctuation.

ax3l commented 2 years ago

New candidate spotted: ElectrostaticSphereEB_mixedBCs differs in analytics (via #2411)

ax3l commented 2 years ago

Likely related: simulations with a Gaussian beam might have had system-dependent init routines before #2522 / #2523 was fixed. (Thanks for the fix @RemiLehe :tada: )

The fix changed:

ax3l commented 1 year ago

Although #3965 does not change any test, the test galilean_rz_psatd shows changes in the checksum:

2023-06-06T13:13:03.6959515Z Check numerical stability:
2023-06-06T13:13:03.6959673Z err_energy = 1.862138691139192e-10
2023-06-06T13:13:03.6959806Z tol_energy = 1e-08
2023-06-06T13:13:03.6959988Z ERROR: Benchmark and plotfile checksum have different value for key [lev=0,Ez]
2023-06-06T13:13:03.6960172Z Benchmark: [lev=0,Ez] 4.124588451444761e+03
2023-06-06T13:13:03.6960336Z Plotfile : [lev=0,Ez] 4.124588455632591e+03
2023-06-06T13:13:03.6960624Z Absolute error: 4.19e-06
2023-06-06T13:13:03.6960760Z Relative error: 1.02e-09
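
As a sanity check on the numbers in this log, here is a minimal sketch that recomputes both errors from the two printed checksum values. It assumes the relative error is the absolute error divided by the magnitude of the benchmark value; the variable names are illustrative and not taken from the WarpX checksum scripts.

```python
# Values copied from the CI log above for key [lev=0,Ez].
benchmark = 4.124588451444761e+03
plotfile = 4.124588455632591e+03

# Assumption: relative error = absolute error / |benchmark|.
abs_err = abs(plotfile - benchmark)
rel_err = abs_err / abs(benchmark)

print(f"Absolute error: {abs_err:.2e}")  # Absolute error: 4.19e-06
print(f"Relative error: {rel_err:.2e}")  # Relative error: 1.02e-09
```

The two values agree to about nine significant digits, which is far above double-precision round-off for a single operation (about 1e-16) but small in absolute terms.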

Are we perhaps using OpenMP in the FFTs? That could produce slightly different orders of operations. We could try disabling OpenMP for all RZ tests if we want to rule out parallelization (and check that there is no manual threading in FFTW beyond OpenMP). -> Update: only one OpenMP thread is used. -> Update: only one MPI rank is used.

For reductions (e.g., sums in the FFTs), this difference is reasonably large relative to machine precision. (The accumulated rounding error of a reduction is significantly larger than that of a point-wise operation.)

Another possible cause is that the machines have slightly different microarchitectures (CPUs) and therefore use different vectorization intrinsics (e.g., processing 1, 2 or 4 values at a time).
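
To illustrate why the number of values processed at a time can change a reduction, here is a small sketch in plain Python. The helper `lane_sum` is hypothetical and only mimics a vectorized loop by keeping several independent partial sums; it is not how FFTW or WarpX actually performs its sums.

```python
import math
import random

def lane_sum(values, lanes):
    """Sum `values` with `lanes` independent partial sums, then combine them.

    Hypothetical stand-in for a loop vectorized over `lanes` values per step.
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total

# Extreme case: the grouping decides whether the small terms survive.
extreme = [1e16, 1.0, -1e16, 1.0]
print(lane_sum(extreme, 1), lane_sum(extreme, 2))  # 1.0 2.0

# Milder case: random data, compared against a correctly rounded sum.
random.seed(0)
values = [random.random() for _ in range(100_000)]
reference = math.fsum(values)
for lanes in (1, 2, 4):
    s = lane_sum(values, lanes)
    print(lanes, repr(s), abs(s - reference) / reference)
```

The data and operations are identical in every case; only the order of the additions changes. For well-conditioned sums the deviation stays near round-off, while sums with strong cancellation can amplify it considerably.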

Personally, I would increase the tolerances here for affected tests.