ECP-WarpX / WarpX

WarpX is an advanced electromagnetic & electrostatic Particle-In-Cell code.

https://ecp-warpx.github.io

RZ CI Tests: Discrepancies Between Local v. Azure Benchmarks #2132

Open EZoni opened 3 years ago

EZoni commented 3 years ago

A number of developers have observed occasional discrepancies between the CI benchmarks generated on local machines and the ones generated on Azure, for some of the RZ automated tests.

Cc: @dpgrote @MaxThevenet @RevathiJambunathan @oshapoval @RemiLehe @ax3l

Please feel free to edit this description by adding the PRs and tests where you encountered this issue (now or in the future):

#2029: galilean_rz_psatd (see comment)
#2111: multi_J_rz_psatd
#2302: multi_J_rz_psatd
#2411: ElectrostaticSphereEB_mixedBCs

ax3l commented 2 years ago

On DGX, I saw segfaults in FFTW: https://github.com/ECP-WarpX/WarpX/blob/3a6650e45ef4cf9fed32e62228f4f45f4e74303a/Source/FieldSolver/SpectralSolver/SpectralFieldData.cpp#L165-L168

This could also be a compiler/MPI mismatch, but it's worth double-checking our usage of the FFTW API contract.

Update: probably just an incompatible FFTW module that I loaded quickly on DGX.

ax3l commented 2 years ago

Proposed fix for multi_J_rz_psatd and galilean_rz_psatd stability in #2404

ax3l commented 2 years ago

@EZoni still sees a discrepancy locally. Let's see if #2302 gets better now for multi_J_rz_psatd.

ax3l commented 2 years ago

multi_J_rz_psatd will be temporarily disabled in #2411 until we find the origin of the fluctuation.

ax3l commented 2 years ago

New candidate spotted: ElectrostaticSphereEB_mixedBCs differs in analytics (via #2411)

ax3l commented 2 years ago

Likely related: simulations with a Gaussian beam might have had system-dependent init routines before #2522 / #2523 was fixed. (Thanks for the fix @RemiLehe :tada: )

The fix changed:

LaserAccelerationBoost
LaserAccelerationMR
LaserAccelerationRZ
PlasmaAccelerationBoost2d
PlasmaAccelerationBoost3d
PlasmaAccelerationMR
Python_gaussian_beam
RefinedInjection
RigidInjection_lab
comoving_2d_psatd_hybrid
divb_cleaning_3d
galilean_2d_psatd_hybrid
initial_distribution
restart
restart_psatd
restart_psatd_time_avg

ax3l commented 1 year ago

Although #3965 does not change any test, the test galilean_rz_psatd shows changes in the checksum:

2023-06-06T13:13:03.6959515Z Check numerical stability:
2023-06-06T13:13:03.6959673Z err_energy = 1.862138691139192e-10
2023-06-06T13:13:03.6959806Z tol_energy = 1e-08
2023-06-06T13:13:03.6959988Z ERROR: Benchmark and plotfile checksum have different value for key [lev=0,Ez]
2023-06-06T13:13:03.6960172Z Benchmark: [lev=0,Ez] 4.124588451444761e+03
2023-06-06T13:13:03.6960336Z Plotfile : [lev=0,Ez] 4.124588455632591e+03
2023-06-06T13:13:03.6960624Z Absolute error: 4.19e-06
2023-06-06T13:13:03.6960760Z Relative error: 1.02e-09

Are we maybe using OpenMP in the FFTs, which could produce slightly different orders of operations? We could try to disable OpenMP for all RZ tests if we want to avoid parallelization (and check that there is no manual threading in FFTW beyond OpenMP). -> Update: only one OpenMP thread. -> Update: only one MPI rank.

For reductions (e.g., sums in the FFTs), this difference is reasonably large for machine precision. (The effective machine precision for reductions is significantly larger than for point-wise operations, because round-off accumulates over many terms.)
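To illustrate the point about reductions, here is a minimal Python sketch (not WarpX code; the helper `relative_spread` and the random input data are made up for illustration). It sums the same values in several random orders and reports how far the naive left-to-right sums drift apart, relative to a correctly rounded reference from `math.fsum`.

```python
import math
import random


def relative_spread(values, n_orders=5, seed=0):
    """Sum the same values in several random orders.

    Returns (max - min) / |exact| over the naive left-to-right sums,
    where `exact` is the correctly rounded sum from math.fsum.
    """
    rng = random.Random(seed)
    exact = math.fsum(values)
    work = list(values)
    sums = []
    for _ in range(n_orders):
        rng.shuffle(work)  # a different order of operations each time
        total = 0.0
        for v in work:
            total += v
        sums.append(total)
    return (max(sums) - min(sums)) / abs(exact)


# Hypothetical input: 100k positive values (no cancellation involved).
rng = random.Random(42)
data = [rng.uniform(0.0, 1.0) for _ in range(100_000)]
spread = relative_spread(data)
print(f"relative spread across summation orders: {spread:.3e}")
```

The spread grows with the number of terms and with cancellation between terms, so a reduction over a full field can sit well above the roughly 1e-16 expected for a single point-wise operation.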

Another issue that could happen is using slightly different microarchitectures (CPUs) and getting different vectorization intrinsics (e.g., processing 1, 2 or 4 values at a time).
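The vectorization effect can be emulated in plain Python. This is only a sketch under the assumption that a vectorized reduction keeps one partial sum per lane and combines them at the end; `lane_sum` is a hypothetical helper, not something from WarpX or FFTW, and real SIMD code may combine lanes differently.

```python
import math


def lane_sum(values, lanes):
    """Emulate a SIMD-style reduction.

    Keeps `lanes` independent partial sums over strided elements,
    then adds the partial sums together at the end.
    """
    partial = [0.0] * lanes
    for i, v in enumerate(values):
        partial[i % lanes] += v
    total = 0.0
    for p in partial:
        total += p
    return total


# Hypothetical input: terms of a slowly converging series.
values = [1.0 / (k + 1) ** 2 for k in range(200_000)]
reference = math.fsum(values)  # correctly rounded reference
for lanes in (1, 2, 4):
    s = lane_sum(values, lanes)
    rel = (s - reference) / reference
    print(f"lanes={lanes}: sum={s:.17g} rel.err={rel:.2e}")
```

The same input can therefore give results that differ in the last digits depending only on how many values are processed at a time, without any change to the source code.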

Personally, I would increase the tolerances here for affected tests.
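As a quick sanity check of what a looser tolerance would mean for the failure above, here is a small sketch that recomputes the relative error from the two Ez values in the log and compares it against two candidate tolerances. The helpers and the candidate values `1e-9` and `1e-8` are assumptions for illustration, not the actual checksum implementation or its configured tolerance.

```python
def relative_error(benchmark, value):
    """Relative deviation of `value` from the stored benchmark."""
    return abs(value - benchmark) / abs(benchmark)


def within_tolerance(benchmark, value, rtol):
    """True if `value` matches the benchmark to relative tolerance `rtol`."""
    return relative_error(benchmark, value) <= rtol


# Values for key [lev=0,Ez], copied from the CI log above.
benchmark_ez = 4.124588451444761e+03
plotfile_ez = 4.124588455632591e+03

err = relative_error(benchmark_ez, plotfile_ez)
print(f"relative error: {err:.2e}")

# Hypothetical candidate tolerances, for illustration only.
for rtol in (1e-9, 1e-8):
    status = "pass" if within_tolerance(benchmark_ez, plotfile_ez, rtol) else "FAIL"
    print(f"rtol={rtol:.0e}: {status}")
```

With these numbers the deviation sits just above 1e-9, so a relative tolerance of 1e-8 on the affected tests would absorb it while still catching larger regressions.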