EZoni opened this issue 3 years ago (status: Open)
On DGX, I saw segfaults in FFTW: https://github.com/ECP-WarpX/WarpX/blob/3a6650e45ef4cf9fed32e62228f4f45f4e74303a/Source/FieldSolver/SpectralSolver/SpectralFieldData.cpp#L165-L168
This could also be a compiler/MPI mismatch, but it's worth double-checking our usage of the FFTW API contract.
Update: probably just an incompatible FFTW module that I loaded quickly on DGX
Proposed fix for multi_J_rz_psatd and galilean_rz_psatd stability in #2404.
@EZoni still sees a discrepancy locally. Let's see if #2302 gets better now for multi_J_rz_psatd.
multi_J_rz_psatd will be temporarily disabled in #2411 until we find the origin of the fluctuation.
New candidate spotted: ElectrostaticSphereEB_mixedBCs differs in the analysis (via #2411).
Likely related: simulations with a Gaussian beam might have had system-dependent initialization routines before the fix in #2522 / #2523. (Thanks for the fix @RemiLehe :tada: )
The fix changed the following tests:
LaserAccelerationBoost
LaserAccelerationMR
LaserAccelerationRZ
PlasmaAccelerationBoost2d
PlasmaAccelerationBoost3d
PlasmaAccelerationMR
Python_gaussian_beam
RefinedInjection
RigidInjection_lab
comoving_2d_psatd_hybrid
divb_cleaning_3d
galilean_2d_psatd_hybrid
initial_distribution
restart
restart_psatd
restart_psatd_time_avg
Although #3965 does not change any test, the test galilean_rz_psatd shows changes in the checksum:
```
2023-06-06T13:13:03.6959515Z Check numerical stability:
2023-06-06T13:13:03.6959673Z err_energy = 1.862138691139192e-10
2023-06-06T13:13:03.6959806Z tol_energy = 1e-08
2023-06-06T13:13:03.6959988Z ERROR: Benchmark and plotfile checksum have different value for key [lev=0,Ez]
2023-06-06T13:13:03.6960172Z Benchmark: [lev=0,Ez] 4.124588451444761e+03
2023-06-06T13:13:03.6960336Z Plotfile : [lev=0,Ez] 4.124588455632591e+03
2023-06-06T13:13:03.6960624Z Absolute error: 4.19e-06
2023-06-06T13:13:03.6960760Z Relative error: 1.02e-09
```
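For reference, the absolute and relative errors reported in the log above can be reproduced directly from the two Ez values (plain Python, not the WarpX checksum tooling itself):

```python
# Benchmark and plotfile values of [lev=0,Ez] from the CI log above.
benchmark = 4.124588451444761e+03
plotfile = 4.124588455632591e+03

abs_err = abs(plotfile - benchmark)
rel_err = abs_err / abs(benchmark)

print(f"Absolute error: {abs_err:.2e}")  # Absolute error: 4.19e-06
print(f"Relative error: {rel_err:.2e}")  # Relative error: 1.02e-09
```

The relative error of ~1e-09 is only a few hundred ULPs on a double-precision value of this magnitude.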
Are we using OpenMP in the FFTs? Maybe that could produce slightly different orders of operations. We could try disabling OpenMP for all RZ tests if we want to avoid parallelization (and check that there is no manual threading in FFTW beyond OpenMP). -> Update: only one OpenMP thread. -> Update: only one MPI rank.
For reductions (e.g., sums in the FFTs), a difference of this size is consistent with machine precision. (The accumulated rounding error of a reduction is significantly larger than that of point-wise operations.)
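A minimal, deterministic illustration of why reduction order matters (plain Python, not WarpX code): floating-point addition is not associative, so the grouping chosen by a parallel reduction or an FFT butterfly changes the last bits of the result.

```python
# Two small terms added to 1.0 one at a time are each rounded away,
# but summed together first they survive: same operands, different result.
a = (1.0 + 1e-16) + 1e-16  # each 1e-16 is below half an ULP of 1.0
b = 1.0 + (1e-16 + 1e-16)  # 2e-16 rounds up to one ULP above 1.0

print(a == b)  # False
print(b - a)   # 2.220446049250313e-16 (one ULP at 1.0)
```

A long sum over many grid points accumulates many such rounding decisions, which is why two equally valid execution orders can disagree at the 1e-09 relative level seen above.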
Another issue that could happen is using slightly different microarchitectures (CPUs) and getting different vectorization intrinsics (e.g., processing 1, 2 or 4 values at a time).
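The vectorization effect can be sketched the same way (an illustrative emulation, not actual compiler output): a wider SIMD unit keeps more independent partial accumulators, which regroups the sum and can change its last bits.

```python
import math

def chunked_sum(values, width):
    """Emulate a SIMD reduction: `width` lanes each accumulate every
    `width`-th element, then the lane partials are combined at the end."""
    lanes = [0.0] * width
    for i, v in enumerate(values):
        lanes[i % width] += v
    return sum(lanes)

# Alternating series with partial sum close to pi^2 / 12.
values = [(-1.0) ** i / (i + 1.0) ** 2 for i in range(10_001)]

s1 = chunked_sum(values, 1)  # scalar-like accumulation
s4 = chunked_sum(values, 4)  # 4-lane, SIMD-like accumulation

print(abs(s1 - s4))  # tiny regrouping difference, at machine-precision level
```

Both results are equally valid; neither microarchitecture is "wrong", which again argues for tolerances that accommodate machine-precision noise.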
Personally, I would increase the tolerances here for affected tests.
A number of developers have observed occasional discrepancies between the CI benchmarks generated on local machines and the ones generated on Azure, for some of the RZ automated tests.
Cc: @dpgrote @MaxThevenet @RevathiJambunathan @oshapoval @RemiLehe @ax3l Please feel free to edit this description by adding the PRs and tests where you encountered this issue (or do so in the future):
- #2029: galilean_rz_psatd (see comment)
- #2111: multi_J_rz_psatd
- #2302: multi_J_rz_psatd
- #2411: ElectrostaticSphereEB_mixedBCs