Open jtkrogel opened 1 year ago
The NaN is in scalar.dat, but the NaN detector in the wavefunction components was not tripped. => There is most likely a problem with just the MPC computation.
runs_2023-09-11-09-15-23]$ grep -n -i NaN */*.scalar.dat
FeCl2-tile-2-hyb-0-spo-0-est-0-walk-1680/vmc.s000.scalar.dat:3: 1 -1.2273884599e+03 1.5065084464e+06 -1.8482543823e+03 6.2086592258e+02 -1.2991870771e+04 2.0324763583e+02 5.9220107535e+03 5.0183579990e+03 -nan 4.0320000000e+04 8.9543626972e+01 6.6160342262e-01
FeCl2-tile-2-hyb-0-spo-0-est-0-walk-1680/vmc.s000.scalar.dat:4: 2 -1.2275045133e+03 1.5067927820e+06 -1.8592057578e+03 6.3170124457e+02 -1.2992550874e+04 2.0367290127e+02 5.9113142159e+03 5.0183579990e+03 -nan 4.0320000000e+04 8.9694526033e+01 6.5901697875e-01
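To see which observable carries the NaN rather than just which line, the grep above can be extended with a small script. This is a minimal sketch, not part of the original report: it assumes each scalar.dat file starts with a `#` header line listing one name per column, and the function name `find_nonfinite` is made up for illustration.

```python
import math
from pathlib import Path


def find_nonfinite(path):
    """Return (line_number, column_name, raw_token) for each NaN/inf entry."""
    hits = []
    names = None
    with open(path) as fh:
        for lineno, line in enumerate(fh, start=1):
            stripped = line.strip()
            if not stripped:
                continue
            if stripped.startswith("#"):
                # Assumption: the header is '#' followed by the column names.
                names = stripped.lstrip("#").split()
                continue
            for col, tok in enumerate(stripped.split()):
                try:
                    value = float(tok)  # float() also parses "nan" and "-nan"
                except ValueError:
                    continue
                if not math.isfinite(value):
                    if names and col < len(names):
                        name = names[col]
                    else:
                        name = f"col{col}"  # fall back to a 0-based index
                    hits.append((lineno, name, tok))
    return hits


if __name__ == "__main__":
    # Same file pattern as the grep command: one level of run directories.
    for scalar_file in sorted(Path(".").glob("*/*.scalar.dat")):
        for lineno, name, tok in find_nonfinite(scalar_file):
            print(f"{scalar_file}:{lineno}: {name} = {tok}")
```

Run from the runs directory, this prints one line per non-finite entry together with the column name taken from the header, which makes it easy to confirm whether only the MPC column is affected.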
The segfaults are quasi-reproducible when run with the same seed (single node runs in all cases). The reproduction rate is better than 50%.
Below, * marks segfaults that appear in only one of the two sets of runs; all others reproduce. The behavior is likely non-deterministic, so any ported fix should be rerun a few times for verification.
Original set:
runs_2023-09-11-09-15-23
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1024
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1680
*FeCl2-tile-3-hyb-0-spo-0-est-0-walk-2400
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3360
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3840
FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720
Reruns:
runs_2023-09-11-12-31-45
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1024
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1680
*FeCl2-tile-3-hyb-0-spo-0-est-0-walk-2880
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3360
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3840
*FeCl2-tile-4-hyb-0-spo-0-est-0-walk-300
*FeCl2-tile-4-hyb-0-spo-0-est-0-walk-512
FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720
Also, I observed no NaNs in scalar.dat for the reruns.
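The * markers can be cross-checked by comparing the two sets of segfaulting runs directly. The sketch below transcribes the run names from the two lists above; the helper name `only_in_each` is illustrative, not from any existing tool.

```python
PREFIX = "FeCl2-tile-{tile}-hyb-0-spo-0-est-0-walk-{walk}"


def names(pairs):
    """Expand (tile, walkers) pairs into full run directory names."""
    return {PREFIX.format(tile=t, walk=w) for t, w in pairs}


# Segfaulting runs in runs_2023-09-11-09-15-23 (original set).
original = names([(3, 180), (3, 1024), (3, 1680), (3, 2400),
                  (3, 3360), (3, 3840), (4, 720)])

# Segfaulting runs in runs_2023-09-11-12-31-45 (reruns).
reruns = names([(3, 180), (3, 1024), (3, 1680), (3, 2880), (3, 3360),
                (3, 3840), (4, 300), (4, 512), (4, 720)])


def only_in_each(a, b):
    """Return the runs that failed only in a, and only in b, as sorted lists."""
    return sorted(a - b), sorted(b - a)


only_original, only_reruns = only_in_each(original, reruns)
reproduced = sorted(original & reruns)

if __name__ == "__main__":
    print("only in original set:", only_original)
    print("only in reruns:", only_reruns)
    print("reproduced in both:", len(reproduced), "runs")
```

With these inputs, six runs segfault in both sets, one only in the original set (walk-2400), and three only in the reruns (walk-2880, walk-300, walk-512), which matches the starred entries.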
Describe the bug
Use of MPC is unstable on Frontier (CPU code). A handful of FeCl2 runs have segfaulted, and one run has produced a NaN.
To Reproduce
Build details:
Problem cases (segfault):
Problem case (NaN in scalar.dat):
FeCl2-tile-2-hyb-0-spo-0-est-0-walk-1680
Location on Frontier:
/lustre/orion/mat151/proj-shared/ecp_vdw_test_runs/frontier_files/test_runs_jk_cpu/runs_2023-09-11-09-15-23
To reproduce, copy the relevant files into a new directory and resubmit (sbatch qmc.sbatch.in).
Expected behavior
No segfaults or NaNs.