QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org

MPC causes segfault on Frontier #4725

Open jtkrogel opened 1 year ago

jtkrogel commented 1 year ago

Describe the bug

Use of MPC is unstable on Frontier (CPU code). A handful of FeCl2 runs have segfaulted, and one run has produced NaN.

To Reproduce

Build details:

  Git branch: develop
  Last git commit: 283f2438770bdfb592d161d287771764cbf6f96c
  Last git commit date: Sat Aug 26 09:36:21 2023 -0500
  Last git commit subject: Merge pull request #4715 from QMCPACK/prckent-patch-1

Currently Loaded Modules:
  1) craype-x86-trento                      13) darshan-runtime/3.4.0
  2) libfabric/1.15.2.0                     14) hsi/default
  3) craype-network-ofi                     15) DefApps/default
  4) perftools-base/22.12.0                 16) emacs/28.1
  5) xpmem/2.6.2-2.5_2.22__gd067c3f.shasta  17) cmake/3.23.2
  6) cray-pmi/6.1.8                         18) openblas/0.3.17
  7) cce/15.0.0                             19) cray-fftw/3.3.10.3
  8) craype/2.7.19                          20) hdf5/1.14.0
  9) cray-dsmml/0.2.2                       21) boost/1.79.0
 10) cray-mpich/8.1.23                      22) rocm/5.5.1
 11) cray-libsci/22.12.1.1                  23) ninja/1.10.2
 12) PrgEnv-cray/8.3.3

Executable:
/lustre/orion/world-shared/mat151/pk7/try_frontier/build_frontier_cpu_real_MP/bin/qmcpack

Problem cases (segfault):

FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180/qmc.out:srun:  error: frontier04992: task 7: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1024/qmc.out:srun: error: frontier08960: task 4: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1680/qmc.out:srun: error: frontier10366: task 6: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-2400/qmc.out:srun: error: frontier00384: task 5: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3360/qmc.out:srun: error: frontier08319: task 3: Segmentation fault (core dumped)
FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3840/qmc.out:srun: error: frontier00208: task 6: Segmentation fault (core dumped)
FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720/qmc.out:srun:  error: frontier00201: task 0: Segmentation fault (core dumped)

Problem case (NaN in scalar.dat): FeCl2-tile-2-hyb-0-spo-0-est-0-walk-1680

Location on Frontier: /lustre/orion/mat151/proj-shared/ecp_vdw_test_runs/frontier_files/test_runs_jk_cpu/runs_2023-09-11-09-15-23

To reproduce, copy the relevant files into a new directory and resubmit (sbatch qmc.sbatch.in).

Expected behavior

No segfaults or NaNs.

prckent commented 1 year ago

The NaN is in scalar.dat, but the NaN detector in the wavefunction components was not tripped. => The problem is most likely confined to the MPC computation.

runs_2023-09-11-09-15-23]$ grep -n -i NaN */*.scalar.dat
FeCl2-tile-2-hyb-0-spo-0-est-0-walk-1680/vmc.s000.scalar.dat:3:         1   -1.2273884599e+03    1.5065084464e+06   -1.8482543823e+03    6.2086592258e+02   -1.2991870771e+04    2.0324763583e+02    5.9220107535e+03    5.0183579990e+03                -nan    4.0320000000e+04    8.9543626972e+01    6.6160342262e-01
FeCl2-tile-2-hyb-0-spo-0-est-0-walk-1680/vmc.s000.scalar.dat:4:         2   -1.2275045133e+03    1.5067927820e+06   -1.8592057578e+03    6.3170124457e+02   -1.2992550874e+04    2.0367290127e+02    5.9113142159e+03    5.0183579990e+03                -nan    4.0320000000e+04    8.9694526033e+01    6.5901697875e-01
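
To see which estimator column holds the non-finite value rather than only which row, the grep above can be complemented by a small parser. This is a minimal sketch, assuming the scalar.dat file begins with a `#`-prefixed header line that names the columns. The function name `nonfinite_columns` and the sample column names are hypothetical, since the header of the affected file is not shown in this thread.

```python
import math


def nonfinite_columns(lines):
    """Map column name -> list of row labels where the value is NaN or inf.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumption: the first line starting with '#' lists the column names.
    """
    names = None
    hits = {}
    for line in lines:
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            # Header line: drop the leading '#' and split into column names.
            names = stripped.lstrip("#").split()
            continue
        fields = stripped.split()
        for col, token in enumerate(fields):
            try:
                value = float(token)  # float() accepts "nan", "-nan", "inf"
            except ValueError:
                continue  # skip anything that is not a number
            if not math.isfinite(value):
                if names is not None and col < len(names):
                    name = names[col]
                else:
                    name = f"col{col}"  # fall back to a positional label
                hits.setdefault(name, []).append(fields[0])
    return hits


if __name__ == "__main__":
    # Hypothetical sample data; real files have many more columns.
    sample = [
        "#   index    LocalEnergy    MPC",
        "        1   -1.2273884599e+03   -nan",
        "        2   -1.2275045133e+03   -nan",
    ]
    print(nonfinite_columns(sample))
```

For a real file, pass an open handle, for example `with open("vmc.s000.scalar.dat") as f: print(nonfinite_columns(f))`.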

jtkrogel commented 1 year ago

The segfaults are quasi-reproducible when run with the same seed (single node runs in all cases). The reproduction rate is better than 50%.

Below, * marks segfaults that appear in only one of the two sets of runs. All others reproduce. The behavior is likely non-deterministic, so any proposed fix should be rerun a few times for verification.

Original set:

runs_2023-09-11-09-15-23
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180 
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1024
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1680
  *FeCl2-tile-3-hyb-0-spo-0-est-0-walk-2400
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3360
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3840
   FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720 

Reruns:

runs_2023-09-11-12-31-45
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180 
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1024
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1680
  *FeCl2-tile-3-hyb-0-spo-0-est-0-walk-2880
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3360
   FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3840
  *FeCl2-tile-4-hyb-0-spo-0-est-0-walk-300 
  *FeCl2-tile-4-hyb-0-spo-0-est-0-walk-512 
   FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720 

Also, I observed no NaNs in scalar.dat for the reruns.
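
The starred entries in the two lists above can be cross-checked with a simple set comparison. This is a minimal sketch using the run names exactly as listed; the variable names are illustrative only.

```python
# Segfaulting cases from the original set (runs_2023-09-11-09-15-23).
original = {
    "FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180",
    "FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1024",
    "FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1680",
    "FeCl2-tile-3-hyb-0-spo-0-est-0-walk-2400",
    "FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3360",
    "FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3840",
    "FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720",
}

# Segfaulting cases from the reruns (runs_2023-09-11-12-31-45).
rerun = {
    "FeCl2-tile-3-hyb-0-spo-0-est-0-walk-180",
    "FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1024",
    "FeCl2-tile-3-hyb-0-spo-0-est-0-walk-1680",
    "FeCl2-tile-3-hyb-0-spo-0-est-0-walk-2880",
    "FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3360",
    "FeCl2-tile-3-hyb-0-spo-0-est-0-walk-3840",
    "FeCl2-tile-4-hyb-0-spo-0-est-0-walk-300",
    "FeCl2-tile-4-hyb-0-spo-0-est-0-walk-512",
    "FeCl2-tile-4-hyb-0-spo-0-est-0-walk-720",
}

reproduced = original & rerun      # segfaulted in both sets
only_original = original - rerun   # starred in the original set
only_rerun = rerun - original      # starred in the reruns

# Print every case, marking the ones seen in only one set with '*'.
for name in sorted(original | rerun):
    marker = " " if name in reproduced else "*"
    print(f"{marker} {name}")
```

With these lists, six cases segfault in both sets, one is unique to the original set, and three are unique to the reruns.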