Fatal Error. Aborting at QMCHamiltonian::evaluate component Kinetic returns NaN

Dankomaister commented 1 year ago

Hi! I have a problem running qmcpack-3.15.0 compiled with the Intel compilers 2021.4.0. The compiled version passes the tests selected by ctest -R qe. However, when I run my actual calculations (see input_files) I get the following error: Fatal Error. Aborting at QMCHamiltonian::evaluate component Kinetic returns NaN

Compiling the same qmcpack version with the GNU compilers works fine: it passes the ctest -R qe tests and my DMC calculation finishes without errors. The error message looks like a custom one from qmcpack? I would like to be able to compile and run qmcpack with the Intel compilers, so any help solving this would be appreciated.

Steps to reproduce the behavior

qmcpack-3.15.0
cmake -DCMAKE_INSTALL_PREFIX=/mnt/lustre/ibs/cmcm/danielhedman/software/eb/software/qmcpack/3.15.0-intel-2021b -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DBOOST_ROOT=/mnt/lustre/ibs/cmcm/danielhedman/software/eb/software/Boost/1.77.0-intel-compilers-2021.4.0 -DBoost_NO_SYSTEM_PATHS=ON /dev/shm/qmcpack/3.15.0/intel-2021b/qmcpack-3.15.0/
run the following calculation: input_files

Expected behavior

I expect the behavior to be the same for qmcpack compiled with Intel as with GNU compilers, i.e. my calculation completes without errors.

System

same error on two different local HPC clusters
modules loaded:
Loading intel/2021b
Loading requirement: GCCcore/11.2.0 zlib/1.2.11 binutils/2.37 intel-compilers/2021.4.0 numactl/2.0.14 UCX/1.11.2 impi/2021.4.0 imkl/2021.4.0 imkl-FFTW/2021.4.0
Loading qmcpack/3.15.0
Loading requirement: Szip/2.1.1 HDF5/1.12.1 bzip2/1.0.8 XZ/5.2.5 ICU/69.1 Boost/1.77.0 libxml2/2.9.10 ELPA/2021.11.001 libxc/5.1.6 QuantumESPRESSO/7.1-qmcpack

Additional context

Output from the calculation is attached: slurm-223491.zip

prckent commented 1 year ago

Thanks for reporting this. Can you confirm the output of ctest -L deterministic? Does everything pass? This will give us a hint as to whether there is a general problem or something more restricted to your example input.

You are correct about the NaN trap - we have this as a sanity check inside QMCPACK.
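For illustration, a NaN trap of this kind can be sketched as follows. This is a minimal Python sketch of the idea only, not QMCPACK's actual C++ implementation; the function name `check_components` and the dictionary of components are hypothetical.

```python
import math


def check_components(components):
    """Return the summed energy, raising if any component is NaN.

    `components` maps a component name (e.g. "Kinetic") to its value.
    The names and structure are hypothetical, for illustration only.
    """
    total = 0.0
    for name, value in components.items():
        if math.isnan(value):
            # Stop at the first bad component instead of letting the NaN
            # propagate silently into the accumulated result.
            raise FloatingPointError(f"component {name} returns NaN")
        total += value
    return total
```

Failing at the first NaN names the offending component, which is why the error in this issue points at Kinetic rather than at a later averaged quantity.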

Dankomaister commented 1 year ago

Okay, I ran ctest -L deterministic and it failed on one test, ntest_nexus_qdens_radial. This is the final output of ctest -L deterministic:

99% tests passed, 1 tests failed out of 1149

Label Time Summary:
QMCPACK                     = 579.21 sec
QMCPACK-checking-results    =  42.83 sec
converter                   =  35.27 sec
coverage                    =  21.96 sec
deterministic               = 910.71 sec
nexus                       = 143.86 sec
quality_unknown             = 859.53 sec
unit                        = 101.23 sec

Total Test time (real) = 913.79 sec

The following tests FAILED:
        2135 - ntest_nexus_qdens_radial (Failed)

From the name I guess this is a nexus test so perhaps not that relevant?
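When a summary like the one above is long, the failed tests can be pulled out of the saved output with a small script. The following is a rough sketch that assumes the output format shown in this comment; the helper name `failed_tests` is hypothetical.

```python
import re


def failed_tests(ctest_output):
    """Extract (number, name) pairs listed after 'The following tests FAILED:'."""
    failures = []
    in_section = False
    for line in ctest_output.splitlines():
        if line.strip() == "The following tests FAILED:":
            in_section = True
            continue
        if in_section:
            # Expected shape: "<number> - <test name> (<reason>)"
            match = re.match(r"\s*(\d+)\s+-\s+(\S+)\s+\(", line)
            if match:
                failures.append((int(match.group(1)), match.group(2)))
    return failures
```

Applied to the output above, this would return the single entry for ntest_nexus_qdens_radial.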

prckent commented 1 year ago

I guess this is a nexus test so perhaps not that relevant?

Correct. This is a test of the qdens tool for analyzing densities and is not relevant here.

The results from the other tests indicate that the code does not have any major issues and that it should be good for production science runs. The tests include VMC and DMC runs for several simple solids.

Dankomaister commented 1 year ago

Okay, so these tests do not help to narrow down the bug in my calculation. What is the next step?

prckent commented 1 year ago

Next steps

See if this can be reproduced with the latest develop and not v3.15.0. That said, we haven't changed anything that would obviously result in different behavior.
Find out if anyone else can reproduce this with any of the compilers they have access to, e.g. the latest Intel OneAPI release.
Study the input files to see if there is anything obviously different from similar runs that do run successfully.

It is a holiday here today but we'll discuss among the developers in subsequent days.

prckent commented 1 year ago

FYI, we have reproduced this crash with the latest Intel OneAPI compiler. A single determinant (no Jastrow) VMC run can trigger it. So, for example, something could be wrong with the inputs and our processing of them, or with our construction of the spline orbitals, or perhaps the H5 file is somehow bad.

ye-luo commented 1 year ago

@Dankomaister could you provide the following info

cat /etc/os-release # OS info
ldd --version          # glibc version

In addition, in your qmcpack build directory

nm src/QMCWaveFunctions/CMakeFiles/qmcwfs.dir/BsplineFactory/SplineC2R.cpp.o |grep sincos

Dankomaister commented 1 year ago

Hi @ye-luo,

Here is the information you asked for.

cat /etc/os-release

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

ldd --version

ldd (GNU libc) 2.17
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

nm src/QMCWaveFunctions/CMakeFiles/qmcwfs.dir/BsplineFactory/SplineC2R.cpp.o | grep sincos

                 U __svml_sincos2_l9
                 U __svml_sincos4_l9
                 U __svml_sincosf4_l9
                 U __svml_sincosf8_l9
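
In the nm output above, the U marks undefined symbols, i.e. routines that the object file references and that are resolved at link time. The __svml_ prefix belongs to Intel's Short Vector Math Library, so this object calls vectorized sincos routines. A small sketch for picking such symbols out of saved nm output is given below; the helper name `undefined_symbols` is hypothetical.

```python
def undefined_symbols(nm_output, prefix="__svml_"):
    """Return undefined symbols (type 'U') from nm output that start with `prefix`."""
    found = []
    for line in nm_output.splitlines():
        fields = line.split()
        # Undefined symbols have no address column, so the line is "U <name>".
        if len(fields) == 2 and fields[0] == "U" and fields[1].startswith(prefix):
            found.append(fields[1])
    return found
```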

prckent commented 1 year ago

As you can tell from the questions, we think this is an issue with vectorization of transcendental functions. This could be a library/compiler issue but we can't rule out an issue on our side yet. Are you able to keep doing production with the GNU compilation? It should not be much slower than an Intel build.
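
One way to test a suspect sincos implementation is to compare it against a trusted scalar reference over a set of inputs. The sketch below shows only the shape of such a check in Python. It does not exercise Intel's vectorized routines, which would require a C++ test built with the same compiler and flags. The helper name `bad_sincos_inputs` and the tolerance are assumptions.

```python
import math


def bad_sincos_inputs(sincos, angles, tol=1e-12):
    """Return the angles for which `sincos` gives NaN or disagrees with math.sin/cos.

    `sincos` is any callable returning the pair (sin(x), cos(x)).
    """
    bad = []
    for x in angles:
        s, c = sincos(x)
        if (math.isnan(s) or math.isnan(c)
                or abs(s - math.sin(x)) > tol
                or abs(c - math.cos(x)) > tol):
            bad.append(x)
    return bad
```

An empty result means the candidate agrees with the reference on every input tried; any returned angle is a starting point for a smaller reproducer.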

Dankomaister commented 1 year ago

Hi, sure, I can use the GNU version, but it would be nice if this could be fixed.

QMCPACK / qmcpack #4396