QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org

Fatal Error. Aborting at QMCHamiltonian::evaluate component Kinetic returns NaN #4396

Open Dankomaister opened 1 year ago

Dankomaister commented 1 year ago

Hi! I have a problem running qmcpack-3.15.0 compiled with the Intel compilers 2021.4.0. The build passes the tests in ctest -R qe, but when I run my actual calculation (see input_files) I get the following error: Fatal Error. Aborting at QMCHamiltonian::evaluate component Kinetic returns NaN

Compiling the same qmcpack version with the GNU compilers works fine: it passes ctest -R qe and finishes my DMC calculation without errors. The error message looks like a custom one from qmcpack? I would like to be able to compile and run qmcpack with the Intel compilers, so any help solving this would be appreciated.

Steps to reproduce the behavior

  1. qmcpack-3.15.0
  2. cmake -DCMAKE_INSTALL_PREFIX=/mnt/lustre/ibs/cmcm/danielhedman/software/eb/software/qmcpack/3.15.0-intel-2021b -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE=ON -DCMAKE_FIND_USE_PACKAGE_REGISTRY=OFF -DBOOST_ROOT=/mnt/lustre/ibs/cmcm/danielhedman/software/eb/software/Boost/1.77.0-intel-compilers-2021.4.0 -DBoost_NO_SYSTEM_PATHS=ON /dev/shm/qmcpack/3.15.0/intel-2021b/qmcpack-3.15.0/
  3. run the following calculation: input_files

Expected behavior

I expect the behavior to be the same for qmcpack compiled with the Intel compilers as with the GNU compilers, i.e. my calculation completes without errors.

System

Additional context

Output from the calculation is attached: slurm-223491.zip

prckent commented 1 year ago

Thanks for reporting this. Can you confirm the output of ctest -L deterministic ? Does everything pass? This will give us a hint as to whether there is a general problem or perhaps something more restricted to your example input.

You are correct about the NaN trap - we have this as a sanity check inside QMCPACK.
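For readers unfamiliar with this kind of sanity check, here is a minimal C++ sketch of a NaN trap over energy components. It is not QMCPACK's actual implementation: `EnergyComponent` and `sumComponentsOrThrow` are hypothetical names invented for illustration, and only the wording of the message is taken from the error quoted in this issue.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for one Hamiltonian term; not a QMCPACK class.
struct EnergyComponent
{
  std::string name;
  double value;
};

// Sum all components, refusing to continue if any of them is NaN.
// The message format mimics the error quoted in this issue.
double sumComponentsOrThrow(const std::vector<EnergyComponent>& components)
{
  double total = 0.0;
  for (const auto& c : components)
  {
    if (std::isnan(c.value))
      throw std::runtime_error("QMCHamiltonian::evaluate component " + c.name + " returns NaN");
    total += c.value;
  }
  return total;
}
```

Checking each component separately, rather than only the total, is what lets the message name the offending term (Kinetic in this report).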

Dankomaister commented 1 year ago

Okay, I ran ctest -L deterministic and one test failed, ntest_nexus_qdens_radial. This is the final output of ctest -L deterministic:

99% tests passed, 1 tests failed out of 1149

Label Time Summary:
QMCPACK                     = 579.21 sec
QMCPACK-checking-results    =  42.83 sec
converter                   =  35.27 sec
coverage                    =  21.96 sec
deterministic               = 910.71 sec
nexus                       = 143.86 sec
quality_unknown             = 859.53 sec
unit                        = 101.23 sec

Total Test time (real) = 913.79 sec

The following tests FAILED:
        2135 - ntest_nexus_qdens_radial (Failed)

From the name I guess this is a nexus test so perhaps not that relevant?

prckent commented 1 year ago

> I guess this is a nexus test so perhaps not that relevant?

Correct. This is a test of the qdens tool for analyzing densities and is not relevant here.

The results from the other tests indicate that the code does not have any major issues and that it should be good for production science runs. The tests include VMC and DMC runs for several simple solids.

Dankomaister commented 1 year ago

Okay, so these tests do not help to narrow down the bug in my calculation. So what is the next step?

prckent commented 1 year ago

Next steps

It is a holiday here today but we'll discuss among the developers in subsequent days.

prckent commented 1 year ago

FYI, we have reproduced this crash with the latest Intel oneAPI compiler. A single-determinant (no Jastrow) VMC run can trigger it. So, for example, something is wrong with the inputs and our processing of them, with our construction of the spline orbitals, or perhaps the H5 file is somehow bad.

ye-luo commented 1 year ago

@Dankomaister could you provide the following info?

cat /etc/os-release # OS info
ldd --version          # glibc version

In addition, in your qmcpack build directory

nm src/QMCWaveFunctions/CMakeFiles/qmcwfs.dir/BsplineFactory/SplineC2R.cpp.o | grep sincos

Dankomaister commented 1 year ago

Hi @ye-luo,

Here is the information you asked for.

cat /etc/os-release

NAME="CentOS Linux"
VERSION="7 (Core)"
ID="centos"
ID_LIKE="rhel fedora"
VERSION_ID="7"
PRETTY_NAME="CentOS Linux 7 (Core)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:centos:centos:7"
HOME_URL="https://www.centos.org/"
BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7"
CENTOS_MANTISBT_PROJECT_VERSION="7"
REDHAT_SUPPORT_PRODUCT="centos"
REDHAT_SUPPORT_PRODUCT_VERSION="7"

ldd --version

ldd (GNU libc) 2.17
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

nm src/QMCWaveFunctions/CMakeFiles/qmcwfs.dir/BsplineFactory/SplineC2R.cpp.o | grep sincos

                 U __svml_sincos2_l9
                 U __svml_sincos4_l9
                 U __svml_sincosf4_l9
                 U __svml_sincosf8_l9

prckent commented 1 year ago

As you can tell from the questions, we think this is an issue with vectorization of transcendental functions. This could be a library/compiler issue but we can't rule out an issue on our side yet. Are you able to keep doing production with the GNU compilation? It should not be much slower than an Intel build.
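One way to probe this kind of suspicion outside QMCPACK is a small standalone check like the C++ sketch below. It is only an illustration, not a reproducer for this issue: `makePhases` and `countBadSinCos` are hypothetical helpers, the tolerance is an arbitrary choice, and whether the loop is actually compiled into vectorized sincos calls depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: n evenly spaced phases on [lo, hi].
std::vector<double> makePhases(std::size_t n, double lo, double hi)
{
  std::vector<double> phases(n);
  const double step = (n > 1) ? (hi - lo) / static_cast<double>(n - 1) : 0.0;
  for (std::size_t i = 0; i < n; ++i)
    phases[i] = lo + step * static_cast<double>(i);
  return phases;
}

// Count phases whose sin/cos are non-finite or violate sin^2 + cos^2 = 1
// within tol. The tolerance is an assumption made for this sketch.
std::size_t countBadSinCos(const std::vector<double>& phases, double tol = 1e-12)
{
  const std::size_t n = phases.size();
  std::vector<double> s(n), c(n);
  // Plain loop that an optimizing compiler may turn into vector sincos calls.
  for (std::size_t i = 0; i < n; ++i)
  {
    s[i] = std::sin(phases[i]);
    c[i] = std::cos(phases[i]);
  }
  std::size_t bad = 0;
  for (std::size_t i = 0; i < n; ++i)
  {
    const double norm = s[i] * s[i] + c[i] * c[i];
    if (!std::isfinite(s[i]) || !std::isfinite(c[i]) || std::abs(norm - 1.0) > tol)
      ++bad;
  }
  return bad;
}
```

Building the same file with both compilers at the optimization level used for QMCPACK and comparing the counts, for example from `countBadSinCos(makePhases(1001, -100.0, 100.0))`, gives a quick first signal. A clean result does not rule out a problem in the real code path.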

Dankomaister commented 1 year ago

Hi, sure, I can use the GNU version, but it would be nice if this could be fixed.