QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org

The AFQMC fails on CRAY EX #4434

Closed vvp-nsk closed 1 year ago

vvp-nsk commented 1 year ago

Describe the bug: Any AFQMC run on CRAY EX fails with the following error message:

***************************************************
****************************************************
****************************************************
          Beginning Driver initialization.
****************************************************
****************************************************
****************************************************

terminate called after throwing an instance of 'std::runtime_error'
  what():  Error: Incorrect global state in require (found uninitialized).

To Reproduce: Steps to reproduce the behavior:

  1. QMCPACK v3.16.0
  2. cmake -DQMC_COMPLEX=1 -DBUILD_AFQMC=ON -DCMAKE_SYSTEM_NAME=CrayLinuxEnvironment
  3. srun qmcpack_complex afqmc.xml
  4. Even the rather basic Ne example runs into this problem.

System:

prckent commented 1 year ago

Thanks for reporting this. Do you happen to know if this is a new problem introduced in v3.16.0 that was not present in v3.15.0? Any issues on other systems?

vvp-nsk commented 1 year ago

Hi!

If I recall correctly, v3.15.0 also suffers from the same problem. I am not 100% sure, but I ran into a similar issue on an Intel-based IB cluster last fall.

vvp-nsk commented 1 year ago

Hi!

Just an update. The most recent SW stack by CRAY raises a new error message:

ham_factory
-------------------------------------------------------------------------------
/cfs/klemming/projects/snic/teobio/Victor/Develop/qmcpack-gnu-s11/src/AFQMC/Hamiltonians/tests/test_hamiltonian_factory.cpp:137
...............................................................................

/cfs/klemming/projects/snic/teobio/Victor/Develop/qmcpack-gnu-s11/src/AFQMC/Hamiltonians/tests/test_hamiltonian_factory.cpp:137: FAILED:
  {Unknown expression after the reported line}
due to unexpected exception with message:
  cannot call function void boost::mpi3::detail::call(Args ...) [with FT = int
  (void*, int, int, int, int); FT* F = MPI_Bcast; Args = {bool*, int, boost::
  mpi3::detail::basic_datatype<bool>, int, int}; decltype (static_cast<boost::
  mpi3::error>((* F)((declval<Args>)()...)))* <anonymous> = 0]: Invalid
  datatype, error stack:
  PMPI_Bcast(454): MPI_Bcast(buf=0x7fff7ed71adb, count=1, MPI_DATATYPE_NULL,
  root=0, comm=comm=0x84000001) failed
  PMPI_Bcast(412): Datatype for argument datatype is a null datatype

The complete list of SW modules loaded is given below:

  ["PDCTEST"] = "22.06",
  ["atp"] = "3.14.11",
  ["boost"] = "1.79.0-cpeGNU-22.06",
  ["buildtools"] = "22.06",
  ["bzip2"] = "1.0.8",
  ["cpe"] = "22.06",
  ["cpeGNU"] = "22.06",
  ["cray-dsmml"] = "0.2.2",
  ["cray-fftw"] = "3.3.10.1",
  ["cray-hdf5-parallel"] = "1.12.1.5",
  ["cray-libsci"] = "22.06.1.3",
  ["cray-mpich"] = "8.1.17",
  ["cray-pmi"] = "6.1.3",
  ["cray-python"] = "3.9.12.1",
  ["craype"] = "2.7.16",
  ["craype-accel-host"] = "",
  ["craype-network-ofi"] = "",
  ["craype-x86-rome"] = "",
  ["gcc"] = "11.2.0",
  ["icu"] = "69.1",
  ["libfabric"] = "1.15.0.0",
  ["libxml2"] = "2.9.12",
  ["perftools-base"] = "22.06.0",
  ["snic-env"] = "1.0.0",
  ["systemdefault"] = "1.0.0",
  ["xpmem"] = "2.3.2-2.2_9.4__g93dd7ee.shasta",
  ["xz"] = "5.2.5",
  ["zlib"] = "1.2.11",

The following AFQMC tests FAILED:

         44 - deterministic-unit_test_afqmc_hamiltonians_ham_chol_uc (Failed)
         45 - deterministic-unit_test_afqmc_hamiltonians_ham_chol_sc (Failed)
         46 - deterministic-unit_test_afqmc_hamiltonians_ham_thc_sc (Failed)
         48 - deterministic-unit_test_afqmc_wfn_factory_ham_chol_uc_wfn_rhf (Failed)
         49 - deterministic-unit_test_afqmc_wfn_factory_ham_chol_sc_wfn_rhf (Failed)
         50 - deterministic-unit_test_afqmc_wfn_factory_ham_chol_sc_wfn_uhf (Failed)
         51 - deterministic-unit_test_afqmc_wfn_factory_ham_chol_sc_wfn_msd (Failed)
         52 - deterministic-unit_test_afqmc_wfn_factory_ham_thc_sc_wfn_rhf (Failed)
         53 - deterministic-unit_test_afqmc_prop_factory_ham_chol_uc_wfn_rhf (Failed)
         54 - deterministic-unit_test_afqmc_prop_factory_ham_chol_sc_wfn_rhf (Failed)
         55 - deterministic-unit_test_afqmc_prop_factory_ham_chol_sc_wfn_uhf (Failed)
         56 - deterministic-unit_test_afqmc_prop_factory_ham_chol_sc_wfn_msd (Failed)
         57 - deterministic-unit_test_afqmc_prop_factory_ham_thc_sc_wfn_rhf (Failed)
         58 - deterministic-unit_test_afqmc_estimators_ham_chol_uc_wfn_rhf (Failed)
         59 - deterministic-unit_test_afqmc_estimators_ham_chol_sc_wfn_rhf (Failed)
         60 - deterministic-unit_test_afqmc_estimators_ham_chol_sc_wfn_uhf (Failed)
         61 - deterministic-unit_test_afqmc_estimators_ham_chol_sc_wfn_msd (Failed)
         62 - deterministic-unit_test_afqmc_estimators_ham_thc_sc_wfn_rhf (Failed)
        152 - converter_test_pyscf_to_afqmc_01-neon_atom (Failed)
        158 - converter_test_pyscf_to_afqmc_02-neon_frozen_core (Failed)
        164 - converter_test_pyscf_to_afqmc_03-carbon_triplet_uhf (Failed)
        170 - converter_test_pyscf_to_afqmc_04-N2_nomsd (Failed)
        176 - converter_test_pyscf_to_afqmc_05-N2_phmsd (Failed)
        182 - converter_test_pyscf_to_afqmc_06-methane_converge_back_prop (Failed)
        189 - converter_test_pyscf_to_afqmc_07-diamond_2x2x2_supercell (Failed)
        195 - converter_test_pyscf_to_afqmc_08-diamond_2x2x2_kpoint_sym (Failed)
        1928 - short-diamondC_afqmc_1x1x1_complex_cholesky-16-1 (Failed)
        1929 - short-diamondC_afqmc_1x1x1_complex_cholesky-16-1-EnergyEstim__nume_real (Failed)
        1930 - long-diamondC_afqmc_1x1x1_complex_cholesky-16-1 (Failed)
        1931 - long-diamondC_afqmc_1x1x1_complex_cholesky-16-1-EnergyEstim__nume_real (Failed)
        1932 - short-diamondC_afqmc_1x1x1_complex_thc-16-1 (Failed)
        1933 - short-diamondC_afqmc_1x1x1_complex_thc-16-1-EnergyEstim__nume_real (Failed)
        1934 - long-diamondC_afqmc_1x1x1_complex_thc-16-1 (Failed)
        1935 - long-diamondC_afqmc_1x1x1_complex_thc-16-1-EnergyEstim__nume_real (Failed)

correaa commented 1 year ago

I am looking into this. Perhaps Cray MPI doesn't have an MPI_BOOL datatype.

Annoyingly, MPI implementations define some datatype constants as MPI_DATATYPE_NULL when they are not available.

vvp-nsk commented 1 year ago

I am sorry about that. Unfortunately, the CRAY-MPICH distribution is only available on Dardel. Perhaps one should check for MPI_CXX_BOOL rather than MPI_BOOL?

vvp-nsk commented 1 year ago

[image]

correaa commented 1 year ago

It is MPI_CXX_BOOL that evaluates to a null datatype when MPI is compiled without CXX support. (Compiling without CXX support is correct because the old CXX bindings are deprecated.)

Now I have replaced MPI_CXX_BOOL with MPI_C_BOOL.

I also took the opportunity to upgrade the sublibrary to release v0.81 (https://github.com/QMCPACK/qmcpack/pull/4458).

This version also allows convenient reductions on bool values. The two snippets below do the same thing: the first reduces a range of one bool, the second a single stack variable.

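    {
        // Needs at least two ranks so that rank 1 exists.
        assert(world.size() != 1);

        // b is true on rank 1 only.
        bool const b = (world.rank() == 1);
        bool any_of = false;
        // Logical OR over all ranks of a range of one bool.
        world.all_reduce_n(&b, 1, &any_of, std::logical_or<>{});
        assert(any_of);

        bool all_of = true;
        // Logical AND over all ranks of the same range.
        world.all_reduce_n(&b, 1, &all_of, std::logical_and<>{});
        assert(not all_of);
    }
    {
        assert(world.size() != 1);

        bool const b = (world.rank() == 1);
        // Same reductions, written with the operator shorthand on a single bool.
        bool const any_of = (world |= b);
        assert(any_of);

        bool const all_of = (world &= b);
        assert(not all_of);
    }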

The code above is part of the tests now: https://gitlab.com/correaa/boost-mpi3/-/blob/master/test/all_reduce.cpp#L88-109.

Please try again now that https://github.com/QMCPACK/qmcpack/pull/4458 is merged. If something still fails, you can test the MPI wrapper library bundled with QMCPACK on its own, like this:

cd external_codes/mpi_wrapper/mpi3
mkdir build ; cd build
module load mpi  # or equivalent
cmake ..
make
ctest

vvp-nsk commented 1 year ago

Hi!

Commit 820650999 solves the problem. Excellent work, indeed!

Some tests still fail because of a change introduced in Python 3.8:

The following tests FAILED:
        189 - converter_test_pyscf_to_afqmc_07-diamond_2x2x2_supercell (Failed)
        195 - converter_test_pyscf_to_afqmc_08-diamond_2x2x2_kpoint_sym (Failed)
        1928 - short-diamondC_afqmc_1x1x1_complex_cholesky-16-1 (Failed)
        1929 - short-diamondC_afqmc_1x1x1_complex_cholesky-16-1-EnergyEstim__nume_real (Failed)
        1930 - long-diamondC_afqmc_1x1x1_complex_cholesky-16-1 (Failed)
        1931 - long-diamondC_afqmc_1x1x1_complex_cholesky-16-1-EnergyEstim__nume_real (Failed)
        1932 - short-diamondC_afqmc_1x1x1_complex_thc-16-1 (Failed)
        1933 - short-diamondC_afqmc_1x1x1_complex_thc-16-1-EnergyEstim__nume_real (Failed)
        1934 - long-diamondC_afqmc_1x1x1_complex_thc-16-1 (Failed)
        1935 - long-diamondC_afqmc_1x1x1_complex_thc-16-1-EnergyEstim__nume_real (Failed)

The corresponding error message:

AttributeError: module 'time' has no attribute 'clock'
    tstart = time.clock()

The function time.clock() has been removed in Python 3.8, after having been deprecated since Python 3.3.
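
A possible fix on the script side, as a minimal sketch: replace the removed call with time.perf_counter() (wall-clock interval) or time.process_time() (CPU time). Which of the two the converter scripts actually want is an assumption here, and the helper name `timed` is hypothetical, not something taken from the QMCPACK sources.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds).

    Hypothetical helper for illustration only. It uses time.perf_counter(),
    which exists since Python 3.3, in place of the removed time.clock().
    Swap in time.process_time() if CPU time rather than elapsed wall-clock
    time is what should be measured.
    """
    tstart = time.perf_counter()  # was: tstart = time.clock()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - tstart
    return result, elapsed


if __name__ == "__main__":
    total, seconds = timed(sum, range(1000))
    print(f"sum = {total}, took {seconds:.6f} s")
```

Both replacement functions are available from Python 3.3 onward, so a change like this should not need a version check for any interpreter that can still run the scripts.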

Thank you!

With best regards, Victor