Closed vvp-nsk closed 1 year ago
Thanks for reporting this. Do you happen to know if this is a new problem introduced in v3.16.0 that was not present in v3.15.0? Any issues on other systems?
Hi!
If I recall correctly, v3.15.0 suffers from the same problem. I am not 100% sure, but I ran into a similar issue on an Intel-based IB cluster in the fall of last year.
Hi!
Just an update. The most recent SW stack by CRAY raises a new error message:
ham_factory
-------------------------------------------------------------------------------
/cfs/klemming/projects/snic/teobio/Victor/Develop/qmcpack-gnu-s11/src/AFQMC/Hamiltonians/tests/test_hamiltonian_factory.cpp:137
...............................................................................
/cfs/klemming/projects/snic/teobio/Victor/Develop/qmcpack-gnu-s11/src/AFQMC/Hamiltonians/tests/test_hamiltonian_factory.cpp:137: FAILED:
{Unknown expression after the reported line}
due to unexpected exception with message:
cannot call function void boost::mpi3::detail::call(Args ...) [with FT = int
(void*, int, int, int, int); FT* F = MPI_Bcast; Args = {bool*, int, boost::
mpi3::detail::basic_datatype<bool>, int, int}; decltype (static_cast<boost::
mpi3::error>((* F)((declval<Args>)()...)))* <anonymous> = 0]: Invalid
datatype, error stack:
PMPI_Bcast(454): MPI_Bcast(buf=0x7fff7ed71adb, count=1, MPI_DATATYPE_NULL,
root=0, comm=comm=0x84000001) failed
PMPI_Bcast(412): Datatype for argument datatype is a null datatype
The complete list of SW modules loaded is given below:
["PDCTEST"] = "22.06",
["atp"] = "3.14.11",
["boost"] = "1.79.0-cpeGNU-22.06",
["buildtools"] = "22.06",
["bzip2"] = "1.0.8",
["cpe"] = "22.06",
["cpeGNU"] = "22.06",
["cray-dsmml"] = "0.2.2",
["cray-fftw"] = "3.3.10.1",
["cray-hdf5-parallel"] = "1.12.1.5",
["cray-libsci"] = "22.06.1.3",
["cray-mpich"] = "8.1.17",
["cray-pmi"] = "6.1.3",
["cray-python"] = "3.9.12.1",
["craype"] = "2.7.16",
["craype-accel-host"] = "",
["craype-network-ofi"] = "",
["craype-x86-rome"] = "",
["gcc"] = "11.2.0",
["icu"] = "69.1",
["libfabric"] = "1.15.0.0",
["libxml2"] = "2.9.12",
["perftools-base"] = "22.06.0",
["snic-env"] = "1.0.0",
["systemdefault"] = "1.0.0",
["xpmem"] = "2.3.2-2.2_9.4__g93dd7ee.shasta",
["xz"] = "5.2.5",
["zlib"] = "1.2.11",
The following AFQMC tests FAILED:
44 - deterministic-unit_test_afqmc_hamiltonians_ham_chol_uc (Failed)
45 - deterministic-unit_test_afqmc_hamiltonians_ham_chol_sc (Failed)
46 - deterministic-unit_test_afqmc_hamiltonians_ham_thc_sc (Failed)
48 - deterministic-unit_test_afqmc_wfn_factory_ham_chol_uc_wfn_rhf (Failed)
49 - deterministic-unit_test_afqmc_wfn_factory_ham_chol_sc_wfn_rhf (Failed)
50 - deterministic-unit_test_afqmc_wfn_factory_ham_chol_sc_wfn_uhf (Failed)
51 - deterministic-unit_test_afqmc_wfn_factory_ham_chol_sc_wfn_msd (Failed)
52 - deterministic-unit_test_afqmc_wfn_factory_ham_thc_sc_wfn_rhf (Failed)
53 - deterministic-unit_test_afqmc_prop_factory_ham_chol_uc_wfn_rhf (Failed)
54 - deterministic-unit_test_afqmc_prop_factory_ham_chol_sc_wfn_rhf (Failed)
55 - deterministic-unit_test_afqmc_prop_factory_ham_chol_sc_wfn_uhf (Failed)
56 - deterministic-unit_test_afqmc_prop_factory_ham_chol_sc_wfn_msd (Failed)
57 - deterministic-unit_test_afqmc_prop_factory_ham_thc_sc_wfn_rhf (Failed)
58 - deterministic-unit_test_afqmc_estimators_ham_chol_uc_wfn_rhf (Failed)
59 - deterministic-unit_test_afqmc_estimators_ham_chol_sc_wfn_rhf (Failed)
60 - deterministic-unit_test_afqmc_estimators_ham_chol_sc_wfn_uhf (Failed)
61 - deterministic-unit_test_afqmc_estimators_ham_chol_sc_wfn_msd (Failed)
62 - deterministic-unit_test_afqmc_estimators_ham_thc_sc_wfn_rhf (Failed)
152 - converter_test_pyscf_to_afqmc_01-neon_atom (Failed)
158 - converter_test_pyscf_to_afqmc_02-neon_frozen_core (Failed)
164 - converter_test_pyscf_to_afqmc_03-carbon_triplet_uhf (Failed)
170 - converter_test_pyscf_to_afqmc_04-N2_nomsd (Failed)
176 - converter_test_pyscf_to_afqmc_05-N2_phmsd (Failed)
182 - converter_test_pyscf_to_afqmc_06-methane_converge_back_prop (Failed)
189 - converter_test_pyscf_to_afqmc_07-diamond_2x2x2_supercell (Failed)
195 - converter_test_pyscf_to_afqmc_08-diamond_2x2x2_kpoint_sym (Failed)
1928 - short-diamondC_afqmc_1x1x1_complex_cholesky-16-1 (Failed)
1929 - short-diamondC_afqmc_1x1x1_complex_cholesky-16-1-EnergyEstim__nume_real (Failed)
1930 - long-diamondC_afqmc_1x1x1_complex_cholesky-16-1 (Failed)
1931 - long-diamondC_afqmc_1x1x1_complex_cholesky-16-1-EnergyEstim__nume_real (Failed)
1932 - short-diamondC_afqmc_1x1x1_complex_thc-16-1 (Failed)
1933 - short-diamondC_afqmc_1x1x1_complex_thc-16-1-EnergyEstim__nume_real (Failed)
1934 - long-diamondC_afqmc_1x1x1_complex_thc-16-1 (Failed)
1935 - long-diamondC_afqmc_1x1x1_complex_thc-16-1-EnergyEstim__nume_real (Failed)
I am looking into this. Perhaps Cray MPI doesn't have a MPI_BOOL datatype.
Annoyingly, MPI implementations define some constants as Null when they are not available.
I am sorry about that. Unfortunately, the CRAY-MPICH distribution is only available on Dardel. Probably, one should check for MPI_CXX_BOOL rather than MPI_BOOL?
It is MPI_CXX_BOOL that gives a Null value when compiling without CXX support. (Compiling without CXX support is correct because the old CXX bindings are deprecated.)
Now I have replaced MPI_CXX_BOOL with MPI_C_BOOL.
I also took the opportunity to upgrade the sublibrary to release v0.81: https://github.com/QMCPACK/qmcpack/pull/4458
This version also allows convenient reductions on bool values. The two snippets below do the same thing: the first reduces a range of one bool, and the second operates directly on a stack variable.
{
    assert(world.size() != 1);
    bool const b = (world.rank() == 1);
    bool any_of = false;
    world.all_reduce_n(&b, 1, &any_of, std::logical_or<>{});
    assert(any_of);
    bool all_of = true;
    world.all_reduce_n(&b, 1, &all_of, std::logical_and<>{});
    assert(not all_of);
}
{
    assert(world.size() != 1);
    bool const b = (world.rank() == 1);
    bool const any_of = (world |= b);
    assert(any_of);
    bool const all_of = (world &= b);
    assert(not all_of);
}
The code above is part of the tests now: https://gitlab.com/correaa/boost-mpi3/-/blob/master/test/all_reduce.cpp#L88-109.
Please try again, as https://github.com/QMCPACK/qmcpack/pull/4458 is merged. If something fails, you can test the MPI library inside qmcpack, like this:
cd external_codes/mpi_wrapper/mpi3
mkdir build ; cd build
module load mpi # or equivalent
cmake ..
make
ctest
Hi!
The commit '820650999' solves the problem. Excellent work, indeed!
Some tests still fail because of a change introduced in Python 3.8:
The following tests FAILED:
189 - converter_test_pyscf_to_afqmc_07-diamond_2x2x2_supercell (Failed)
195 - converter_test_pyscf_to_afqmc_08-diamond_2x2x2_kpoint_sym (Failed)
1928 - short-diamondC_afqmc_1x1x1_complex_cholesky-16-1 (Failed)
1929 - short-diamondC_afqmc_1x1x1_complex_cholesky-16-1-EnergyEstim__nume_real (Failed)
1930 - long-diamondC_afqmc_1x1x1_complex_cholesky-16-1 (Failed)
1931 - long-diamondC_afqmc_1x1x1_complex_cholesky-16-1-EnergyEstim__nume_real (Failed)
1932 - short-diamondC_afqmc_1x1x1_complex_thc-16-1 (Failed)
1933 - short-diamondC_afqmc_1x1x1_complex_thc-16-1-EnergyEstim__nume_real (Failed)
1934 - long-diamondC_afqmc_1x1x1_complex_thc-16-1 (Failed)
1935 - long-diamondC_afqmc_1x1x1_complex_thc-16-1-EnergyEstim__nume_real (Failed)
The corresponding error message:
AttributeError: module 'time' has no attribute 'clock'
tstart = time.clock()
The function time.clock() was removed in Python 3.8, after having been deprecated since Python 3.3.
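For anyone hitting the same AttributeError, here is a minimal sketch of a possible workaround. It assumes the script only uses time.clock() to measure elapsed time. The helper name clock() and the stand-in workload are hypothetical and not taken from the QMCPACK converter scripts. time.perf_counter() has been available since Python 3.3.

```python
import time


def clock():
    """Hypothetical drop-in replacement for the removed time.clock().

    Assumption: the caller only needs differences between two readings,
    which is what time.perf_counter() is meant for.
    """
    return time.perf_counter()


# Same pattern as the failing line `tstart = time.clock()`.
tstart = clock()
total = sum(i * i for i in range(100000))  # stand-in workload, not from the converter
elapsed = clock() - tstart
print(f"elapsed: {elapsed:.6f} s")
```

If CPU time rather than wall-clock time is what the script intends to report, time.process_time() might be the closer match. Which one fits depends on how the timing is used.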
Thank you!
With best regards, Victor
Describe the bug
Any AFQMC run on CRAY EX fails with the following error message:
To Reproduce
Steps to reproduce the behavior:
cmake -DQMC_COMPLEX=1 -DBUILD_AFQMC=ON -DCMAKE_SYSTEM_NAME=CrayLinuxEnvironment
srun qmcpack_complex afqmc.xml
System: