block-hczhai / block2-preview

Efficient parallel quantum chemistry DMRG in MPO formalism
GNU General Public License v3.0

NPDMs for SC-NEVPT2 #72

Closed: MihkuU closed this issue 10 months ago

MihkuU commented 11 months ago

Hi, I am trying to interface your library with our C++ code for performing DMRG calculations. I would also like to be able to perform SC-NEVPT2 calculations, and for that I need the 3RDM, Aux16, and Aux22 matrices. Is there a keyword to obtain these quantities, e.g. like the "restart_nevpt2_npdm" keyword in the older Block code? I have been looking through the readthedocs pages and the source code but am having some difficulty finding it.

Thanks in advance

hczhai commented 11 months ago

Hi, thanks for your interest in using block2.

You can find the explanation of the restart_nevpt2_npdm keyword in https://block2.readthedocs.io/en/latest/user/keywords.html. The required input file and output format are exactly the same as those in the old version of Block. This is supported when you run block2 using the block2main executable. See https://block2.readthedocs.io/en/latest/user/basic.html.
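
For reference, a restart input for block2main might look roughly like the following. This is a minimal sketch only, not a tested input: it assumes a converged MPS from a previous DMRG run is already in the scratch folder, the FCIDUMP filename and the nelec/spin/irrep values are illustrative, and the keyword set should be checked against the keywords page linked above.

sym d2h
orbitals O2.ACTIVE.FCIDUMP

nelec 8
spin 2
irrep 1

hf_occ integral
schedule default
maxM 500

fullrestart
restart_nevpt2_npdm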

MihkuU commented 11 months ago

Thank you for the answer.

In principle we could install block2main separately with pip and call it from our program to do the DMRG calculation. The problem, however, is that we would like to run these calculations across multiple nodes. That would mean both executables have to be launched by the queueing system, whereas we want to call Block2 conditionally from our main program while running over multiple nodes, and I don't know how to do that. It would be much more straightforward to link the Block2 'libblock2.so' file and call the Block2 functions from our program; then multi-node parallelization is handled automatically. So I am wondering, are there any plans to add the NEVPT2 capabilities to the Block2 C++ interface?

hczhai commented 11 months ago

Thanks for providing the detailed explanation.

You can use the following code to compute the NEVPT2 intermediates (3pdm, A16, and A22) using only the C++ interface of block2 (with MPI support):


#include "block2.hpp"

int main() {
    using FL = double;
    using S = block2::SU2;

    // Initialize DMRGDriver
    const size_t stack_mem = 1ULL << 30; // 1 GB
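    // note: stack_mem is allocated per MPI process, so the total memory
    // required scales with the number of MPI ranks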
    const std::string scratch = "./tmp";
    const int n_threads = 4;
    std::shared_ptr<block2::DMRGDriver<S, FL>> driver =
        std::make_shared<block2::DMRGDriver<S, FL>>(stack_mem, scratch, "", n_threads);

    // Set MPI
#ifdef _HAS_MPI
    std::shared_ptr<block2::ParallelCommunicator<S>> para_comm = std::make_shared<block2::MPICommunicator<S>>();
#else
    std::shared_ptr<block2::ParallelCommunicator<S>> para_comm =
        std::make_shared<block2::ParallelCommunicator<S>>(1, 0, 0);
#endif
    std::shared_ptr<block2::ParallelRuleSimple<S, FL>> prule =
        std::make_shared<block2::ParallelRuleSimple<S, FL>>(block2::ParallelSimpleTypes::IJ, para_comm);
    driver->prule = prule;

    // Load Integrals (active space)
    std::shared_ptr<block2::FCIDUMP<FL>> fd = std::make_shared<block2::FCIDUMP<FL>>();
    fd->read("O2.ACTIVE.FCIDUMP");
    std::shared_ptr<block2::ParallelFCIDUMP<S, FL>> pfd = std::make_shared<block2::ParallelFCIDUMP<S, FL>>(fd, prule);
    std::shared_ptr<block2::GeneralFCIDUMP<FL>> expr =
        block2::GeneralFCIDUMP<FL>::initialize_from_qc(pfd, block2::ElemOpTypes::SU2, 1E-14)->adjust_order();

    // Sweep Schedule
    const std::vector<block2::ubond_t> bond_dims = {250, 250, 500, 500};
    const std::vector<FL> noises = {1E-4, 1E-5, 0.0};
    const std::vector<FL> thrds = {1E-7, 1E-8, 1E-9, 1E-10};
    const int n_sweeps = 10;
    const double sweep_tol = 1E-8;
    const int verbose = 2;

    // MPO & MPS
    driver->initialize_system(expr->n_sites(), expr->n_elec(), expr->twos());
    std::shared_ptr<block2::MPO<S, FL>> mpo =
        driver->get_mpo(expr, verbose, 1E-14, block2::MPOAlgorithmTypes::FastBipartite);
    std::shared_ptr<block2::MPS<S, FL>> ket = driver->get_random_mps("KET", bond_dims[0]);

    // DMRG & NPDM
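    // The four expressions below request, in order: the 3PDM in StackBlock ordering,
    // the 2PDM, the 3PDM, and the 4PDM (unpacked further down as dm3_x, dm2, dm3, dm4)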
    const std::vector<std::string> npdm_exprs = {"((C+((C+(C+D)0)1+D)0)1+D)0", "((C+D)0+(C+D)0)0",
                                                 "((C+D)0+((C+D)0+(C+D)0)0)0", "((C+D)0+((C+D)0+((C+D)0+(C+D)0)0)0)0"};
    const block2::ExpectationAlgorithmTypes npdm_algo_type =
        block2::ExpectationAlgorithmTypes::SymbolFree | block2::ExpectationAlgorithmTypes::Compressed;
    driver->dmrg(mpo, ket, n_sweeps, sweep_tol, bond_dims, noises, thrds, verbose);
    std::vector<std::shared_ptr<block2::GTensor<FL>>> npdms =
        driver->get_npdm(npdm_exprs, ket, ket, 0, npdm_algo_type, verbose);

    if (prule->is_root()) {
        const MKL_INT n_orbs = ket->n_sites;

        auto write_array = [n_orbs](const std::string &fn, block2::NDArray arr) {
            std::ofstream ofs(fn.c_str());
            ofs << n_orbs << std::endl;
            for (size_t i = 0; i < arr.size(); i++) {
                for (auto idx : arr.decompose_linear_index(i))
                    ofs << std::setw(2) << idx << " ";
                ofs << std::setw(20) << std::fixed << std::setprecision(14) << arr.data[arr.linear_index(i)]
                    << std::endl;
            }
            ofs.close();
        };

        block2::NDArray h1e = block2::NDArray(std::vector<MKL_INT>(2, n_orbs), fd->h1e_matrix());
        block2::NDArray g2e = block2::NDArray(std::vector<MKL_INT>(4, n_orbs), fd->g2e_1fold()).transpose({0, 2, 1, 3});
        block2::NDArray dm3_x = block2::NDArray(npdms[0]->shape, *npdms[0]->data); // dm3 in stackblock format
        block2::NDArray dm2 = block2::NDArray(npdms[1]->shape, *npdms[1]->data);
        block2::NDArray dm3 = block2::NDArray(npdms[2]->shape, *npdms[2]->data);
        block2::NDArray dm4 = block2::NDArray(npdms[3]->shape, *npdms[3]->data);
        npdms = std::vector<std::shared_ptr<block2::GTensor<FL>>>();

        // Write 3PDM in StackBlock format
        write_array("spatial_threepdm.0.0.txt", dm3_x);
        dm3_x = block2::NDArray();

        // Compute A16
        block2::NDArray f3ac = block2::NDArray::einsum("ijka,rpqbjcik->rpqbac", {g2e, dm4});
        block2::NDArray f3ca = block2::NDArray::einsum("kcij,rpqbajki->rpqbac", {g2e, dm4});
        dm4 = block2::NDArray();

        block2::NDArray a16 = -block2::NDArray::einsum("ib,rpqiac->pqrabc", {h1e, dm3});
        a16 += block2::NDArray::einsum("ia,rpqbic->pqrabc", {h1e, dm3});
        a16 -= block2::NDArray::einsum("ci,rpqbai->pqrabc", {h1e, dm3});

        a16 -= f3ca.transpose({1, 4, 0, 2, 5, 3});
        a16 -= block2::NDArray::einsum("kbia,rpqcki->pqrabc", {g2e, dm3});
        a16 -= block2::NDArray::einsum("kbaj,rpqjkc->pqrabc", {g2e, dm3});
        a16 += block2::NDArray::einsum("cbij,rpqjai->pqrabc", {g2e, dm3});
        block2::NDArray fdm2 = block2::NDArray::einsum("kbij,rpajki->prab", {g2e, dm3});
        std::vector<block2::NDArraySlice> sl(6, block2::NDArraySlice());
        for (MKL_INT k = 0; k < n_orbs; k++) {
            sl[1] = sl[5] = block2::NDArraySlice(k, -1, 0);
            a16.slice(sl) += fdm2;
        }

        a16 += f3ac.transpose({1, 2, 0, 4, 3, 5});
        a16 -= f3ca.transpose({1, 2, 0, 4, 3, 5});
        a16 += block2::NDArray::einsum("jbij,rpqiac->pqrabc", {g2e, dm3});
        a16 -= block2::NDArray::einsum("cjka,rpqbjk->pqrabc", {g2e, dm3});
        a16 += block2::NDArray::einsum("jcij,rpqbai->pqrabc", {g2e, dm3});

        // Write A16 in StackBlock format
        write_array("A16_matrix.0.0.txt", a16);
        a16 = block2::NDArray();

        // Compute A22
        block2::NDArray a22 = -block2::NDArray::einsum("pb,kipjac->ijkabc", {h1e, dm3});
        a22 -= block2::NDArray::einsum("pa,kibjpc->ijkabc", {h1e, dm3});
        a22 += block2::NDArray::einsum("cp,kibjap->ijkabc", {h1e, dm3});
        a22 += block2::NDArray::einsum("cqra,kibjqr->ijkabc", {g2e, dm3});
        a22 -= block2::NDArray::einsum("qcpq,kibjap->ijkabc", {g2e, dm3});

        a22 -= f3ac.transpose({1, 5, 0, 2, 4, 3});
        fdm2 = block2::NDArray::einsum("pqrb,kiqcpr->ikbc", {g2e, dm3});
        sl = std::vector<block2::NDArraySlice>(6, block2::NDArraySlice());
        for (MKL_INT k = 0; k < n_orbs; k++) {
            sl[1] = sl[3] = block2::NDArraySlice(k, -1, 0);
            a22.slice(sl) -= fdm2;
        }
        a22 -= block2::NDArray::einsum("pqab,kiqjpc->ijkabc", {g2e, dm3});
        a22 += block2::NDArray::einsum("pcrb,kiajpr->ijkabc", {g2e, dm3});
        a22 += block2::NDArray::einsum("cqrb,kiqjar->ijkabc", {g2e, dm3});

        a22 -= f3ac.transpose({1, 3, 0, 4, 2, 5});
        a22 += f3ca.transpose({1, 3, 0, 4, 2, 5});
        a22 += block2::NDArray::einsum("jb,kiac->ijkabc", {h1e, dm2}) * 2.0;
        a22 += block2::NDArray::einsum("pjrb,kiprac->ijkabc", {g2e, dm3}) * 2.0;
        fdm2 = block2::NDArray::einsum("pa,kipc->ikac", {h1e, dm2});
        fdm2 -= block2::NDArray::einsum("cp,kiap->ikac", {h1e, dm2});
        fdm2 -= block2::NDArray::einsum("cqra,kiqr->ikac", {g2e, dm2});
        fdm2 += block2::NDArray::einsum("qcpq,kiap->ikac", {g2e, dm2});
        fdm2 += block2::NDArray::einsum("pqra,kiqcpr->ikac", {g2e, dm3});
        fdm2 -= block2::NDArray::einsum("rcpq,kiaqrp->ikac", {g2e, dm3});
        sl = std::vector<block2::NDArraySlice>(6, block2::NDArraySlice());
        for (MKL_INT k = 0; k < n_orbs; k++) {
            sl[1] = sl[4] = block2::NDArraySlice(k, -1, 0);
            a22.slice(sl) += fdm2 * 2.0;
        }

        // Write A22 in StackBlock format
        write_array("A22_matrix.0.0.txt", a22);
        a22 = block2::NDArray();
    }

    // Release memory
    driver = nullptr;

    return 0;
}

To link this code with libblock2.so, please follow these steps:

  1. Compile the libblock2.so library. The most recent block2 source code, including commit https://github.com/block-hczhai/block2-preview/commit/d61d4610aa9e527fbf89953f7870005dcf9fd8e5, is required. Build it with
mkdir build
cd build
cmake .. -DUSE_MKL=ON -DBUILD_CLIB=ON -DMPI=ON -DLARGE_BOND=ON -DUSE_IC=ON -DCMAKE_INSTALL_PREFIX=<INSTALL_DIR>
make
make install

where <INSTALL_DIR> is a folder for storing the include, cmake config, and library files for libblock2.so.

  2. Go to the folder where you have the above main.cpp and create the following CMakeLists.txt:
CMAKE_MINIMUM_REQUIRED(VERSION 3.0)

FIND_PROGRAM(CMAKE_C_COMPILER NAMES $ENV{CC} gcc PATHS ENV PATH NO_DEFAULT_PATH)
FIND_PROGRAM(CMAKE_CXX_COMPILER NAMES $ENV{CXX} g++ PATHS ENV PATH NO_DEFAULT_PATH)

PROJECT(main VERSION 1.0 LANGUAGES CXX)

FIND_PACKAGE(block2 2.0 CONFIG REQUIRED)
FIND_PACKAGE(MPI)

# FIND MKL LIBS - must match those used when building block2
SET(CMAKE_FIND_LIBRARY_SUFFIXES_BKP ${CMAKE_FIND_LIBRARY_SUFFIXES})
SET(CMAKE_FIND_LIBRARY_SUFFIXES "${CMAKE_FIND_LIBRARY_SUFFIXES_BKP};.so.1;.1.dylib;.so.2;.2.dylib;.so.3;.3.dylib;.so.4;.4.dylib")
FIND_LIBRARY(MKL_LIB_LP NAMES mkl_intel_lp64 PATHS $ENV{MKLROOT}/lib $ENV{MKLROOT}/lib/intel64 /usr/local/lib)
FIND_LIBRARY(MKL_LIB_CORE NAMES mkl_core PATHS $ENV{MKLROOT}/lib $ENV{MKLROOT}/lib/intel64 /usr/local/lib)
FIND_LIBRARY(MKL_LIB_GT NAMES mkl_gnu_thread PATHS $ENV{MKLROOT}/lib $ENV{MKLROOT}/lib/intel64 /usr/local/lib)
FIND_LIBRARY(MKL_LIB_AVX NAMES mkl_avx2 PATHS $ENV{MKLROOT}/lib $ENV{MKLROOT}/lib/intel64 /usr/local/lib)
FIND_LIBRARY(MKL_LIB_AVX512 NAMES mkl_avx512 PATHS $ENV{MKLROOT}/lib $ENV{MKLROOT}/lib/intel64 /usr/local/lib)
SET(MKL_LIBS "-Wl,--no-as-needed" ${MKL_LIB_LP} ${MKL_LIB_CORE} ${MKL_LIB_GT} ${OMP_LIB_NAME} ${MKL_LIB_AVX} ${MKL_LIB_AVX512})
SET(CMAKE_FIND_LIBRARY_SUFFIXES ${CMAKE_FIND_LIBRARY_SUFFIXES_BKP})
MESSAGE(STATUS "MKL_LIBS = ${MKL_LIBS}")

# FIND OPENMP LIBS - must match those used when building block2
FIND_LIBRARY(OMP_LIBS NAMES gomp PATHS /usr/local/lib /usr/lib64)
IF (NOT OMP_LIBS)
    EXECUTE_PROCESS(
        COMMAND ${CMAKE_CXX_COMPILER} -print-search-dirs
        COMMAND grep "^lib" COMMAND awk -F "=" "{print $2}" COMMAND tr ":" ";"
        OUTPUT_VARIABLE OMP_LIB_HINT OUTPUT_STRIP_TRAILING_WHITESPACE)
    FIND_LIBRARY(OMP_LIBS NAMES gomp PATHS ${OMP_LIB_HINT})
ENDIF()

ADD_EXECUTABLE(main main.cpp)
TARGET_LINK_LIBRARIES(main PRIVATE ${OMP_LIBS} block2::block2 rt ${MKL_LIBS} ${MPI_CXX_LIBRARIES})
TARGET_COMPILE_OPTIONS(main BEFORE PRIVATE -O3 -funroll-loops -Werror -Werror=return-type -fopenmp ${MPI_COMPILE_FLAGS})
  3. Use cmake to compile main.cpp:
mkdir build
cd build
export MKLROOT=<MKLROOT>
cmake .. -Dblock2_DIR=<INSTALL_DIR>/share/cmake/block2
make

Remember to use the correct <MKLROOT> and <INSTALL_DIR>.

  4. Use the following Python script (or any other program) to generate a valid integral file O2.ACTIVE.FCIDUMP for this calculation:
from pyscf import gto, scf, mcscf
from pyblock2._pyscf.ao2mo import integrals as itg
from pyblock2.driver.core import DMRGDriver, SymmetryTypes

mol = gto.M(atom='O 0 0 0; O 0 0 1.207', basis='cc-pvdz', spin=2, verbose=4)
mf = scf.RHF(mol).run(conv_tol=1E-20)

mc = mcscf.CASSCF(mf, 6, 8)
mc.fcisolver.conv_tol = 1e-14
mc.canonicalization = True
mc.natorb = True
mc.run()

mf.mo_coeff = mc.mo_coeff

ncas, n_elec, spin, ecore, h1e, g2e, orb_sym = itg.get_rhf_integrals(mf, ncore=mc.ncore, ncas=mc.ncas, g2e_symm=8)
driver = DMRGDriver(scratch="./tmp", symm_type=SymmetryTypes.SU2)
driver.initialize_system(n_sites=ncas, n_elec=n_elec, spin=spin, orb_sym=orb_sym, singlet_embedding=False)
driver.write_fcidump(h1e, g2e, ecore, filename='O2.ACTIVE.FCIDUMP', h1e_symm=True, pg='d2h')
  5. You can now run the calculation using ./main or mpirun -n 4 ./main. The 3PDM, A16, and A22 are stored as text files.

MihkuU commented 11 months ago

Thanks for the in-depth answer! I gave it a try and encountered a few issues: 1) I had to fix some of the paths in the 'block2Config.cmake' file inside Block2. I set 'PACKAGE_PREFIX_DIR' to the installation directory, and I replaced the line 'INCLUDE("${CMAKE_CURRENT_LIST_DIR}/${PN}Targets.cmake")' with a path to /share/cmake/block2/block2Targets.cmake. I don't know if that is correct or not; I was just trying to get it to work.

2) Finally, I was able to run 'main' a few times, but it crashed occasionally in parallel. The values of the matrix elements also varied a bit with different numbers of processes. The numbers looked a bit odd, but I am no expert. I'll attach my 3PDM file just in case: spatial_threepdm.0.0.txt

To give you a clearer idea of what I am looking for: we are using the old Block code in our multiconfigurational quantum chemistry program to do DMRGSCF and SC-NEVPT2 calculations. Due to incompatibility with newer compilers and libraries, we would like to upgrade. The main quantities of interest are the 1-, 2-, and 3-PDMs, Aux16, Aux22, and of course the electronic energy.

hczhai commented 11 months ago

Could you show me the output of main when you run it in parallel?

MihkuU commented 11 months ago

Sure, from mpirun -n 2 main: out.txt

hczhai commented 11 months ago

Thanks. The output looks okay, but the 3pdm is not correct. The problem is likely that you did not compile the code correctly. To make the compilation easier, let's try the following alternative approach:

  1. Start with a fresh copy of the most recent version of the block2 source code.
  2. In the block2 source code, replace the src/main.cpp with the main.cpp that I provided above.
  3. Compile block2 with the replaced main.cpp to get an executable, using cmake .. -DUSE_MKL=ON -DBUILD_CLIB=OFF -DMPI=ON -DLARGE_BOND=ON -DUSE_IC=ON. We do not need make install, and you should get the block2 executable in the build directory. Let me know whether mpirun -n 2 ./block2 works or not. (The combined commands are sketched after this list.)
  4. The reference 3pdm is spatial_threepdm.0.0-ref.txt
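
Putting steps 1-3 together, the build might look roughly like this (a sketch only; the clone directory, the /path/to/main.cpp placeholder, and the mpirun invocation are illustrative):

# start from a fresh clone of block2
git clone https://github.com/block-hczhai/block2-preview.git
cd block2-preview
# replace src/main.cpp with the main.cpp posted earlier in this thread
cp /path/to/main.cpp src/main.cpp
mkdir build
cd build
cmake .. -DUSE_MKL=ON -DBUILD_CLIB=OFF -DMPI=ON -DLARGE_BOND=ON -DUSE_IC=ON
make
# the block2 executable is produced in the build directory
mpirun -n 2 ./block2
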
MihkuU commented 10 months ago

I followed your instructions. I get similar values, but after the 6th digit or so the values differ a bit. Is there some stochastic element in use? If I use more than 8 cores, it crashes.

Result with 2 cores: spatial_threepdm.0.0-2cores.txt

My FCIDUMP: O2.ACTIVE.FCIDUMP.txt

hczhai commented 10 months ago

I get similar values, but after the 6th digit or so the values differ a bit.

This is okay. We start from a random initial guess for the MPS.

Result with 2 cores:

In main.cpp I set const int n_threads = 4, so when you run mpirun -n 1 it already uses 4 cores (threads).

If I use more than 8 cores, it crashes.

If you run mpirun -n 8 with const int n_threads = 4 on a single node, you need at least 32 cores in that node. Do you have enough cores in the node? When you say it crashes, what exactly do you mean by "crash"?

MihkuU commented 10 months ago

Thanks, I didn't notice the number of threads; now the performance makes more sense. My CPU has 12 cores and 24 threads, so it doesn't make sense to run with many MPI processes.

"Crashed" run with 9 MPI procs: broken.txt

hczhai commented 10 months ago

"Crashed" run with 9 MPI procs

You did not say how many threads are used for each MPI process. If you still use 4 threads, this setting does not make sense on this computer. Also note that I set const size_t stack_mem = 1ULL << 30; // 1 GB, which is the memory per process. When you run mpirun -n 9 you need at least 9 GB of free memory; otherwise you should decrease this number.

My CPU has 12 cores and 24 threads

Hyperthreading is not useful for high-performance computing, so we consider this machine as having only 12 physical cores.

Anyway, I think there are basically no further problems with parallel execution. You can adapt main.cpp for interfacing with your C++ program. Please let me know whether you have any other questions.
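
If it helps as a starting point for that interfacing, below is a minimal sketch (not part of block2; the function name read_npdm_text and the row-major index convention are assumptions) of how the text files written by write_array in the main.cpp above could be read back into a flat array on the caller's side:

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of block2): read one of the text files produced by
// write_array in main.cpp above. The file starts with n_orbs, followed by one line
// per element: `ndim` orbital indices and the value. A row-major (C-order) layout
// of the returned flat array is assumed here.
std::vector<double> read_npdm_text(const std::string &fn, int ndim, int &n_orbs) {
    std::ifstream ifs(fn.c_str());
    ifs >> n_orbs;
    size_t sz = 1;
    for (int d = 0; d < ndim; d++)
        sz *= (size_t)n_orbs;
    std::vector<double> data(sz, 0.0);
    std::vector<size_t> idx(ndim);
    double val;
    while (ifs >> idx[0]) {
        for (int d = 1; d < ndim; d++)
            ifs >> idx[d];
        ifs >> val;
        size_t lin = 0;
        for (int d = 0; d < ndim; d++)
            lin = lin * (size_t)n_orbs + idx[d]; // row-major linear index
        data[lin] = val;
    }
    return data;
}

// Example usage: int n = 0; auto dm3 = read_npdm_text("spatial_threepdm.0.0.txt", 6, n);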