block-hczhai / block2-preview

Efficient parallel quantum chemistry DMRG in MPO formalism
GNU General Public License v3.0

DMRGSCF with pyscf fails on mpirun with certain imports #89

Closed: HehnLukas closed this issue 6 months ago

HehnLukas commented 7 months ago

Hello,

I have encountered a strange issue with block2 and pyscf 2.4.0, and I am not sure which package the issue comes from. When I run a simple DMRG calculation with the code below, the calculation fails if I include the imports of cc or ci, but it runs normally when I remove those. The order of the imports does not seem to matter.

(I also posted this on the PySCF GitHub: https://github.com/pyscf/pyscf/issues/2168)

import pyscf
# The following lines make the script fail
from pyscf import ci
from pyscf import cc
# The following line does not
from pyscf import gto, scf, mcscf, ao2mo, fci, mp

from pyscf import dmrgscf

def do_simple_dmrg(mf, orbitals, electrons, caslist):
    print("in the function")
    mc = dmrgscf.DMRGSCF(mf,orbitals,electrons)
    mo = mc.sort_mo(caslist)
    mc.fcisolver = dmrgscf.DMRGCI(mc.mol,  memory = 300, num_thrds = 64)
    mc.max_cycle = 30
    emc = mc.kernel(mo)[0]
    return emc

x_n = 1
atom_n2 = [
            ['N', ( 0.5*x_n, 0.    , 0.    )],
            ['N', ( -0.5*x_n, 0.    , 0.    )],]
mol = pyscf.gto.Mole()

mol.verbose = 4
mol.atom = atom_n2
mol.basis = 'cc-pvdz'
mol.build()
mf = mol.UHF(mol)
mf.kernel()
ehf = mf.e_tot
caslist = [5,6,7,8]
dmrgmc = do_simple_dmrg(mf,4,4,caslist)

This is the message I get on failure:

******** Block flags ********
executable             = /gpfs/software/milan9/v01/apps/linux-rhel9-zen3/gcc-12.2.0/py-pyblock2-0.5.2-iei4go3vu7nkibxtrknlruhcfnoyqjk3/bin/block2main
BLOCKEXE_COMPRESS_NEVPT= /gpfs/software/milan9/v01/apps/linux-rhel9-zen3/gcc-12.2.0/py-pyblock2-0.5.2-iei4go3vu7nkibxtrknlruhcfnoyqjk3/bin/block2main
Block version          = 0.5.2
mpiprefix              = mpirun
scratchDirectory       = /gpfs-hot/work/pbs.2985856.pbsp1.hpc.basf.net/1826913
integralFile           = ./FCIDUMP
configFile             = ./dmrg.conf
outputFile             = ./dmrg.out
maxIter                = 32
scheduleSweeps         = [0, 4, 8, 12, 14, 16, 18, 20]
scheduleMaxMs          = [200, 400, 800, 1000, 1000, 1000, 1000, 1000]
scheduleTols           = [0.0001, 0.0001, 0.0001, 0.0001, 1e-05, 1.0000000000000002e-06, 1.0000000000000002e-07, 1e-08]
scheduleNoises         = [0.0001, 0.0001, 0.0001, 0.0001, 1e-05, 1.0000000000000002e-06, 1.0000000000000002e-07, 0.0]
twodot_to_onedot       = 24
tol                    = 1e-07
maxM                   = 1000
dmrg switch tol        = 0.001
wfnsym                 = 1
fullrestart            = False
num_thrds              = 64
memory                 = 300

Traceback (most recent call last):
  File "/gpfs/users/home/hehnl/maniqu/pyscf/collab_paper/other_molecules/N_2/zs1_test/dmrg_test/error_script/./script.py", line 28, in <module>
    dmrgmc = do_simple_dmrg(mf,4,4,caslist)
  File "/gpfs/users/home/hehnl/maniqu/pyscf/collab_paper/other_molecules/N_2/zs1_test/dmrg_test/error_script/./script.py", line 11, in do_simple_dmrg
    emc = mc.kernel(mo)[0]
  File "/gpfs/software/milan9/v01/apps/linux-rhel9-zen3/gcc-12.2.0/py-pyscf-2.4.0-7cn3utldmxrznr7xznylqzisq77shvgk/lib/python3.10/site-packages/pyscf/mcscf/mc1step.py", line 861, in kernel
    _kern(self, mo_coeff,
  File "/gpfs/software/milan9/v01/apps/linux-rhel9-zen3/gcc-12.2.0/py-pyscf-2.4.0-7cn3utldmxrznr7xznylqzisq77shvgk/lib/python3.10/site-packages/pyscf/mcscf/mc1step.py", line 351, in kernel
    e_tot, e_cas, fcivec = casscf.casci(mo, ci0, eris, log, locals())
  File "/gpfs/software/milan9/v01/apps/linux-rhel9-zen3/gcc-12.2.0/py-pyscf-2.4.0-7cn3utldmxrznr7xznylqzisq77shvgk/lib/python3.10/site-packages/pyscf/mcscf/mc1step.py", line 878, in casci
    e_tot, e_cas, fcivec = casci.kernel(fcasci, mo_coeff, ci0, log,
  File "/gpfs/software/milan9/v01/apps/linux-rhel9-zen3/gcc-12.2.0/py-pyscf-2.4.0-7cn3utldmxrznr7xznylqzisq77shvgk/lib/python3.10/site-packages/pyscf/mcscf/casci.py", line 608, in kernel
    e_tot, fcivec = casci.fcisolver.kernel(h1eff, eri_cas, ncas, nelecas,
  File "/gpfs/software/milan9/v01/apps/linux-rhel9-zen3/gcc-12.2.0/py-pyscf-dmrgscf-0.1.0-klaocx7k4iybiqyzm4ayphfgpwgbf75x/lib/python3.10/site-packages/pyscf/dmrgscf/dmrgci.py", line 713, in kernel
    writeIntegralFile(self, h1e, eri, norb, nelec, ecore)
  File "/gpfs/software/milan9/v01/apps/linux-rhel9-zen3/gcc-12.2.0/py-pyscf-dmrgscf-0.1.0-klaocx7k4iybiqyzm4ayphfgpwgbf75x/lib/python3.10/site-packages/pyscf/dmrgscf/dmrgci.py", line 932, in writeIntegralFile
    check_call(cmd, shell=True)
  File "/gpfs/software/milan9/v01/apps/linux-rhel9-zen3/gcc-12.2.0/python-3.10.8-nqfwypz23qwrtchqyswjbmf4ehfubk6g/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'mpirun mkdir -p /gpfs-hot/work/pbs.2985856.pbsp1.hpc.basf.net/1826913' returned non-zero exit status 1.
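
The failing step is the shell command in the last line of the traceback rather than block2 itself. As a quick check (a sketch only; /tmp/dmrg_mkdir_test is a throwaway placeholder path), the same kind of command can be run by hand inside the job environment, with and without the extra imports, to see whether the mpirun launch itself fails:

import subprocess

# Uncomment the next line to test whether the extra imports change the result:
# from pyscf import ci, cc

test_dir = "/tmp/dmrg_mkdir_test"  # throwaway placeholder directory
for cmd in ("mkdir -p " + test_dir, "mpirun mkdir -p " + test_dir):
    print(cmd, "-> exit status", subprocess.call(cmd, shell=True))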

While without the imports it runs normally:

******** Block flags ********
executable             = /gpfs/software/milan9/v01/apps/linux-rhel9-zen3/gcc-12.2.0/py-pyblock2-0.5.2-iei4go3vu7nkibxtrknlruhcfnoyqjk3/bin/block2main
BLOCKEXE_COMPRESS_NEVPT= /gpfs/software/milan9/v01/apps/linux-rhel9-zen3/gcc-12.2.0/py-pyblock2-0.5.2-iei4go3vu7nkibxtrknlruhcfnoyqjk3/bin/block2main
Block version          = 0.5.2
mpiprefix              = mpirun
scratchDirectory       = /gpfs-hot/work/pbs.2984668.pbsp1.hpc.basf.net/2725084
integralFile           = ./FCIDUMP
configFile             = ./dmrg.conf
outputFile             = ./dmrg.out
maxIter                = 32
scheduleSweeps         = [0, 4, 8, 12, 14, 16, 18, 20]
scheduleMaxMs          = [200, 400, 800, 1000, 1000, 1000, 1000, 1000]
scheduleTols           = [0.0001, 0.0001, 0.0001, 0.0001, 1e-05, 1.0000000000000002e-06, 1.0000000000000002e-07, 1e-08]
scheduleNoises         = [0.0001, 0.0001, 0.0001, 0.0001, 1e-05, 1.0000000000000002e-06, 1.0000000000000002e-07, 0.0]
twodot_to_onedot       = 24
tol                    = 1e-07
maxM                   = 1000
dmrg switch tol        = 0.001
wfnsym                 = 1
fullrestart            = False
num_thrds              = 64
memory                 = 300

CASCI E = -108.028456125033  S^2 = 0.0000000
Set conv_tol_grad to 0.000316228
macro iter   1 ( 21 JK    4 micro), CASSCF E = -108.084219915079  dE = -5.57637900e-02  S^2 = 0.0000000
               |grad[o]|=0.389  |ddm|=0.0573  |maxRot[o]|=0.212
macro iter   2 ( 21 JK    4 micro), CASSCF E = -108.247213844736  dE = -1.62993930e-01  S^2 = 0.0000000
               |grad[o]|=0.171  |ddm|= 1.31  |maxRot[o]|= 0.31
macro iter   3 ( 21 JK    4 micro), CASSCF E = -108.678829826654  dE = -4.31615982e-01  S^2 = 0.0000000
               |grad[o]|=0.527  |ddm|=0.067  |maxRot[o]|= 0.31
macro iter   4 ( 21 JK    4 micro), CASSCF E = -108.944851670020  dE = -2.66021843e-01  S^2 = 0.0000000
               |grad[o]|=0.478  |ddm|=0.235  |maxRot[o]|=0.309
macro iter   5 ( 21 JK    4 micro), CASSCF E = -108.954985670224  dE = -1.01340002e-02  S^2 = 0.0000000
               |grad[o]|=0.131  |ddm|=0.098  |maxRot[o]|=0.161
macro iter   6 ( 21 JK    4 micro), CASSCF E = -108.968874331442  dE = -1.38886612e-02  S^2 = 0.0000000
               |grad[o]|=0.0835  |ddm|=0.0464  |maxRot[o]|=0.314
macro iter   7 ( 21 JK    4 micro), CASSCF E = -108.986829089725  dE = -1.79547583e-02  S^2 = 0.0000000
               |grad[o]|=0.0703  |ddm|=0.0219  |maxRot[o]|=0.314
macro iter   8 ( 21 JK    4 micro), CASSCF E = -108.998395580514  dE = -1.15664908e-02  S^2 = 0.0000000
               |grad[o]|=0.055  |ddm|=0.0106  |maxRot[o]|=0.265
macro iter   9 ( 21 JK    4 micro), CASSCF E = -109.002534471929  dE = -4.13889142e-03  S^2 = 0.0000000
               |grad[o]|=0.031  |ddm|=0.00807  |maxRot[o]|=0.151
macro iter  10 ( 21 JK    4 micro), CASSCF E = -109.005804040309  dE = -3.26956838e-03  S^2 = 0.0000000
               |grad[o]|=0.0215  |ddm|=0.00279  |maxRot[o]|=0.159
macro iter  11 ( 11 JK    3 micro), CASSCF E = -109.007218098148  dE = -1.41405784e-03  S^2 = 0.0000000
               |grad[o]|=0.0134  |ddm|=0.00864  |maxRot[o]|=0.129
macro iter  12 ( 14 JK    4 micro), CASSCF E = -109.007374632689  dE = -1.56534541e-04  S^2 = 0.0000000
               |grad[o]|=0.0053  |ddm|=0.00117  |maxRot[o]|=0.0495
macro iter  13 ( 10 JK    2 micro), CASSCF E = -109.007378934723  dE = -4.30203394e-06  S^2 = 0.0000000
               |grad[o]|=0.00164  |ddm|=0.000124  |maxRot[o]|=0.00822
macro iter  14 (  3 JK    1 micro), CASSCF E = -109.007378937588  dE = -2.86490831e-09  S^2 = 0.0000000
               |grad[o]|=5.98e-05  |ddm|=9.58e-15  |maxRot[o]|=0.000144
1-step CASSCF converged in  14 macro (248 JK  50 micro) steps
CASSCF canonicalization
Density matrix diagonal elements [0.88752581 1.11793779 1.94853628 0.04600012]
CASSCF energy = -109.007378937588
CASCI E = -109.007378937588  E(CI) = -6.09125031769085  S^2 = 0.0000000
hczhai commented 7 months ago

Thanks for reporting the problem.

First of all, when you do DMRGSCF calculations using block2, please carefully read and follow the examples given in the block2 documentation: https://block2.readthedocs.io/en/latest/user/dmrg-scf.html#dmrgscf-serial. In particular, the following parts of your script may need to be revised (a combined sketch is given after the list):

  1. If you are using only one node, add the following explicitly in this script:
dmrgscf.settings.BLOCKEXE = os.popen("which block2main").read().strip()
dmrgscf.settings.MPIPREFIX = ''

If you are using multiple nodes, add the following explicitly in this script:

dmrgscf.settings.BLOCKEXE = os.popen("which block2main").read().strip()
dmrgscf.settings.MPIPREFIX = 'mpirun -n <number of nodes> --bind-to none'

where <number of nodes> should be replaced by the actual number of nodes. The --bind-to none option cannot be omitted.

  2. Make sure lib.param.TMPDIR exists and is an absolute path; this should be done at the beginning of your Python script or in your bash script:
import os
from pyscf import lib

lib.param.TMPDIR = os.path.abspath(lib.param.TMPDIR)
if not os.path.exists(lib.param.TMPDIR):
    os.makedirs(lib.param.TMPDIR)

If you are running multiple jobs simultaneously, you also need to make sure each job has a distinct lib.param.TMPDIR.

  3. Set mc.fcisolver.runtimeDir and mc.fcisolver.scratchDirectory to the same folder explicitly in this script:
mc.fcisolver.runtimeDir = lib.param.TMPDIR
mc.fcisolver.scratchDirectory = lib.param.TMPDIR
  4. Set the number of threads per MPI process using mc.fcisolver.threads:
mc.fcisolver.threads = 64
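
Putting the four points together, a single-node version of your script could look roughly like the following sketch (your geometry, basis, and CAS choice are kept; adjust memory and threads to your hardware):

import os
from pyscf import gto, scf, lib, dmrgscf

# Point 2: make sure the scratch directory exists and is an absolute path.
lib.param.TMPDIR = os.path.abspath(lib.param.TMPDIR)
if not os.path.exists(lib.param.TMPDIR):
    os.makedirs(lib.param.TMPDIR)

# Point 1: serial (single-node) block2 settings.
dmrgscf.settings.BLOCKEXE = os.popen("which block2main").read().strip()
dmrgscf.settings.MPIPREFIX = ''

def do_simple_dmrg(mf, orbitals, electrons, caslist):
    mc = dmrgscf.DMRGSCF(mf, orbitals, electrons)
    mo = mc.sort_mo(caslist)
    mc.fcisolver = dmrgscf.DMRGCI(mc.mol, memory=300)
    # Point 3: run-time and scratch directories are the same folder.
    mc.fcisolver.runtimeDir = lib.param.TMPDIR
    mc.fcisolver.scratchDirectory = lib.param.TMPDIR
    # Point 4: threads per MPI process.
    mc.fcisolver.threads = 64
    mc.max_cycle = 30
    return mc.kernel(mo)[0]

x_n = 1
mol = gto.M(
    atom=[['N', (0.5 * x_n, 0.0, 0.0)], ['N', (-0.5 * x_n, 0.0, 0.0)]],
    basis='cc-pvdz', verbose=4)
mf = scf.UHF(mol).run()
emc = do_simple_dmrg(mf, 4, 4, [5, 6, 7, 8])

This sketch assumes a single node; for multiple nodes, change MPIPREFIX as described in point 1.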

If the problem persists after the above is done, you may post your revised script here; see also https://github.com/block-hczhai/block2-preview/issues/28#issuecomment-1368245655.

Note that this is most likely a problem with how subprocesses are used in pyscf/dmrgscf, not an issue in block2 itself (the error happens before block2 is actually invoked).
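
For reference, the failing step paraphrased from the traceback (not the verbatim dmrgscf code) is essentially a shell call made while the FCIDUMP is being written, before block2main is ever launched:

from subprocess import check_call

mpiprefix = "mpirun"                     # dmrgscf.settings.MPIPREFIX
scratch = "/tmp/block2_scratch_example"  # placeholder for scratchDirectory
# dmrgscf creates the scratch directory through the MPI launcher; if that
# command returns a non-zero exit status, CalledProcessError is raised here,
# matching the traceback in the report.
check_call("%s mkdir -p %s" % (mpiprefix, scratch), shell=True)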