Thanks for your interest in block2. DMRG-SC-NEVPT2 is currently not available in block2. Since many people have recently been interested in this feature, I will try to give it higher priority in my todo list. You may leave this issue open and I will keep you informed of any progress.
That would be great!
I saw the latest updates in the DMRGSCF documentation and decided to give the new NEVPT2 scripts a try (with the latest development version 0.5.1rc17 installed through pip in a "clean" conda environment with Python 3.9.15 and the latest pyscf and dmrgscf). I assume this interface is still under development, as it is not yet included in the stable version, but FYI I got the following error message while trying to run the second script (compressed_approx). (It seems somehow related to our previous discussion.)
from pyscf import gto, scf, mcscf, mrpt, dmrgscf, lib
import os
dmrgscf.settings.BLOCKEXE = os.popen("which block2main").read().strip()
dmrgscf.settings.BLOCKEXE_COMPRESS_NEVPT = os.popen("which block2main").read().strip()
dmrgscf.settings.MPIPREFIX = ''
mol = gto.M(atom='O 0 0 0; O 0 0 1.207', basis='cc-pvdz', spin=2, verbose=4)
mf = scf.RHF(mol).run(conv_tol=1E-20)
mc = mcscf.CASSCF(mf, 6, 8)
mc.fcisolver = dmrgscf.DMRGCI(mol, maxM=500, tol=1E-10)
mc.fcisolver.runtimeDir = os.path.abspath(lib.param.TMPDIR)
mc.fcisolver.scratchDirectory = os.path.abspath(lib.param.TMPDIR)
mc.fcisolver.threads = 16
mc.fcisolver.memory = int(mol.max_memory / 1000) # mem in GB
mc.fcisolver.conv_tol = 1e-14
mc.canonicalization = True
mc.natorb = True
mc.run()
sc = mrpt.NEVPT(mc).compress_approx(maxM=200).run()
Traceback (most recent call last):
File "test1.py", line 22, in <module>
mc.run()
File "/home/chaussy/.local/lib/python3.7/site-packages/pyscf/lib/misc.py", line 534, in run
self.kernel(*args)
File "/home/chaussy/.local/lib/python3.7/site-packages/pyscf/mcscf/mc1step.py", line 844, in kernel
ci0=ci0, callback=callback, verbose=self.verbose)
File "/home/chaussy/.local/lib/python3.7/site-packages/pyscf/mcscf/mc1step.py", line 352, in kernel
e_tot, e_cas, fcivec = casscf.casci(mo, ci0, eris, log, locals())
File "/home/chaussy/.local/lib/python3.7/site-packages/pyscf/mcscf/mc1step.py", line 865, in casci
envs=envs)
File "/home/chaussy/.local/lib/python3.7/site-packages/pyscf/mcscf/casci.py", line 580, in kernel
ecore=energy_core)
File "/home/chaussy/.local/lib/python3.7/site-packages/pyscf/dmrgscf/dmrgci.py", line 736, in kernel
executeBLOCK(self)
File "/home/chaussy/.local/lib/python3.7/site-packages/pyscf/dmrgscf/dmrgci.py", line 954, in executeBLOCK
raise err
File "/home/chaussy/.local/lib/python3.7/site-packages/pyscf/dmrgscf/dmrgci.py", line 949, in executeBLOCK
check_call(cmd, cwd=DMRGCI.runtimeDir, shell=True)
File "/usr/lib/python3.7/subprocess.py", line 347, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command ' /home/chaussy/anaconda3/envs/block2_new/bin/block2main dmrg.conf > dmrg.out 2>&1' returned non-zero exit status 139.
Hi @leochaussy, thanks for the information.
The error happens at line 22, which is before the NEVPT part, so I guess this is not really related to the new SC-NEVPT interface. It should be an error during the DMRGSCF calculation. It is possible that some updates between versions 0.5.0 and 0.5.1rc17 broke DMRGSCF. You can try running the same script without the last line using both versions and see whether either of them works.
Before the error message you copied here, there should be another error message printed, which may contain some important information. You may also have a look at the dmrg.out file, which can be found at lib.param.TMPDIR/dmrg.out, to see if there are any error messages.
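For example, a minimal sketch (assuming the same lib.param.TMPDIR-based runtimeDir as in the script above) to print the tail of that file from Python:

import os
from pyscf import lib

# Print the last lines of dmrg.out from the runtime directory used in the script above
out_file = os.path.join(os.path.abspath(lib.param.TMPDIR), "dmrg.out")
if os.path.isfile(out_file):
    with open(out_file) as f:
        print("".join(f.readlines()[-40:]))
else:
    print("no dmrg.out found at", out_file)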
I did miss a part which I find strange: after the Block Flags print in the output comes a segmentation fault. I don't see any difference from the previous version 0.5.0 regarding line 1 of the script. The weird part is that it starts the sweeps and then stops midway with the previous error. I attach the output here, along with the dmrg.out output, but it doesn't print much more in the way of error messages.
/bin/sh: line 1: 26304 Segmentation fault /home/chaussy/anaconda3/envs/block2_new/bin/block2main dmrg.conf > dmrg.out 2>&1
ERROR: /home/chaussy/anaconda3/envs/block2_new/bin/block2main dmrg.conf > dmrg.out 2>&1
ERROR: /home/chaussy/anaconda3/envs/block2_new/bin/block2main dmrg.conf > dmrg.out 2>&1
I also did a run with 0.5.0; it goes fine up until the NEVPT2 part, where it gives the following error (same thing in dmrg.out):
CASCI E = -149.708657771223 E(CI) = -21.7432231772816 S^2 = 2.0000000
Use compressed mps perturber as an approximation
DMRG-NEVPT
ERROR: /home/chaussy/anaconda3/envs/block2_last/bin/block2main dmrg.conf > dmrg.out 2>&1
Traceback (most recent call last):
File "/home/chaussy/anaconda3/envs/block2_last/bin/block2main", line 60, in <module>
dic = parse(fin)
File "/home/chaussy/anaconda3/envs/block2_last/lib/python3.9/site-packages/pyblock2/driver/parser.py", line 224, in parse
raise ValueError("Unrecognized keys (%s)" % diff)
ValueError: Unrecognized keys ({'restart_threepdm'})
I was assuming that at this point this option was not yet available. I could try another commit in between those two, if you have one to suggest.
Thanks for the report. So it seems that something went wrong between version 0.5.0 and version 0.5.1rc17.
I suggest that, for the newest version, you try running block2main directly with an FCIDUMP, without using the DMRGSCF interface. If this does not work, one possible reason is that the pip 0.5.0 version was built against MKL 2019, while the newest 0.5.1rc17 is based on MKL 2021.4. So when you switch block2 versions, you also need to make sure that the MKL version is correct. When you have a conda environment, conda can have its own MKL, and pip install mkl==2021.4 will create another MKL. When there are multiple different MKL versions in the same system, strange errors can happen.
Running ldd $(python -c 'import block2; print(block2.__file__)') may reveal some information on how MKL is linked. I see that you also installed pyscf under ~/.local, which is outside the conda environment. I am not sure whether you have another MKL inside ~/.local (which would be bad). Ideally, you may temporarily rename ~/.local to ~/.local-bak and try to install everything inside the conda environment to get a strictly clean environment. If in your supercomputer environment there is something like module load mkl, then you should not load any mkl modules. All of this is to make sure that block2 can see only a single, well-defined version of MKL.
Sorry, I can now reproduce the error you reported in Google colab. It should not be a problem with your environment. I will do some tests to see how I can fix it.
Thanks. Indeed, I had not noticed that even after loading the conda environment it was sometimes still using the old install in .local. I really should use absolute paths here to be safe. MKL seems OK by the way (no other one loaded, and it seems correctly linked to 2021.4), but yes, the error persists. Thanks for looking into it.
The error should be fixed in this commit: https://github.com/block-hczhai/block2-preview/commit/cfd0d968d832a5d440cb25955d4ef46815ae1d88. The problem is that I only tested the code with MPI enabled, but the error appears when block2 without MPI is used. Thanks for identifying this issue and reporting the bug. It is great that we can fix this before it is released into the stable version.
You may either:
(a) Apply the patch in commit https://github.com/block-hczhai/block2-preview/commit/cfd0d968d832a5d440cb25955d4ef46815ae1d88 to block2main, then rerun the script.
or (b) Wait roughly six hours for the 0.5.1rc17 version to be updated, then force reinstall it using:
pip3 install block2==0.5.1rc17 --upgrade --force-reinstall --no-deps --no-cache-dir --extra-index-url=https://block-hczhai.github.io/block2-preview/pypi/
Also note that mpi4py is required for compress_approx via dmrgscf, and the permissions of the file dmrgscf/nevpt_mpi.py should be correct. See the updated documentation for details.
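For example, a minimal sketch (assuming dmrgscf is importable from the active conda environment; the path is resolved from the installed package rather than hard-coded) to check and fix the executable permission:

import os, stat
from pyscf import dmrgscf

# Locate nevpt_mpi.py inside the installed dmrgscf package and add the executable bits
nevpt_mpi = os.path.join(os.path.dirname(dmrgscf.__file__), "nevpt_mpi.py")
mode = os.stat(nevpt_mpi).st_mode
os.chmod(nevpt_mpi, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
print(nevpt_mpi, "->", oct(os.stat(nevpt_mpi).st_mode & 0o777))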
Thanks, I'll go and try that. One more thing, to come back to the topic of NEVPT2: if I want to perform a calculation on several states, is the best way still to adapt what is specified in StackBlock's documentation?
mc = dmrgscf.DMRGSCF(mf, 6, 6)
weights = [.5, .25, .25]
solver1 = dmrgscf.DMRGCI(mol)
solver1.scratchDirectory = '/scratch/solver1'
solver1.nroots = 1
solver1.wfnsym = 'a1g'
solver1.spin = 2 # nelec_alpha - nelec_beta
solver2 = dmrgscf.DMRGCI(mol)
solver2.scratchDirectory = '/scratch/solver2'
solver2.nroots = 2
solver2.wfnsym = 'a1u'
mcscf.state_average_mix_(mc, [solver1, solver2], weights)
mc.kernel()
mc.fcisolver = solver1
mrpt.NEVPT(mc, root=0).compress_approx(maxM=100).run()
mc.fcisolver = solver2
mrpt.NEVPT(mc, root=1).compress_approx(maxM=100).run()
And I have a small doubt about job restarting through dmrgscf: is the best way to add the following?
mc.fcisolver.restart = True
It seems redundant with the fullrestart keyword that can be added through extraline, so I wondered if there was any difference. And the files needed are the entire MPS files plus the mps.info files, right?
If I want to perform a calculation on several states, is the best way still to adapt what is specified in StackBlock's documentation?
I think the code you copied here should be okay for block2. This is the case when you have several roots but some of them are in different symmetry sectors (such as singlet and triplet), so you need multiple fcisolvers. If all your states have the same symmetry (point group and spin), you only need one fcisolver, as sketched below. I am not sure whether this answers your question.
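As a rough sketch of that single-solver case (not from this thread; mol and mf as in the earlier O2 scripts, with the same imports, and a CASCI reference with three roots is just an example):

# One DMRGCI solver carrying all the roots, since every state shares the same irrep and spin
mc = mcscf.CASCI(mf, 6, 8)
mc.fcisolver = dmrgscf.DMRGCI(mol, maxM=500, tol=1E-10)
mc.fcisolver.nroots = 3
mc.fcisolver.weights = [1.0 / 3] * 3  # assumed attribute for equal state-average weights
mc.run()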
It seems redundant with the fullrestart keyword that can be added through extraline, so I wondered if there was any difference.
The two ways are actually different, as explained in https://github.com/pyscf/dmrgscf/issues/4. In all the scripts we discussed there is no explicit usage of "restart" (the implementation itself knows when to add mc.fcisolver.restart = True, so it is not the user's responsibility). Maybe you can tell me a concrete context/workflow in which you think the explicit restart is needed (if your goal is different from the one in https://github.com/pyscf/dmrgscf/issues/4).
And the files needed are the entire MPS files plus the mps.info files, right?
For block2, you can delete all scratch files with the filename pattern *.PART.*, while keeping the other files. Note that the *.PART.* files will be significantly larger in size than the other files, so deleting them should be sufficient if your purpose is to save storage space. In the newest version 0.5.1rc17, the *.PART.* files are automatically removed if the calculation ends normally.
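For example, a small cleanup sketch (the scratch path is a placeholder for your own mc.fcisolver.scratchDirectory):

import glob, os

scratch = "/scratch/my_dmrg_run"  # placeholder: your mc.fcisolver.scratchDirectory
removed = 0
for f in glob.glob(os.path.join(scratch, "**", "*.PART.*"), recursive=True):
    os.remove(f)
    removed += 1
print("removed", removed, "*.PART.* files")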
OK, the nevpt2 test case is running fine now. I had missed that thread on dmrgscf, which mostly covers what I needed: it was either for restarting with a bigger M, or in the eventuality of a crashed calculation. Thanks for the tips.
I'm trying to relaunch NEVPT2 following a converged multistate DMRGSCF calculation. In the case where there is only one root (as in the O2 example above), the calculation is successful, but I still run into some problems with multiple-root solvers (in the DMRG_COMPRESS_NEVPT() call).
I get this error with a minimal working example like the one attached here (still using block2, not the MPI version, with the same specs): inp.txt
WARN: Mulitple states found in CASCI solver. First state is used to compute the Fock matrix and natural orbitals in active space.
CASCI state 0 E = -107.523722450474 E(CI) = -12.0080320587088 S^2 = 0.0000000
CASCI state 1 E = -107.060380920447 E(CI) = -11.5446905286819 S^2 = 0.0000000
WARN: DMRG executable file for nevptsolver is the same to the executable file for DMRG solver. If they are both compiled by MPI compilers, they may cause error or random results in DMRG-NEVPT calculation.
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[skylake120.cluster:242261] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
ERROR: /home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/dmrgscf/nevpt_mpi.py /tmp/240890/nevpt_perturb_integral /home/lchaussy/anaconda3/envs/nevpt2/bin/block2main /tmp/240890 /tmp/240890
ERROR: /home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/dmrgscf/nevpt_mpi.py /tmp/240890/nevpt_perturb_integral /home/lchaussy/anaconda3/envs/nevpt2/bin/block2main /tmp/240890 /tmp/240890
Traceback (most recent call last):
File "/scratch/lchaussy/nevpt2/restart.py", line 60, in <module>
mrpt.NEVPT(mc, root=0).compress_approx(maxM=50).run()
File "/home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/lib/misc.py", line 534, in run
self.kernel(*args)
File "/home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/mrpt/nevpt2.py", line 781, in kernel
perturb_file = DMRG_COMPRESS_NEVPT(self, maxM=self.maxM, root=self.root,
File "/home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/dmrgscf/nevpt_mpi.py", line 250, in DMRG_COMPRESS_NEVPT
raise err
File "/home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/dmrgscf/nevpt_mpi.py", line 247, in DMRG_COMPRESS_NEVPT
subprocess.check_call(cmd, shell=True)
File "/home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/subprocess.py", line 373, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command ' /home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/dmrgscf/nevpt_mpi.py /tmp/240890/nevpt_perturb_integral /home/lchaussy/anaconda3/envs/nevpt2/bin/block2main /tmp/240890 /tmp/240890' returned non-zero exit status 1.
At first it seemed to me, in my real example, that I might have to specify the scratch folders differently in the multiple-root case, as if a wrong location for "nevpt_perturb_integral" and the "nevpt2_0" folder might be the cause of this failed command. But I haven't found a workaround, and the fact that it also fails on this minimal example in all cases, with that MPI warning, makes me think I might have missed something else.
Sorry, it seems I spoke too fast: I fixed this problem in my minimal test example. Keeping only mc.fcisolver.runtimeDir, removing mc.fcisolver.scratchDirectory (they were pointing to the same location), and falling back to the default scratch assignment seemed to solve the issue. I'm still waiting for the result on the real system, which is rather big.
Hi @leochaussy,
In the above output for the minimal test, I see that some scratch folders were pointed to /tmp/240890. This is the default in pyscf, but it is not good. I suggest you explicitly set both mc.fcisolver.runtimeDir and mc.fcisolver.scratchDirectory to the same directory (in the real scratch space such as /scratch/..., but not /tmp) for real systems. If you think this setting causes the problem, we can discuss.
If you skip one of mc.fcisolver.runtimeDir and mc.fcisolver.scratchDirectory and it ends up at the default /tmp/<process id>, this will slow down the IO and significantly affect performance. Note that (in general) on a supercomputer the /tmp directory has only very limited space, it is shared by all users, and it does not provide any high IO speed. ./tmp (or any other relative path) may be slightly better than /tmp if you have no other choice.
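A minimal sketch following this advice (the path is only an example; mc is the CASSCF/CASCI object from the earlier scripts):

import os

scratch = "/scratch/lchaussy/dmrg_run"  # example: a directory on the cluster's fast scratch
os.makedirs(scratch, exist_ok=True)
# Point both the runtime directory and the scratch directory to the same fast-scratch location
mc.fcisolver.runtimeDir = scratch
mc.fcisolver.scratchDirectory = scratch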
I see no error running your script in Google colab (with both mc.fcisolver.runtimeDir and mc.fcisolver.scratchDirectory explicitly set to ./tmp): https://colab.research.google.com/drive/17j2bwNEhwBAVQJjSLS0Pumd2pswBLQw1?usp=sharing
Yes, I tried to go a little too fast: as I was performing tests on the CLI and this error kept popping up no matter what, I left it redirected to the wrong temp directory. But my objective is indeed to use the fast scratch available on an HPC system, and that's where I keep failing. I used to set mc.fcisolver.runtimeDir and mc.fcisolver.scratchDirectory to a subfolder of the directory where I run my main script.
The previous input runs fine if I proceed that way. I first assumed it might be related to mc.fcisolver.scratchDirectory, as I had tried removing it and it was the most notable difference between my "real" script and the minimal working example that ended up being successful, but it seems I'm not there yet.
The last test keeps failing, likely at the same spot, although for some reason I'm not getting a lot of info (I'm using the same submit script, so all should be OK on that end).
Here is the error this time:
CASCI state 10 E = -1859.25648010351 E(CI) = -90.7207406698467 S^2 = 2.0000000
Use compressed mps perturber as an approximation
DMRG-NEVPT
......production of RDMs took 1766.63 sec
......reading the RDM took 87.70 sec
WARN: DMRG executable file for nevptsolver is the same to the executable file for DMRG solver. If they are both compiled by MPI compilers, they may cause error or random results in DMRG-NEVPT calculation.
Canonicalize orbitals for root 0
ERROR: /home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/dmrgscf/nevpt_mpi.py /scratchfast/lchaussy/8384274/tmp/nevpt_perturb_integral /home/lchaussy/anaconda3/envs/nevpt2/bin/block2main /scratchfast/lchaussy/8384274/tmp /scratchfast/lchaussy/8384274/tmp
This command "cmd" in nevpt2_mpi.py
(line 239) needs a dmrg_scratch and a nevpt_scratch, could it be that the two pointing at the same location is what's causing error ?
I have a nevpt2_0/nodex
folder in my runtime directory which is being written prior to the job's death at the following location 8384274/tmp/nevpt2_0/nodex
I couldn't figure if the adressing was correctly understood by block2 or not (and dmrg.out files are either finished without error or empty for the one in 8384274/tmp/nevpt2_0
On an unrelated note: when restarting a CASCI calculation, do I need to do something about this warning (which also pops up in the earlier minimal example), or is it irrelevant? I think it comes from a call to canonicalize, so if it only concerns the non-active part it should not matter, but I wasn't sure that is indeed the case.
WARN: Mulitple states found in CASCI solver. First state is used to compute the Fock matrix and natural orbitals in active space.
When you say you were "performing tests on the CLI", are you creating an interactive job on the HPC system, or are you running on the login node? If you are using an interactive job, there is an additional caveat for running the NEVPT script.
Let's say you are currently on the login node. You use some SLURM command to get into a computational node. Then, to test NEVPT, the correct way is to keep this allocated computational node open, make a note of the node name, open another terminal, ssh <node name>, and then test the NEVPT script in this new terminal.
If you submit a job for testing, there are no such subtleties.
When you see ERROR: /home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/dmrgscf/nevpt_mpi.py /scratchfast/lchaussy/8384274/tmp/nevpt_perturb_integral /home/lchaussy/anaconda3/envs/nevpt2/bin/block2main /scratchfast/lchaussy/8384274/tmp /scratchfast/lchaussy/8384274/tmp, you can manually run the command shown after the "ERROR" to see why it cannot run. When you manually run it, you may run it on the login node or on a manually ssh'd interactive node, but not directly in the interactive job.
When I was performing that test on the CLI (for the minimal example), I was using an interactive job via srun. We cannot use an ssh tunnel directly to the nodes on this particular cluster, but I'll remember that caveat for next time, thanks. (I can retry on another cluster where I can, but in the end that one is probably not suited for this type of big production run.) All the other examples I referred to were run through an sbatch script from the login node.
I tried to run the faulty command manually, this time from the login node, and was able to get more info.
Traceback (most recent call last):
File "/home/lchaussy/anaconda3/envs/nevpt2/bin/block2main", line 622, in <module>
init_memory(isize=int(memory * 0.1),
MemoryError: std::bad_alloc
Traceback (most recent call last):
File "/home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/dmrgscf/nevpt_mpi.py", line 509, in <module>
nevpt_integral_mpi(sys.argv[1],sys.argv[2],sys.argv[3],sys.argv[4])
File "/home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/dmrgscf/nevpt_mpi.py", line 306, in nevpt_integral_mpi
f = open(os.path.join(nevpt_scratch_mpi, 'node0', 'Va_%d'%root), 'r')
FileNotFoundError: [Errno 2] No such file or directory: '/scratch/lchaussy/nevpt_last/tmp/nevpt2_0/node0/Va_0'
Not sure if this is a memory error from running on the really limited login node, or whether there really is some issue with the missing file. I'll retry tomorrow on the cluster where I can run directly on the node through ssh.
Hi @leochaussy, thanks for the feedback.
We cannot use ssh tunnel directly to the nodes on this particular cluster
In this case, you can check whether you can get an interactive node using the salloc SLURM command. If you get the node from salloc instead of srun, it should be "usable" and you will not need to ssh to that node. Note that this problem is only caused by the usage of mpi4py inside dmrgscf/nevpt_mpi.py (when srun is used, it becomes nested MPI, which is not allowed). It is not a restriction of block2 (the non-MPI version) itself.
Not sure if this is a memory error from running on the really limited login node
Yes, I think this is because you have limited memory on the login node. You may copy that command, write a batch job script containing that single command, and submit that job to test (if salloc is not a valid command on your system).
I did several tests. One, through a batch job script, restarted the whole job with a large amount of memory to try to rule out this potential issue (1 TB; although the system is small, with 6 atoms, the basis set is rather large, ~300 basis functions). Both this and running only the problematic command led to the same output as before (minus the memory part). To be sure, I tried to monitor memory consumption with the limited means I know on that node (seff and top); it doesn't seem crazy on average (maybe 5-6% of the requested memory overall), although I could have missed a spike at the end. So I decided to move to where I can have a better view of what is happening.
I reran the faulty command directly on our own cluster, as you specified in the earlier post (running srun on one side, and going into the node with an ssh tunnel in another terminal to launch the script in the dedicated scratch). I can really monitor that node since it's our in-house cluster; although it is limited to 200 GB of memory, that shouldn't be an issue here. While rerunning the whole job, I didn't see any spike in memory use either (here we have graphed data on pretty much everything, so it really doesn't seem to be an issue).
Running the nevpt_mpi.py script in the failed job's directory leads to this similar output:
(nevpt2) /home/chaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/dmrgscf/nevpt_mpi.py ./tmp/nevpt_perturb_integral /home/chaussy/anaconda3/envs/nevpt2/bin/block2main ./tmp/ ./tmp/
/bin/sh: line 1: 60032 Segmentation fault /home/chaussy/anaconda3/envs/nevpt2/bin/block2main /scratch2/nevpt2/tmp/nevpt2_0/dmrg.conf > /scratch2/nevpt2/tmp/nevpt2_0/dmrg.out
Traceback (most recent call last):
File "/home/chaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/dmrgscf/nevpt_mpi.py", line 509, in <module>
nevpt_integral_mpi(sys.argv[1],sys.argv[2],sys.argv[3],sys.argv[4])
File "/home/chaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/dmrgscf/nevpt_mpi.py", line 306, in nevpt_integral_mpi
f = open(os.path.join(nevpt_scratch_mpi, 'node0', 'Va_%d'%root), 'r')
FileNotFoundError: [Errno 2] No such file or directory: '/scratch2/nevpt2/tmp/nevpt2_0/node0/Va_0'
Thanks for the detailed feedback.
From your output I can see that when you run (A), the main script, like python main_script.py, it tries to run command (B), .../nevpt_mpi.py <some parameters>, and this command fails. Then, to understand why it failed, you manually ran .../nevpt_mpi.py <some parameters>, and it shows that it tries to run command (C), /home/chaussy/anaconda3/envs/nevpt2/bin/block2main /scratch2/nevpt2/tmp/nevpt2_0/dmrg.conf > /scratch2/nevpt2/tmp/nevpt2_0/dmrg.out, and this command fails (because of a segmentation fault). The FileNotFoundError you see occurs because the Va_0 file is only created when command (C) succeeds. So the FileNotFoundError is not the core issue; the core problem now is why command (C) failed, or what happened before the segfault in (C).
In short, the chain is that (A) calls (B) and (B) calls (C). So I would like to see what happens if you manually run command (C), which is /home/chaussy/anaconda3/envs/nevpt2/bin/block2main /scratch2/nevpt2/tmp/nevpt2_0/dmrg.conf. Here you can skip the > dmrg.out part so that the output is printed on the screen. According to your output, we expect a Segmentation fault from this command, but as you know, block2main is a Python script, so I think it will at least print something first (when you manually run command (C)) and then die. This is the missing information I want to see. Thanks for your patience in debugging this.
Here is the full output of command (C). Indeed, I missed that none of my test cases had the singlet embedding option. Is it supposed to be supported with NEVPT2?
********************************** INPUT START **********************************
nelec 18
spin 2
twodot_to_onedot 4
orbitals FCIDUMP
maxiter 6
sweep_tol 1.0000e-07
outputlevel 2
hf_occ integral
nroots 11
weights 0.090909 0.090909 0.090909 0.090909 0.090909 0.090909 0.090909 0.090909 0.090909 0.090909 0.090909
num_thrds 4
fullrestart
nevpt_state_num 0
restart_mps_nevpt 20 18 200
mem 200 g
schedule Sweep 0- 3 : Mmps = 50 Noise = 0.0001 DavTol = 0.0001
Sweep 4- 5 : Mmps = 50 Noise = 0 DavTol = 1e-07
irrep 1
********************************** INPUT END **********************************
SPIN ADAPTED - REAL DOMAIN - DOUBLE PREC
qc mpo type = QCTypes.Conventional
UseMainStack = 0 MinDiskUsage = 1 MinMemUsage = 0 IBuf = 0 OBuf = 0
FPCompression: prec = 1.00e-16 chunk = 1024
IMain = 0 B / 13.0 GB DMain = 0 B / 67.1 GB ISeco = 0 B / 5.59 GB DSeco = 0 B / 101 GB
OpenMP = 1 TBB = 0 MKL = GNU 2021.0.4 SeqType = Tasked MKLIntLen = 4
THREADING = 2 layers : Global | Operator BatchedGEMM
NUMBER : Global = 4 Operator = 4 Quanta = 0 MKL = 1
COMPLEX = 1 SINGLE-PREC = 1 KSYMM = 0
loading reorder for restarting = [ 5 10 8 6 17 12 19 3 9 7 4 18 11 15 16 0 13 2 1 14]
read integral finished 25.155853416770697
integral sym error = 0
MinMPOMemUsage = False
full restart
MPS = JRRRRRRRRRRRRRRRRRRR 0 2 < N=0 S=0 PG=0 >
GS INIT MPS BOND DIMS = 1 4 15 47 100 100 100 100 100 100 100 100 100 100 100 100 100 100 100 33 11
build mpo start ...
build mpo finished ... Tread = 0.000 Twrite = 0.000 T = 0.032
simpl mpo start ...
simpl mpo finished ... Tread = 0.000 Twrite = 0.000 T = 0.149
GS MPO BOND DIMS = 25 32 43 58 77 100 127 158 193 232 193 158 127 100 77 58 43 32 25 1
build 1pdm mpo 25.52106775715947
build identity mpo 25.523135118186474
para mpo finished 25.523914009332657
----- root = 0 / 11 -----
transform ref state to singlet embedding ...
Segmentation fault
Hi @leochaussy,
Thanks for providing the complete output. Now I see where the problem is. It seems that in your previous reply where you provided the "input.txt" (https://github.com/block-hczhai/block2-preview/issues/14#issuecomment-1377507904), you added the line mc.fcisolver.extraline = [ "singlet_embedding" ] (since this file contains no geometry information, I cannot directly debug it). But in the documentation ("the minimal test"), https://block2.readthedocs.io/en/latest/user/dmrg-scf.html#dmrg-sc-nevpt2, there is no such line (although the molecule is actually a triplet O2). If you remove this extra "singlet_embedding" line, everything will work. In fact, for the "minimal test" shown in the documentation, if you add the line mc.fcisolver.extraline = [ "singlet_embedding" ], that example will fail. So this is the key (and the solution) to the problem.
Here are some more details.
(1) The segfault happens after the line transform ref state to singlet embedding. In general, the program assumes that the user did not use singlet embedding in the CASSCF calculation, but for the MPS compression we have to use singlet embedding (just for the compression step). In this step block2 automatically transforms the MPS to the singlet-embedding format; the user does not need to take care of this.
The new situation is: if the user for some reason decided to use the "singlet embedding" keyword already in the CASSCF step, then we do not need to do this MPS transform. block2 checks whether there is a singlet_embedding keyword in the dmrg.conf for this NEVPT2 step (you can find this dmrg.conf under the nevpt2_0 folder). If the singlet_embedding keyword is there, this transform step is skipped, and everything is still fine.
Now, for some reason, in the nevpt2_0/dmrg.conf file the extra keyword singlet_embedding is lost. block2 then tries to transform to SE an MPS which is already an SE MPS, and this causes the segfault. Why is the keyword lost? If you have a look at https://github.com/pyscf/dmrgscf/blob/master/pyscf/dmrgscf/nevpt_mpi.py#L203, you can see that in this implementation (note that I was not an author of pyscf/dmrgscf and we have to get used to what was already there), the NEVPT2 fcisolver only inherits mc.fcisolver.block_extra_keyword, but not mc.fcisolver.extraline. These two attributes actually have roughly the same function in most cases, but they act differently here in nevpt_mpi.py. So only keywords added through block_extra_keyword propagate to the NEVPT2 solver.
In short, if you change mc.fcisolver.extraline = [ "singlet_embedding" ] to mc.fcisolver.block_extra_keyword = [ "singlet_embedding" ], everything will work, as you expected, with singlet embedding enabled for both CASSCF and NEVPT2. So the answer to your question "is it supposed to be supported with NEVPT2?" is yes.
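In script form, the change amounts to something like the following sketch (mc built as in the earlier scripts; maxM and the root index are just the values from the first example in this thread):

from pyscf import mrpt

# Pass singlet_embedding through block_extra_keyword so it also reaches the NEVPT2 solver
mc.fcisolver.block_extra_keyword = ["singlet_embedding"]  # instead of mc.fcisolver.extraline
mc.run()
mrpt.NEVPT(mc, root=0).compress_approx(maxM=200).run()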
(2) The next question is whether "singlet_embedding" is actually needed for the CASSCF step. For a single root it is recommended to add this keyword to make the MPS variational. But if you have many roots, it is not really necessary, because a state-averaged MPS is non-variational anyway, whether or not you use singlet embedding. Of course, using singlet embedding for a state-averaged MPS will at least not make things worse.
(3) Thanks for all your efforts in investigating this issue. I will add a note in the block2 documentation about the proper way to add "singlet_embedding" for the NEVPT2 calculation.
Please let me know if you find any further problems.
Also note that when you test these keywords, remember to clean the scratch folder before every test. Otherwise, after a successful run, the file /scratch2/nevpt2/tmp/nevpt2_0/node0/Va_0 will still be there, and a subsequent wrong calculation that does not generate this file will not produce a FileNotFoundError (but it should).
Thanks for the very detailed answer and the help. This clarifies a lot of things. A simple restart using the same script with mc.fcisolver.block_extra_keyword = [ "singlet_embedding" ] on a clean scratch solves the issue!
Hello, it is probably not worth opening a new issue, so I'll continue here with a question.
I've been trying to use DMRG-SC-NEVPT2 with the explicit 4PDM and a multiple-root solver.
If I naively rework the example from the documentation to accommodate more than one root, by simply adding mc.fcisolver.nroots = 2 and assigning a given root to the mrpt.NEVPT() object, I run into some trouble.
from pyscf import gto, scf, mcscf, mrpt, dmrgscf, lib
import os
dmrgscf.settings.BLOCKEXE = os.popen("which block2main").read().strip()
dmrgscf.settings.MPIPREFIX = ''
mol = gto.M(atom='O 0 0 0; O 0 0 1.207', basis='cc-pvdz', spin=2, verbose=4)
mf = scf.RHF(mol).run(conv_tol=1E-20)
mc = mcscf.CASCI(mf, 6, 8)
mc.fcisolver = dmrgscf.DMRGCI(mol, maxM=500, tol=1E-10)
mc.fcisolver.runtimeDir = os.path.abspath(lib.param.TMPDIR)
mc.fcisolver.scratchDirectory = os.path.abspath(lib.param.TMPDIR)
mc.fcisolver.threads = 8
mc.fcisolver.nroots = 2
mc.fcisolver.memory = int(mol.max_memory / 1000) # mem in GB
mc.fcisolver.conv_tol = 1e-14
mc.natorb = True
mc.canonicalization = True
mc.run()
sc = mrpt.NEVPT(mc, root=1).run()
Here is the error:
<CAS-nat-orb|mo-hf> 5 5 1.00000000
<CAS-nat-orb|mo-hf> 6 6 0.43418070
<CAS-nat-orb|mo-hf> 6 7 0.90082580
<CAS-nat-orb|mo-hf> 7 6 0.90082580
<CAS-nat-orb|mo-hf> 7 7 -0.43418070
<CAS-nat-orb|mo-hf> 8 8 0.90535864
<CAS-nat-orb|mo-hf> 8 9 -0.42464777
<CAS-nat-orb|mo-hf> 9 8 0.42464777
<CAS-nat-orb|mo-hf> 9 9 0.90535864
<CAS-nat-orb|mo-hf> 10 10 1.00000000
Traceback (most recent call last):
File "/scratch/lchaussy/pyscf_new_cuo2nh3/nevpt2/test_regular/mwe/mwe_2/retest/nevpt2.py", line 24, in <module>
sc = mrpt.NEVPT(mc, root=0).run()
File "/home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/lib/misc.py", line 534, in run
self.kernel(*args)
File "/home/lchaussy/anaconda3/envs/nevpt2/lib/python3.9/site-packages/pyscf/mrpt/nevpt2.py", line 742, in kernel
self.ci[self.root] = single_ci_vec
TypeError: 'range' object does not support item assignment
Looking at where this failed: if I understand correctly, it should not be entering this if block, since we used mc.canonicalization = True previously (I guess this is why everything is OK in the compressed approach), but I may be doing this all wrong.
#By defaut, _mc is canonicalized for the first root.
#For SC-NEVPT based on compressed MPS perturber functions, _mc was already canonicalized.
if (not self.canonicalized):
# Need to assign roots differently if we have more than one root
# See issue #1081 (https://github.com/pyscf/pyscf/issues/1081) for more details
self.mo_coeff, single_ci_vec, self.mo_energy = self.canonicalize(
self.mo_coeff, ci=self.load_ci(), cas_natorb=True, verbose=self.verbose)
if self.fcisolver.nroots == 1:
self.ci = single_ci_vec
else:
self.ci[self.root] = single_ci_vec
One last thing, on a perhaps unrelated note: I am puzzled by the following warning issued shortly before:
WARN: Mulitple states found in CASCI solver. First state is used to compute the Fock matrix and natural orbitals in active space.
I'm not sure I understand this correctly. If this does come from the canonicalization step performed after the last CASCI, then canonical orbitals should instead be obtained by directly calling the canonicalize() function with the state-averaged 1RDM (for later use in several consecutive NEVPT2 calculations for different roots, doing this only once would be an issue and lead to inconsistencies).
Thanks for pointing out the issue. This is a known problem, and I have to say that SC-NEVPT2 with the explicit 4PDM and more than one root is currently not officially supported. The problem is not related to block2, however.
I think the problem was introduced in pyscf by this commit: https://github.com/pyscf/pyscf/pull/1090. After they changed the interface, they only tested with the FCI solver, not the DMRG solver. If you use a pyscf version prior to that commit, it may work.
Thanks for your help. Reverting to the previous if block, before commit https://github.com/pyscf/pyscf/pull/1090, does remove this problem.
if (not self.canonicalized):
self.mo_coeff, single_ci_vec, self.mo_energy = self.canonicalize(
self.mo_coeff, ci=self.load_ci(), cas_natorb=True, verbose=self.verbose)
In that case this should be OK, I guess, if I take care not to overwrite anything by launching one root at a time.
I will perform additional tests with Molpro or ORCA to see if results are consistent and keep you informed.
On another note, I am getting some very strange results with the compressed algorithm, and I was wondering if that could come from a wrong use of canonicalize (since I get the following warning):
WARN: Mulitple states found in CASCI solver. First state is used to compute the Fock matrix and natural orbitals in active space.
Differential correlation effects seem OK between the ground state and the low-lying roots, and the energies are very comparable to what I get with another method (RASPT2), but the result is way off for one root that should in principle be well defined. My first guess was that this is either an intruder-state problem due to not using the 4RDM, or a too-small bond dimension for the compressed MPS, but reading about this issue made me doubt the use of this kind of input script with only one solver. I will also try to find a relevant use case to test that, because the calculations are rather expensive in this case.
mc.fcisolver = dmrgscf.DMRGCI(mol, maxM=2000)
mc.fcisolver.runtimeDir = os.path.abspath(lib.param.TMPDIR)
mc.fcisolver.scratchDirectory = os.path.abspath(lib.param.TMPDIR)
mc.fcisolver.block_extra_keyword = [ "singlet_embedding" ]
mc.fcisolver.threads = 32
mc.fcisolver.nroots = 6
mc.fcisolver.memory = int(1200)
mc.canonicalization = True
mc.natorb = True
mc.kernel(actorbs)
mrpt.NEVPT(mc, root=0).compress_approx(maxM=500).run()
mrpt.NEVPT(mc, root=1).compress_approx(maxM=500).run()
mrpt.NEVPT(mc, root=2).compress_approx(maxM=500).run()
mrpt.NEVPT(mc, root=3).compress_approx(maxM=500).run()
mrpt.NEVPT(mc, root=4).compress_approx(maxM=500).run()
mrpt.NEVPT(mc, root=5).compress_approx(maxM=500).run()
I have updated the block2 documentation with an example of DMRG-SC-NEVPT2 with multiple states: https://block2.readthedocs.io/en/latest/user/dmrg-scf.html#dmrg-sc-nevpt2-multi-state. This script should be okay with the most recent version of pyscf. Basically, you have to circumvent the problem in the canonicalization process in pyscf by manually doing the canonicalization for each state and then setting canonicalized=True. The manual canonicalization step works for both the compression approach and the explicit 4PDM approach.
Note that you can always benchmark the results using the FCI active-space solver in pyscf. The (implicit) canonicalization process in pyscf is only broken for DMRG, not for FCI.
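For instance, a hedged sketch of such a benchmark for the O2 example above, using the default FCI active-space solver with two roots (the nroots value and root index are illustrative):

from pyscf import gto, scf, mcscf, mrpt

# Same O2 system as above, but with the default FCI solver, so the implicit
# canonicalization path in pyscf is exercised instead of the DMRG one
mol = gto.M(atom='O 0 0 0; O 0 0 1.207', basis='cc-pvdz', spin=2, verbose=4)
mf = scf.RHF(mol).run(conv_tol=1E-20)
mc = mcscf.CASCI(mf, 6, 8)
mc.fcisolver.nroots = 2
mc.run()
mrpt.NEVPT(mc, root=1).run()  # SC-NEVPT2 with the explicit 4PDM for the second root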
Thank you for this efficient workaround. It's very helpful.
I compared with both the pyscf FCI solver and Molpro; in the 4PDM case this does indeed lead to poor results for higher excited states if you don't canonicalize separately. It doesn't seem to matter in the compress_approx() case, though (and my other problem is probably not related to that). With separate canonicalization, everything agrees to well below 10e-5 Hartree!
I have another unrelated and perhaps naive question, as I'm not super familiar with NEVPT2. Nowhere in pyscf do I see the possibility of using the frozen-core approximation for MRPT. In the CASPT2 case this can be dangerous in several ways. Apart from the DMRG-SC-NEVPT2 paper (JCTC 2016, 12, 4), I didn't see this mentioned anywhere, so I'm assuming that this is by design and usually not an issue, but I was hoping to confirm that.
Hi @leochaussy, this is a good question. Unfortunately, I do not have much experience with NEVPT. Currently I am not aware of any problem with not using frozen core in NEVPT.
Hi @1234zou, thanks for adding support for block2 in your MOKIT package.
DMRG-SC-NEVPT2 has now been added in the stable version block2==0.5.1.
Compared to the development version 0.5.1rc17, the memory usage of the 3PDM is greatly reduced, which may be critical for large-active-space calculations.
You may close the issue if there are no further questions.
Thanks. I've tried a DMRG-SC-NEVPT2 calculation based on a DMRG-CASCI(16,16) reference, using block2==0.5.1. Everything is perfectly supported. Issue closed.
It seems to me that the way to use PySCF+Block-1.5 is almost the same as PySCF+pyblock2 for DMRG calculations (which I think is very convenient). I'm curious whether DMRG-SC-NEVPT2 with compressed MPS perturber functions can also be performed using PySCF+pyblock2?
Thanks a lot.