Closed 1234zou closed 7 months ago
Thanks for pointing out the issue. The slow speed you observed is caused by the Python for loops in dmrgscf/dmrgci.py (not part of block2): https://github.com/pyscf/dmrgscf/blob/master/pyscf/dmrgscf/dmrgci.py#L650-L659.
To solve this problem, you can avoid using DMRGCI.unpackE4_BLOCK by changing https://github.com/pyscf/dmrgscf/blob/master/pyscf/dmrgscf/dmrgci.py#L581 from
E4 = self.unpackE4_BLOCK(fname,norb)
to
E4 = numpy.fromfile(open(fname, 'rb'), offset=109, dtype=float).reshape((norb,) * 8).transpose(0, 1, 2, 3, 7, 6, 5, 4)
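As a sanity check, here is a minimal self-contained sketch of this vectorized read. The tiny `norb`, the placeholder 109-byte header, and the fake dump file are illustrative assumptions, not the real BLOCK file format:

```python
import os
import tempfile

import numpy as np

norb = 2  # tiny example; a real 4RDM file uses the active-space size
rdm = np.arange(norb ** 8, dtype=float).reshape((norb,) * 8)

# Write a fake dump: a 109-byte placeholder header followed by the raw
# doubles, mimicking (by assumption) the layout read by the patched line.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 109)
    f.write(rdm.tobytes())
    fname = f.name

# Single vectorized read: skip the header, reshape to an 8-index tensor,
# and reverse the order of the last four indices, as in the patch above.
E4 = np.fromfile(fname, offset=109, dtype=float).reshape((norb,) * 8)
E4 = E4.transpose(0, 1, 2, 3, 7, 6, 5, 4)
os.unlink(fname)
```

This replaces the per-element Python loop with one `numpy.fromfile` call, which is why the reading time drops so sharply.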
Great! I've tested a (16,16) job but with only 2 virtual orbitals, here are the time results:
Before changing to numpy.fromfile:
......production of RDMs took 61.00 sec
......reading the RDM took 26.78 sec
......production of RDMs took 1149.12 sec
Reading binary 4RDM from BLOCK
WARN: AT LEAST, NO MORE bytes TO READ!
......reading the RDM took 2104.67 sec
HMAT basis size = 8689 thrds = 1e-10
HMAT symm error = 0.4239904438
E(MRCI) - E(ref) = -0.0006146624653027288 DC = -2.088212758789092e-07
E(WickICMRCISD) = -7.720839854170862 E_corr_ci = -0.0006146624653027288
E(WickICMRCISD+Q) = -7.720840062992138 E_corr_ci = -0.0006148712865786078
After changing to numpy.fromfile:
CASCI E = -7.72022521094335 E(CI) = -17.7988454430940 S^2 = 0.0000000
......production of RDMs took 58.22 sec
......reading the RDM took 29.03 sec
......production of RDMs took 1036.11 sec
Reading binary 4RDM from BLOCK
......reading the RDM took 9.29 sec
HMAT basis size = 8689 thrds = 1e-10
HMAT symm error = 0.4580074300
E(MRCI) - E(ref) = -0.0006152311737706029 DC = -2.092545138599727e-07
E(WickICMRCISD) = -7.720840442117125 E_corr_ci = -0.0006152311737706029
E(WickICMRCISD+Q) = -7.720840651371639 E_corr_ci = -0.0006154404282844629
It does save much time. However, the energy difference between the two jobs is 5.9e-7 a.u. Is this reasonable?
By the way, have you considered making a pull request to pyscf/dmrgscf for this modification? This is a small but important change.
> However, the energy difference between the two jobs is 5.9e-7 a.u. Is this reasonable?
This is more likely caused by differences in the HF/CASCI step, or a loose DMRG convergence threshold. In fact, the CASCI energies in your two runs differ by 2E-8.
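For reference, the quoted 5.9e-7 a.u. figure can be reproduced directly from the two E(WickICMRCISD) values printed above:

```python
# E(WickICMRCISD) from the two runs, copied from the logs above
e_before = -7.720839854170862  # before the numpy.fromfile change
e_after = -7.720840442117125   # after the numpy.fromfile change

diff = abs(e_after - e_before)
print(f"{diff:.2e}")  # rounds to the quoted ~5.9e-07 a.u.
```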
> By the way, have you considered making a pull request to pyscf/dmrgscf for this modification?
Your help is greatly appreciated. Problem solved.
Hi, Huanchen (and other block2 developers),
I'm wondering: what is an efficient way to run a DMRG-FIC-MRCISD job? I'm running a (16,16) job on a linear H16 chain with the 3-21G basis set. Here is part of the input
Here OpenMP parallelism is used. After
Reading binary 4RDM from BLOCK
is printed, the program runs on only 1 CPU. Is the code for this step not properly OpenMP-parallelized? Or should I switch to MPI parallelism for DMRG-FIC-MRCISD? I note there is a remark in the online block2 documentation
I'm just curious whether, in the current situation, there is any keyword to make the computation more efficient. Thanks a lot.
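For completeness, the standard OpenMP environment variables (generic OpenMP controls, not block2-specific keywords; the values below are placeholders) are the usual way to set the thread count for the OpenMP-parallelized steps:

```shell
# Standard OpenMP environment variables (values are placeholders):
export OMP_NUM_THREADS=16   # number of OpenMP threads per process
export OMP_STACKSIZE=512M   # larger per-thread stacks for big tensors
echo "$OMP_NUM_THREADS"
```

Note these only affect steps that are actually OpenMP-parallelized; they cannot speed up a pure-Python loop such as the original RDM-unpacking code.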