block-hczhai / block2-preview

Efficient parallel quantum chemistry DMRG in MPO formalism
GNU General Public License v3.0

Is -DLARGE_BOND=ON always needed? #81

Closed 1234zou closed 6 months ago

1234zou commented 7 months ago

Dear Huanchen,

I've performed a DMRG-FIC-MRCISD calculation for the linear H16 chain using the 3-21G basis set. This system has 0 core orbitals, 16 active orbitals, and 16 virtual orbitals, but the obtained correlation energy is zero:

CASCI E = -8.08398584715570  E(CI) = -18.1626060793063  S^2 = 0.0000000
......production of RDMs took      63.43 sec
......reading the RDM took         28.87 sec
......production of RDMs took    1157.67 sec
Reading binary 4RDM from BLOCK
......reading the RDM took          9.99 sec

HMAT basis size = 12957 thrds = 1e-10
HMAT symm error =    0.0000000000
E(MRCI) - E(ref) = 0 DC = 0
E(WickICMRCISD)   = -8.083985847155699  E_corr_ci = 0
E(WickICMRCISD+Q) = -8.083985847155699  E_corr_ci = 0

However, when I calculated a smaller system using only 2 virtual orbitals, the result seems normal:

CASCI E = -7.72022521094335  E(CI) = -17.7988454430940  S^2 = 0.0000000
......production of RDMs took      58.22 sec
......reading the RDM took         29.03 sec
......production of RDMs took    1036.11 sec
Reading binary 4RDM from BLOCK
......reading the RDM took          9.29 sec

HMAT basis size = 8689 thrds = 1e-10
HMAT symm error =    0.4580074300
E(MRCI) - E(ref) = -0.0006152311737706029 DC = -2.092545138599727e-07
E(WickICMRCISD)   = -7.720840442117125  E_corr_ci = -0.0006152311737706029
E(WickICMRCISD+Q) = -7.720840651371639  E_corr_ci = -0.0006154404282844629

My guess is that I did not add -DLARGE_BOND=ON when compiling block2, so some integers are beyond their range. Is there any criterion for determining whether -DLARGE_BOND=ON is needed for a large active space like (16,16)?

Best, Jingxiang

hczhai commented 7 months ago

Hi, thanks for reporting the problem.

For the case where you get zero correlation energy, please provide (1) the complete Python script and (2) the complete output file with at least verbose = 4 for CASCI and verbose = 5 for ICMRCISD (as is done in the documentation example).

1234zou commented 7 months ago

OK. I re-submitted the job with mol.verbose=5; the files are attached: DMRG_FIC_MRCISD_0_corr.tar.gz

hczhai commented 7 months ago

Thanks for providing the files. I tested your Python script but got different outputs for the MRCISD part. As MRCISD is implemented purely in numpy, I think this is a numpy/Python environment related problem (rather than a problem with the block2 compiling options). Part of my output is

CASCI E = -8.08398583892634  E(CI) = -18.162606071077  S^2 = 0.0000000
......production of RDMs took      59.26 sec
......reading the RDM took         23.70 sec
......production of RDMs took     803.10 sec
Reading binary 4RDM from BLOCK
......reading the RDM took        192.19 sec

diag overlap ref-ref-* size = 1
diag overlap ijrskltu* size = 0
diag overlap rsiatukp* size = 0
diag overlap ijrakltp* size = 0
diag overlap rsabtupq* size = 65536
diag overlap ijabklpq* size = 0
diag overlap irabktpq1 size = 0
diag overlap rabctpqg* size = 65536
diag overlap iabckpqg* size = 0
HMAT basis size = 98177 thrds = 1e-10
...
HMAT symm error =    0.9084697965

Notes:

  1. My HMAT basis size is different from yours. The last step requires the full diagonalization of a 98177 x 98177 matrix (in the current implementation), which requires more than 500 GB memory, so I did not get a correlation energy number (but it is very likely to be non-zero). I used Python 3.8.5 and numpy==1.24.4.
  2. When ncore == 0, the HMAT basis size number should be very close to 0.5 * nvirt ** 2 * nact ** 2 + nvirt * nact ** 3. For this case, we have 256 * 256 / 2 + 16 * 4096 = 98304 which is close to 98177 in my output but far away from 12957 in your output. The relevant code is https://github.com/block-hczhai/block2-preview/blob/master/pyblock2/icmr/icmrcisd_full.py#L193-L198. Here I provide some data (based on your input) for you to check:
# diag overlap rsabtupq* size = 65536
assert abs(np.linalg.norm(xn) - 485.73353755963177) < 1E-5
assert abs(np.linalg.norm(umats[key]) - 3300.4487602047843) < 1E-5
assert umats[key].shape == (65536, 32896)

# diag overlap rabctpqg* size = 65536
assert abs(np.linalg.norm(xn) - 558.281417562479) < 1E-5
assert abs(np.linalg.norm(umats[key]) - 53532.995978565734) < 1E-5
assert umats[key].shape == (65536, 65280)

Note that np.linalg.norm(umats[key]) may be numpy-implementation dependent and sensitive to small numerical errors, but the order of magnitude should roughly be the same. Please see if you can get good numbers using a more recent numpy version (or by performing the calculation in another environment).
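
For reference, a quick check of the basis-size estimate from note 2 (just re-doing the arithmetic above for this input):

# HMAT basis size estimate for ncore == 0 (formula from note 2)
nact, nvirt = 16, 16
print(0.5 * nvirt ** 2 * nact ** 2 + nvirt * nact ** 3)   # 98304.0 -- close to 98177, far from 12957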

1234zou commented 7 months ago

You said "which requires more than 500 GB memory", but my computational node only has around 180 GB of memory (or only 180 GB was allowed in my previous slurm script). So I changed to a very large node (with 1.4 TB of total memory allowed) and modified the Python code in icmrcisd_full.py as follows

        lib.logger.debug(ic, 'diag overlap %s size = %d', key, len(xn))
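        # Note: np.linalg.eigh diagonalizes the overlap block xn; eigenvectors
        # with eigenvalues above mrci_thrds are kept and scaled by w**(-0.5),
        # i.e. a canonical orthogonalization that drops near-linear dependencies.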
        w, v = np.linalg.eigh(xn)
        idx = w > ic.mrci_thrds
        umats[key] = v[:, idx] * (w[idx] ** (-0.5))
        ntr += umats[key].shape[1]  ### nothing changed until here
        print(np.linalg.norm(xn))
        print(np.linalg.norm(umats[key]))
        print(umats[key].shape)

Here are the input and output files of this job (again, the correlation energy is zero). Python 3.9.13 and numpy==1.21.5 are used. I guess the printed data in the output may be helpful for you. DMRG_FIC_MRCISD_large_mem.tar.gz

A calculation using a more recent numpy version is still running.

hczhai commented 7 months ago

Thanks for the feedback.

You said "which requires more than 500 GB memory", but my computational node only has around 180 GB of memory (or only 180 GB was allowed in my previous slurm script).

What I said was "The last step requires the full diagonalization of a 98177 x 98177 matrix ... which requires more than 500 GB memory". In your output, the number in "HMAT basis size = 12957" was wrong, so you did not get a 98177 x 98177 matrix with the correct shape. Let's first solve the problem of the wrong "HMAT basis size"; then we can worry about the memory requirement.

Your new output shows that the norm of the overlap matrix xn is probably correct, but umats[key] should never be all zero. This seems to be a problem in numpy. Please let me know what happens after you update the numpy version.

1234zou commented 7 months ago

I created a new virtual environment with python==3.9.18 and numpy==1.24.3. There is no numpy==1.24.4 for me to choose from in this virtual environment via conda install. The array dimensions are still wrong, but with a slight change

diag overlap ref-ref-* size = 1
1.0
1.0
(1, 1)
diag overlap ijrskltu* size = 0
0.0
0.0
(0, 0)
diag overlap rsiatukp* size = 0
0.0
0.0
(0, 0)
diag overlap ijrakltp* size = 0
0.0
0.0
(0, 0)
diag overlap rsabtupq* size = 65536
485.73658404811664
0.0
(65536, 0)
diag overlap ijabklpq* size = 0
0.0
0.0
(0, 0)
diag overlap irabktpq1 size = 0
0.0
0.0
(0, 0)
diag overlap rabctpqg* size = 65536
558.2859097711624
0.0
(65536, 12960)
diag overlap iabckpqg* size = 0
0.0
0.0
(0, 0)
HMAT basis size = 12961 thrds = 1e-10
hamil eq (ref-ref-*, ref-ref-*)

This time, the job terminated abnormally without any error message. Here are the input and output files of this job: py3_9_18_np1_24_3.tar.gz

I'll try other versions of python and numpy.

hczhai commented 7 months ago

Thanks very much for the help with the testing.

In order to simplify the checking a little bit, could you run the following Python script in your environment (any one that you have used previously) and let me know the output?

You may consider two cases:

  1. Run the following script with no change.
  2. Copy Line 1 to Line 165 (up to check_orthonormal) of your Python script for MRCISD to the beginning of the following script (before the import numpy as np, time line), then run it. (This is to test for possible problems due to import order/runtime library conflicts.)

import numpy as np, time

np.random.seed(1234)

n = 2 ** 16
mrci_thrds = 1E-10

wmat = np.random.random(n)
wmat[:n // 2] *= mrci_thrds
wmat[n // 2:] += mrci_thrds
np.random.shuffle(wmat)
pmat = np.random.random((n, n))
umat = np.linalg.qr(pmat)[0]
xmat = (umat * wmat[None]) @ umat.T

t1 = time.perf_counter()
w, v = np.linalg.eigh(xmat)

t2 = time.perf_counter()
idx = w > mrci_thrds
u = v[:, idx] * (w[idx] ** (-0.5))

t3 = time.perf_counter()

print('psum   = ', np.sum(pmat))
print('pnorm  = ', np.linalg.norm(pmat))
print('pcubic = ', np.sum(pmat ** 3))

print('xsum   = ', np.sum(xmat))
print('xnorm  = ', np.linalg.norm(xmat))
print('xcubic = ', np.sum(xmat ** 3))

print('vsum   = ', np.sum(v))
print('vnorm  = ', np.linalg.norm(v))
print('vcubic = ', np.sum(v ** 3))

print('ushape = ', u.shape)
print('usum   = ', np.sum(u))
print('unorm  = ', np.linalg.norm(u))
print('ucubic = ', np.sum(u ** 3))

print('wnorm  = ', np.linalg.norm(w))
print(w[:4], '\n', w[-4:], '\n', idx[:10], '\n', idx[-10:])

print('time = ', t2 - t1, t3 - t2)

hczhai commented 7 months ago

The above script used random matrices, which may not match the actual situation. The following script can exactly reproduce the data that you will have in DMRG-FIC-MRCISD (only for the diag overlap rsabtupq* size = 65536 part). The only required input is the 2pdm file 06-twopdm.txt (attached: 06-twopdm.txt).

Please use the following script for testing instead of the one provided in the previous reply.

import numpy as np, time

nact, nvirt, nelec = 16, 16, 16
mrci_thrds = 1E-10
deltaAA = np.eye(nact)
deltaEE = np.eye(nvirt)

E2 = np.zeros((nact, nact, nact, nact), order='C')
with open("06-twopdm.txt", "r") as f:
    E2.reshape(-1)[:] = [float(l.strip().split()[-1]) for l in f.readlines()[1:]]
E2 = np.ascontiguousarray(2.0 * E2.transpose(0, 1, 3, 2))

xmat = np.zeros((nvirt, nvirt, nact, nact, nvirt, nvirt, nact, nact))
xmat += np.einsum('ur,ts,pqba->rsabtupq', deltaEE, deltaEE, E2, optimize=True)
xmat += np.einsum('us,tr,pqab->rsabtupq', deltaEE, deltaEE, E2, optimize=True)
xmat = xmat.reshape((-1, nvirt * nvirt * nact * nact))

print('xsum   = ', np.sum(xmat))
print('xnorm  = ', np.linalg.norm(xmat))
print('xcubic = ', np.sum(xmat ** 3))

t1 = time.perf_counter()
w, v = np.linalg.eigh(xmat)

t2 = time.perf_counter()
idx = w > mrci_thrds
u = v[:, idx] * (w[idx] ** (-0.5))

t3 = time.perf_counter()

print('vsum   = ', np.sum(v))
print('vnorm  = ', np.linalg.norm(v))
print('vcubic = ', np.sum(v ** 3))

print('ushape = ', u.shape)
print('usum   = ', np.sum(u))
print('unorm  = ', np.linalg.norm(u))
print('ucubic = ', np.sum(u ** 3))

print('wnorm  = ', np.linalg.norm(w))
print(w[:4], '\n', w[-4:], '\n', idx[:10], '\n', idx[-10:])

print('time = ', t2 - t1, t3 - t2)

The correct reference output for the above:

xsum   =  74028.55313691219
xnorm  =  485.73353755963234
xcubic =  277500.4970301131
vsum   =  243.72961500908983
vnorm  =  255.99999999999997
vcubic =  9.539483696360193
ushape =  (65536, 32896)
usum   =  -5612.461844063661
unorm  =  3300.4487602465124
ucubic =  -51208.796133034106
wnorm  =  485.73353755963313
[-1.70702535e-14 -1.67275979e-14 -1.67064416e-14 -1.66885439e-14] 
 [8.13188693 8.13188693 8.13188693 8.13188693] 
 [False False False False False False False False False False] 
 [ True  True  True  True  True  True  True  True  True  True]
time =  4349.1460023853 27.50447132764384

1234zou commented 7 months ago

Here is the output using the Python script and 06-twopdm.txt you provided:

xsum   =  74028.55313691219
xnorm  =  485.7335375596325
xcubic =  277500.4970301131
vsum   =  0.0
vnorm  =  0.0
vcubic =  0.0
ushape =  (65536, 32843)
usum   =  0.0
unorm  =  0.0
ucubic =  0.0
wnorm  =  15.189672259130365
[2.34983399e-310 2.27922443e-316 2.27922443e-316 2.27922443e-316]
 [-6.76551238e-06  1.04130488e-04  1.16992676e-04 -5.20583656e-06]
 [False False False False False False  True  True False False]
 [ True False  True  True False  True False  True  True False]
time =  0.0046670830342918634 9.880773521959782

It can be seen that our results differ starting from the vsum term. This time I used python==3.9.13 and numpy==1.21.5. What else can I do to help debug?

hczhai commented 7 months ago

Thanks for the feedback. Your output shows that the data in xmat is correct, but the eigenvalues w and eigenvectors v are not computed correctly. So it is clear that the problem happens in the line w, v = np.linalg.eigh(xmat). As the script only uses numpy, we can confirm that the problem is in your numpy installation. Here are some suggestions:

  1. You may install numpy using pip rather than conda. If you want to install it offline, you can go to https://pypi.org/project/numpy/#files, download the file named numpy-1.26.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl, and then in the computational node run pip install numpy-1.26.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl to install it (after you put the whl file in the current directory).
  2. It is likely that your current numpy cannot diagonalize matrices with size 65536 x 65536 or bigger. You can decrease the number of nvirt. I guess for nvirt < 16 it will work.
  3. Since you already installed block2, you can change w, v = np.linalg.eigh(xmat) to the following to circumvent the problem in numpy.linalg.eigh.
    import block2
    w, v = np.zeros(len(xmat)), np.copy(xmat)
    block2.MatrixFunctions.eigs(v, w)
    v = v.T
  4. You can also change np.linalg.eigh to scipy.linalg.eigh and see if it can work.

If 3 or 4 works ("works" means that you get non-zero vsum, vnorm, etc.; compared to my output, only vnorm, unorm, wnorm, and ushape should match. Because of the degeneracy in eigenvalues, the other printed quantities can differ when you use a different implementation of the eigensystem solver, so they do not have to match, as long as they are non-zero), you can make the corresponding change in the DMRG-FIC-MRCISD code (note that there are two calls to np.linalg.eigh), which should solve the problem. A sketch of such a change is given below.
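
For concreteness, a minimal sketch of what such a drop-in replacement could look like (the block2 call is copied from suggestion 3 above, the scipy path is suggestion 4; the helper name safe_eigh is only for illustration, not a tested patch):

import numpy as np

def safe_eigh(xmat, use_block2=True):
    # Drop-in replacement for the np.linalg.eigh calls in icmrcisd_full.py.
    if use_block2:
        # Suggestion 3: use block2's eigensolver to bypass the problematic numpy path.
        import block2
        w, v = np.zeros(len(xmat)), np.copy(xmat)
        block2.MatrixFunctions.eigs(v, w)
        return w, v.T
    # Suggestion 4: scipy.linalg.eigh uses a different LAPACK driver by default.
    from scipy.linalg import eigh
    return eigh(xmat)

# e.g. replace "w, v = np.linalg.eigh(xn)" with "w, v = safe_eigh(xn)"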

hczhai commented 7 months ago

It seems that if you are using the Intel Distribution for Python, the problem you see is related to this: https://community.intel.com/t5/Intel-Distribution-for-Python/numpy-linalg-eigh-fails-for-large-matrices/m-p/1540256. The page shows that the last reply was on 01-25-2024, so I guess they have not had a chance to release the fix. Besides the suggestions in the above reply, you can switch to the normal Python implementation, which does not have this bug.

1234zou commented 7 months ago

I use Anaconda Python 3 (a large offline package downloaded from the Anaconda website or its mirror), and it already contains numpy, scipy, etc. Adding the following checks to the slurm script

python prt_np_version.py
which python
python --version

shows that

1.21.5
/public/home/jxzou/software/anaconda3/bin/python
Python 3.9.13

So it seems the Anaconda Python is indeed used. I'll create a new virtual environment (with the same python and scipy) and use pip install to install numpy-1.26.4 to see if there is anything different.

hebrewsnabla commented 7 months ago

I've made some tests for the 06-twopdm.txt case:

| setup | vsum |
| --- | --- |
| np 1.21 from conda | 0.0 |
| np 1.26 from conda | 0.0 |
| np 1.26 from pip | 413.2 |
| scipy instead | running |

I suspect this could be a similar issue to numpy/numpy#22590. But I can't explain why numpy from pip is different (because openblas is different from MKL?).

hczhai commented 7 months ago

So it seems the Anaconda Python is indeed used.

To check whether the Python is Intel or Anaconda, one can start the Python interpreter interactively in the terminal. If it prints the following it is Anaconda Python (the Python/GCC version and the date are not important):

Python 3.8.18 (default, Sep 11 2023, 13:40:15) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

If it prints the following it is Intel Python:

Python 3.9.16 (default, Jun 4 2021, 06:52:02)
[GCC 9.3.0] :: Intel Corporation on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

Note that even if you have Anaconda Python, you may end up with an Intel implementation of numpy (which contains the bug) when you do conda install numpy or use the default numpy (see below).

But I can't explain why numpy from pip is different (because openblas is different from MKL?).

The bug is in the Intel implementation of numpy, not in openblas or MKL. This page https://anaconda.cloud/intel-optimized-packages explains that when you do conda install numpy (the defaults channel), you are actually installing the Intel implementation of numpy. The pip package has nothing to do with Intel, so pip install numpy will be good. You may also compile numpy from source (any version) using either MKL or openblas as the backend; that will also work.
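
As a side note (my own suggestion, not something discussed above), a quick way to check which numpy build and BLAS backend an environment actually ended up with:

import numpy as np

print(np.__version__)   # e.g. 1.26.4
print(np.__file__)      # the path shows which site-packages (conda env vs pip) it came from
np.show_config()        # lists the BLAS/LAPACK libraries numpy was linked against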

hebrewsnabla commented 7 months ago

Update: using scipy from conda also gives the right result but takes 5x more time. The reason may be that scipy.linalg.eigh uses the dsyevr solver, which is more robust but slower (compared to numpy's dsyevd solver).
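
If the 5x slowdown matters, recent scipy versions (>= 1.5) allow choosing the LAPACK driver explicitly, so one could check whether the timing difference really comes from dsyevr vs dsyevd; a small sketch (my own test idea, not run on the 65536 x 65536 case):

import numpy as np
from scipy.linalg import eigh

# Small symmetric test matrix; the real xmat is the 65536 x 65536 overlap block.
rng = np.random.default_rng(0)
a = rng.random((500, 500))
xmat = a + a.T

w_evd, v_evd = eigh(xmat, driver='evd')  # divide-and-conquer (dsyevd), same family as numpy
w_evr, v_evr = eigh(xmat, driver='evr')  # RRR (dsyevr), scipy's default: more robust, slower here
print(np.allclose(w_evd, w_evr))         # eigenvalues should agree closely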

The bug is in the Intel implementation of numpy, not in openblas or MKL. This page https://anaconda.cloud/intel-optimized-packages explains that when you do conda install numpy (the defaults channel), you are actually installing the Intel implementation of numpy.

Thanks for the information. So yes, numpy from pip should be a good solution for this issue. Let's see if this solves @1234zou 's original question.

1234zou commented 7 months ago

To check whether the Python is Intel or Anaconda, one can start the Python interpreter interactively in the terminal.

I tried in the corresponding virtual environment and it is Anaconda Python

Python 3.9.13 (main, Oct 13 2022, 21:15:33)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

But as you said, conda install numpy can still give the Intel implementation. This time I used pip install to install various versions of numpy in their respective virtual environments. I find that pip install does help, but it also depends on the version of numpy. Here are the results (with the same python 3.9.13, no scipy installed):

| numpy version | vsum |
| --- | --- |
| 1.21.4 | Segmentation fault |
| 1.21.5 | Segmentation fault |
| 1.21.6 | Segmentation fault |
| 1.22.0 | 235.484 |
| 1.23.0 | 235.484 |
| 1.24.0 | 243.730 |
| 1.24.3 | 243.730 |
| 1.24.4 | 243.730 |
| 1.25.0 | 235.484 |
| 1.26.0 | 235.484 |

Here are the input and output files using numpy==1.21.6 (just for taking a look at what happened; there is no need to solve the problem under this old version of numpy): np1_21_6.zip

I do not know whether the numerical value of vsum can be different for different versions of numpy. According to @hebrewsnabla 's result, scipy is much slower, so I did not try that. The current conclusion may be: use pip install and numpy>=1.22.0.

hczhai commented 7 months ago

Thanks for the update.

I find that pip install does help, but it also depends on the version of numpy.

I can confirm that numpy==1.21.6 does not work for this problem. The reason was explained in the release notes of numpy==1.22.0: https://github.com/numpy/numpy/releases/tag/v1.22.0. At the end of the first paragraph you can see "All 64 bit wheels are also linked with 64 bit integer OpenBLAS, which should fix the occasional problems encountered by folks using truly huge arrays." This means that before version 1.22.0, the pip numpy package was built with 32-bit integer OpenBLAS, which cannot handle arrays larger than 32768 x 32768 (actually, for numpy.linalg.eigh 32-bit is already enough, but the segfault happens when you do numpy.linalg.norm on the big array).
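
A back-of-the-envelope check of that size limit (my own arithmetic, not from the release notes):

# 32-bit signed BLAS integers can index at most 2**31 - 1 elements.
n_ok, n_big = 32768, 65536
print(n_ok * n_ok, 2 ** 31 - 1)      # 1073741824 2147483647 -> still fits in 32-bit indexing
print(n_big * n_big > 2 ** 31 - 1)   # True -> 4294967296 elements overflow 32-bit indexing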

I do not know whether the numerical value of vsum can be different for different versions of numpy.

It is okay to see different vsum. Only wnorm and ushape should match exactly (up to small numerical errors). (vnorm and unorm may also match, but strictly speaking they do not have to.)

Please go ahead and redo the DMRG-FIC-MRCISD calculation and let me know whether the original problem is solved. Remember that the whole calculation will need more than 500 GB memory, because the last step requires the full diagonalization of a 98177 x 98177 matrix to get the correlation energy.
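
For a rough sense of scale (my own estimate, assuming float64 and that the eigensolver keeps several dense copies of the matrix in memory at once):

n = 98177
gib_per_copy = n * n * 8 / 1024 ** 3   # one dense float64 copy of the HMAT matrix
print(round(gib_per_copy, 1))          # ~71.8 GiB
# np.linalg.eigh holds the input, the eigenvector matrix and LAPACK workspace
# at the same time, and the MRCISD code keeps further intermediates, so a
# total of several hundred GB is plausible.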

hczhai commented 6 months ago

With the correct installation of numpy, the DMRG-FIC-MRCISD calculation with CAS(16o, 16e) should print the following results:

HMAT basis size = 98177 thrds = 1e-10
HMAT symm error =    0.8607613970
E(MRCI) - E(ref) = -0.0059030037211798 DC = -1.916902656368702e-05
E(WickICMRCISD)   = -8.089888814864775  E_corr_ci = -0.0059030037211798
E(WickICMRCISD+Q) = -8.089907983891338  E_corr_ci = -0.005922172747743488

As this does not seem to be a problem with block2, I will close this issue.