block-hczhai / block2-preview

Efficient parallel quantum chemistry DMRG in MPO formalism
GNU General Public License v3.0

Quick question about memory allocation issue #12

Closed cvjjm closed 2 years ago

cvjjm commented 2 years ago

I am seeing the following memory error and would like to understand the background:

Sweep =   24 | Direction =  forward | Bond dimension = 1000 | Noise =  1.00e-04 | Dav threshold =  1.00e-05
 --> Site =    0-   1 .. Mmps =    1 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 1.70e+05 Tdav = 0.00 T = 0.00
 --> Site =    1-   2 .. Mmps =    1 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 3.45e+05 Tdav = 0.00 T = 0.00
 --> Site =    2-   3 .. Mmps =    1 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 5.54e+05 Tdav = 0.00 T = 0.01
 --> Site =    3-   4 .. Mmps =    1 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 7.96e+05 Tdav = 0.00 T = 0.01
 --> Site =    4-   5 .. Mmps =    1 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 1.11e+06 Tdav = 0.00 T = 0.01
 --> Site =    5-   6 .. Mmps =    1 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 1.51e+06 Tdav = 0.00 T = 0.01
 --> Site =    6-   7 .. Mmps =    1 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 1.78e+06 Tdav = 0.00 T = 0.01
 --> Site =    7-   8 .. Mmps =    1 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 2.25e+06 Tdav = 0.00 T = 0.01
 --> Site =    8-   9 .. Mmps =    1 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 2.65e+06 Tdav = 0.00 T = 0.01
 --> Site =    9-  10 .. Mmps =    1 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 2.83e+06 Tdav = 0.00 T = 0.01
 --> Site =   10-  11 .. Mmps =    1 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 3.39e+06 Tdav = 0.00 T = 0.01
 --> Site =   11-  12 .. Mmps =    1 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 1.40e+08 Tdav = 0.00 T = 0.01
 --> Site =   12-  13 .. Mmps =    3 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 3.53e+09 Tdav = 0.00 T = 0.02
 --> Site =   13-  14 .. Mmps =   10 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 2.16e+10 Tdav = 0.01 T = 0.09
 --> Site =   14-  15 .. Mmps =   34 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 4.31e+10 Tdav = 0.03 T = 0.21
 --> Site =   15-  16 .. Mmps =  100 Ndav =   1 E =   -229.4613379355 Error = 0.00e+00 FLOPS = 6.64e+10 Tdav = 0.06 T = 0.24
 --> Site =   16-  17 .. Mmps =  299 Ndav =   1 E =   -229.4613379355 Error = 5.97e-14 FLOPS = 1.26e+11 Tdav = 0.10 T = 0.39
 --> Site =   17-  18 .. Mmps =  880 Ndav =   1 E =   -229.4613379355 Error = 1.87e-13 FLOPS = 2.20e+11 Tdav = 0.22 T = 0.66
 --> Site =   18-  19 .. Mmps = 1000 Ndav =   2 E =   -229.4613446557 Error = 4.93e-09 FLOPS = 2.92e+11 Tdav = 1.38 T = 3.60
 --> Site =   19-  20 .. Mmps = 1000 Ndav =   2 E =   -229.4614121851 Error = 2.79e-06 FLOPS = 3.58e+11 Tdav = 5.05 T = 10.21
 --> Site =   20-  21 .. Mmps = 1000 Ndav =   2 E =   -229.4614134905 Error = 3.88e-07 FLOPS = 3.48e+11 Tdav = 3.19 T = 8.22
 --> Site =   21-  22 .. Mmps = 1000 Ndav =   1 E =   -229.4614129209 Error = 2.67e-08 FLOPS = 2.26e+11 Tdav = 0.60 T = 4.20
 --> Site =   22-  23 .. Mmps = 1000 Ndav =   1 E =   -229.4614129088 Error = 5.55e-10 FLOPS = 1.31e+11 Tdav = 0.21 T = 3.63
 --> Site =   23-  24 .. Mmps =  997 Ndav =   1 E =   -229.4614129088 Error = 3.66e-13 FLOPS = 5.69e+10 Tdav = 0.19 T = 4.44
 --> Site =   24-  25 .. exceeding allowed memory (size=135000000, trying to allocate 99692)  (double)

This happens during a hybrid multi-node MPI/OpenMP job with 16 orbitals, on nodes with 32 cores and 512 GB of RAM each, running 2 MPI processes per node with num_thrds 16 (this turned out to be optimal in terms of runtime). Slurm reports a peak memory usage of ~150 GB, which is far below the available 512 GB.

I am thus wondering: am I really running out of memory here, or is this some artificial limit I am hitting that can be changed via a configuration setting? Many thanks in advance!

hczhai commented 2 years ago

You can increase the mem parameter in the input file dmrg.conf. For example, you can set mem 10 g.
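For reference, a minimal dmrg.conf sketch showing where such a line would go (everything other than mem and num_thrds is an illustrative placeholder, not taken from this issue):

sym d2h
orbitals FCIDUMP
nelec 26
spin 0
schedule default
maxM 1000
sweep_tol 1e-7
mem 10 g
num_thrds 16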

cvjjm commented 2 years ago

Great, that does the trick! So I was running out of stack memory here? I didn't expect this, since the error came up during MPS optimization, and I would have thought that memory usage should be more or less flat in this step. Is there a good rule of thumb for setting mem as a function of bond dimension and number of orbitals? Or, put another way: on nodes with a few hundred GB of RAM, what setting would you recommend?

hczhai commented 2 years ago

I didn't expect this since the error came up during MPS optimization and I would have thought that in this step memory usage should be more or less flat?

There is a middle-site transformation applied near the middle site, which consumes roughly twice the normal memory. If you want to lower the memory usage, you can set qc_mpo_type nc to disable the middle-site transformation, but the efficiency will also be lower; see the documentation for a detailed explanation of this keyword.
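For reference, that is just a single extra line in dmrg.conf (sketch):

qc_mpo_type nc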

So I was running out of stack memory here?

The best way to do CASCI-DMRG is not to use the casci keyword in block2. Instead, have a look at CASCI integral generation and generate an integral file that includes only the active orbitals. Then the number of sites in block2 will be much smaller, and both the memory usage and CPU performance will be much better. The result will be the same.
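For illustration, a pyscf-based sketch of that route (the molecule, basis, and active-space sizes below are placeholders; follow the CASCI integral generation page for the exact recipe):

from pyscf import gto, scf, mcscf
from pyscf.tools import fcidump

# Mean-field reference (illustrative system)
mol = gto.M(atom="N 0 0 0; N 0 0 1.1", basis="cc-pvdz")
mf = scf.RHF(mol).run()

# Active space: e.g. 10 electrons in 16 orbitals (illustrative)
ncas, nelecas = 16, 10
mc = mcscf.CASCI(mf, ncas, nelecas)

# Effective one- and two-electron integrals restricted to the active space
h1e, ecore = mc.get_h1eff()
g2e = mc.get_h2eff()

# Write an FCIDUMP that contains only the active orbitals,
# so block2 sees only ncas sites
fcidump.from_integrals("FCIDUMP", h1e, g2e, ncas, nelecas, nuc=ecore)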

If you are using a very large number of MPI processes, another keyword, simple_parallel, will also save some memory per MPI process (but the efficiency may be lower). This requires at least block2 p0.5.1a or the most up-to-date block2 source code.

On nodes with a few 100GB of RAM, what setting would you recommend?

I recommend setting mem to 50-100 GB if the physical memory available to each MPI process is near 100 GB. Normally there will not be any problem if this value is too large (only the minimal required amount is actually consumed). But if the sum of this value over all processes on one node is very large, say two times the physical memory available on that node, you can run into exhausted memory address space (rather than exhausted memory itself), and the error message will look like std::bad_alloc.
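As a concrete illustration with the numbers from this issue: with 512 GB per node and 2 MPI processes per node, something like mem 150 g per process requests at most ~300 GB of address space on the node, well below physical memory; mem 500 g per process would request ~1 TB, roughly twice the node's physical memory, which is the regime where the std::bad_alloc problem described above can appear.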

cvjjm commented 2 years ago

Thanks! Very useful. I understand that the middle site transformation requires more memory. The above error didn't appear during the first sweep, but after a few successful sweeps had already been done, so I didn't consider that this could be the problem. But now I see that this was indeed the first sweep after the bond dimension had been increased by the schedule.

It also appears that I misinterpreted what is meant by "stack memory" here. I thought that the mem memory would always be blocked upfront and then no longer be available for dynamically allocated memory later. Now I interpret your answer as: "It doesn't hurt to set mem high (close to the amount of RAM per MPI process) because it is only an upper bound on the memory". Is that correct?

I don't think I will need qc_mpo_type nc or simple_parallel, as the nodes should have enough memory for 2 MPI processes per node (and I even have higher-memory nodes available), and that configuration seemed to be optimal performance-wise.

Thanks also for the comment about the casci option. I was actually confused that the log showed more sites than are in my CAS space. Calculation time does depend on the CAS size set via the casci option, but I take the log output and your comment to mean that block2 does not only parametrize and solve the problem inside the specified CAS, but also does "something" outside the selected CAS.

Isn't this a bit counterintuitive? I would have expected to get the same result at roughly the same cost irrespective of whether I restrict to a given CAS before or after the fcidump creation.

What does block2 currently do, concretely, when the casci option is given, and is that going to change in the future?

hczhai commented 2 years ago

I thought that mem memory would always be blocked upfront and then no longer be available for dynamically allocated memory later. ... because it is only an upper bound on the memory". Is that correct?

Both points are correct (to some extent). It may be a little bit technical. In C/C++, if you create a large array using malloc or the new operator, like double *p = new double[100000], the Linux operating system simply marks a large chunk of address space (in the page table) as "occupied", but at this point the actual memory consumption is zero. Only when you write something to some portion of that address space does it cost memory. This is called copy-on-write.

You can even create an array whose length is larger than the physical memory size in this way, as long as you only write data into a small part of it. But if you do vector<double> v(100000), it immediately consumes memory, because vector writes zeros during initialization. The same idea applies in numpy: np.empty() does not consume memory, while np.zeros() does.

When you use the top command (on Linux) to check CPU and memory usage, there are two relevant columns: VIRT (virtual memory), which is the occupied memory address space (it can be larger than the physical memory size), and RES (resident memory), which is the amount actually used (written).

So when you create a large array (the "stack") but only use part of it, the remaining physical memory can still be used via dynamic memory allocation.
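A minimal Python sketch of this behavior (Linux-only; the array size and the resource-based RSS readout are illustrative, not part of block2):

import resource
import numpy as np

def peak_rss_mb():
    # peak resident set size of this process; Linux reports it in kB
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024

print("start:       %6.0f MB" % peak_rss_mb())

a = np.empty(200_000_000)   # reserves ~1.6 GB of address space (VIRT grows)
print("after empty: %6.0f MB" % peak_rss_mb())   # RES barely changes: pages untouched

a[:] = 1.0                  # first write faults the pages in
print("after write: %6.0f MB" % peak_rss_mb())   # RES now grows by ~1.6 GB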

Isn't this a but counter intuitive? I would have expected to get the same result at roughly the same cost irrespective of whether I do the restriction to a given CAS before or after the fcidump creation.

This is not counterintuitive, because there is one default way to do CASCI-DMRG, indicated in the Basic Usage page of the documentation, which is to generate a fcidump only for the CAS (using pyscf). The casci option only appears in the Keywords page of the documentation, and there are plenty of other options there that can decrease performance. In other words, many options in the Keywords page will give you "the same result at very different cost".

What does block2 currently do concretely when the casci option is given and is that going to change in the future?

The basic idea behind this is that block2 is designed to be flexible enough for research purposes. The casci option is just a special case of arbitrary-order uncontracted DMRG-MRCI. Therefore, casci (order = 0), mrcis (order = 1), mrcisd (order = 2), mrcisdt (order = 3), dmrgfci (order = infinity), ... all share the same code. Although casci can alternatively be implemented using an effective FCIDUMP, the same trick cannot be used for MRCI. And the "efficiency" of casci is actually okay compared with mrcisd. Developers can use the casci option to check that the MRCI implementation is partially correct, since there are two ways to achieve the same result.

A user would expect the code to be heavily optimized and specialized, but such code is not very convenient for developers, because adding a simple new feature may require dealing with lots of optimization tricks. In block2 we try to alleviate this problem by making it possible to turn off almost every optimization trick (such as the "middle site transformation"). As a result, both the slow version and the efficient version are kept in the code. We may optimize casci in the future, but the current slow casci will be kept as long as it is a good tool for developers.

cvjjm commented 2 years ago

Thanks very much for these very comprehensive explanations! Understood.