**Closed** · cvjjm closed this 2 years ago
You can increase the `mem` parameter in the input file `dmrg.conf`. For example, you can set `mem 10 g`.
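For instance, a `dmrg.conf` along these lines would raise the stack memory limit to 10 GB (the other keywords shown are generic placeholders for a typical input, not taken from this issue):

```
orbitals FCIDUMP
nelec 8
spin 0
schedule default
maxM 500
mem 10 g
```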
Great! That does the trick. So I was running out of stack memory here? I didn't expect this, since the error came up during MPS optimization, and I would have thought that in this step memory usage should be more or less flat. Is there a good rule of thumb for setting `mem` as a function of bond dimension and number of orbitals? Or, asking another way: on nodes with a few hundred GB of RAM, what setting would you recommend?
> I didn't expect this since the error came up during MPS optimization and I would have thought that in this step memory usage should be more or less flat?
There is a middle-site transformation near the middle site, which consumes roughly twice the normal memory. If you want to lower the memory usage, you can set `qc_mpo_type nc` to disable the middle-site transformation, but the efficiency will also be lower; see the documentation for a detailed explanation of this keyword.
> So I was running out of stack memory here?
The best way to do CASCI-DMRG is not to use the `casci` keyword in block2. Instead, you can have a look at CASCI integral generation and generate an integral file that includes only the active orbitals. Then the number of sites in block2 will be much smaller, and both the memory and CPU performance will be much better. The result will be the same.
If you are using a very large number of MPI processors, another keyword, `simple_parallel`, will also save some memory per MPI processor (but the efficiency may be lower). This requires at least block2 p0.5.1a or the most up-to-date block2 source code.
> On nodes with a few 100GB of RAM, what setting would you recommend?
I recommend setting `mem` to 50 GB to 100 GB if the physical memory available for each MPI processor is near 100 GB. Normally there will not be any problem if this value is too large (it will only consume the minimal required amount). But if the sum of this value over all processors on one node is very large, e.g. 2 times the physical memory available on that node, there will be a problem of exhausted memory address space (but not of the memory itself), where the error message will look like `std::bad_alloc`.
Thanks! Very useful. I understand that the middle site transformation requires more memory. The above error didn't appear during the first sweep, but after a few successful sweeps had already been done, so I didn't consider that this could be the problem. But now I see that this was indeed the first sweep after the bond dimension had been increased by the schedule.
It also appears that I misinterpreted what is meant by "stack memory" here. I thought that `mem` memory would always be blocked upfront and then no longer be available for dynamically allocated memory later. Now I interpret your answer as: "It doesn't hurt to set `mem` high (close to the amount of RAM per MPI process), because it is only an upper bound on the memory." Is that correct?
I don't think I will need `qc_mpo_type nc` or `simple_parallel`, as I think the nodes should have enough memory for 2 MPI processes per node (and I even have higher-memory nodes available), and that seemed to be optimal performance-wise.
Thanks also for the comment about the `casci` option. I was actually confused that the log showed more sites than are in my CAS space. Calculation time does depend on the CAS size set via the `casci` option, but I understand from the log output and your comment that block2 does not only parametrize and solve the problem inside the specified CAS, but also does "something" outside the selected CAS.
Isn't this a bit counterintuitive? I would have expected to get the same result at roughly the same cost, irrespective of whether I restrict to a given CAS before or after the FCIDUMP creation.
What does block2 currently do, concretely, when the `casci` option is given, and is that going to change in the future?
> I thought that `mem` memory would always be blocked upfront and then no longer be available for dynamically allocated memory later. ... because it is only an upper bound on the memory". Is that correct?
Both the first point and the second point are correct (to some extent). It may be a little bit technical. In C/C++, if you create a large array using `malloc` or the `new` operator, like `double *p = new double[100000]`, the Linux operating system simply labels a large chunk of addresses (page table) as "occupied", but at this point the memory consumption is zero. Only when you write something to some portion of the addresses does it cost memory. This is called copy-on-write.
You can even create an array with a length larger than the physical memory size this way, as long as you only write data into a small part of it. But if you do `vector<double> v(100000)`, it will immediately consume memory, because `vector` writes zeros on initialization. The same applies in `numpy`: `np.empty()` does not consume memory, but `np.zeros()` consumes memory.
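A small illustration of the difference (array sizes here are arbitrary; whether the pages are actually resident is easiest to observe in `top`'s `RES` column rather than from inside Python):

```python
import numpy as np

n = 1_000_000

a = np.empty(n)   # address space is reserved; contents are uninitialized
z = np.zeros(n)   # zero-filled on creation, as described above

# Writing to an array is what makes its pages resident:
a[:] = 1.0

print(a.nbytes)   # 8000000 -- both arrays occupy n * 8 bytes of address space
```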
When you type the `top` command (on Linux) to check CPU and memory usage, there are two relevant columns: `VIRT` (virtual memory), which is the occupied memory address space (and can be larger than the physical memory size), and `RES` (resident memory), which is the amount actually used (written).
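The same effect can be reproduced directly with an anonymous memory mapping (the size below is arbitrary): in `top`, `VIRT` jumps by the full mapping size immediately, while `RES` grows only by the pages actually written.

```python
import mmap

size = 1 << 28             # reserve 256 MiB of address space
buf = mmap.mmap(-1, size)  # anonymous mapping: counts toward VIRT immediately

# Only the pages we actually write become resident (RES):
buf[:4096] = b"x" * 4096

print(len(buf))  # 268435456
```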
When you create a large array (the "stack") but only use part of it, the remaining physical memory can still be used via dynamic memory allocation.
> Isn't this a bit counterintuitive? I would have expected to get the same result at roughly the same cost irrespective of whether I do the restriction to a given CAS before or after the fcidump creation.
This is not counterintuitive, because there is one default way to do CASCI-DMRG, which is indicated in the Basic Usage page of the documentation: generating an FCIDUMP only for the CAS (using `pyscf`). The `casci` option only appears in the Keywords page of the documentation, and there are tons of other options that can decrease performance. In other words, many options in the Keywords page will give you "the same result at very different cost".
> What does block2 currently do concretely when the `casci` option is given and is that going to change in the future?
The basic idea behind this is that block2 is designed to be flexible enough for research purposes. The `casci` option is just a special case of arbitrary-order uncontracted DMRG-MRCI. Therefore, `casci` (order = 0), `mrcis` (order = 1), `mrcisd` (order = 2), `mrcisdt` (order = 3), `dmrgfci` (order = infinity), ... all share the same code. Although `casci` can be implemented alternatively using an effective `FCIDUMP`, the same trick cannot be used for MRCI. And the "efficiency" of `casci` is actually okay compared with `mrcisd`. Developers can use the `casci` option to check that the MRCI implementation is partially correct, since there are two ways to achieve the same result.
A user would expect the code to be heavily optimized and specialized, but such code would not be very convenient for developers, because to add a simple new feature one may need to deal with lots of optimization tricks. In block2 we try to alleviate this problem by making it possible to turn off almost every optimization trick (such as the "middle-site transformation"). As a result, both the slow version and the efficient version are kept in the code. We may optimize `casci` in the future, but the current slow `casci` will be kept as long as it is a good tool for developers.
Thanks very much for these very comprehensive explanations! Understood.
I am seeing the following memory error and would like to understand the background:

This is during a hybrid multi-node MPI/OMP job on nodes with 32 cores and 512 GB of RAM per node, with 2 MPI processes per node and `num_thrds 16` (this has turned out to be optimal in terms of runtime), and 16 orbitals. Slurm reports a peak memory usage of ~150 GB, which is far below the available 512 GB. I am thus wondering: am I really running out of memory here, or is this some artificial boundary that I am hitting, which can be changed by means of a configuration setting? Many thanks in advance!
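For reference, a hybrid layout like the one described (2 MPI ranks × 16 OpenMP threads per 32-core node) might be requested from Slurm roughly as sketched below; the node count, binary name, and file names are placeholders:

```bash
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2    # 2 MPI processes per node
#SBATCH --cpus-per-task=16     # 16 OpenMP threads each (matches num_thrds 16)
#SBATCH --mem=0                # request all memory on the node

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
srun block2main dmrg.conf > dmrg.out
```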