markdewing opened this issue 2 years ago
The `DerivRecords` and `HDerivRecords` intermediate storage needs to go away. Only statistical averages can be stored in memory. Apart from collecting electron configuration samples (x,y,z), no other memory should grow with the number of samples.

D'oh! I got the needs of correlated sampling wrong. The main text was updated.
More important than memory optimization, `VariableSet` is bad design. It creates a strong and largely useless coupling between the wfc (and its subelements) and the optimizer module. The number of different pair vectors is quite a code smell, and on top of that it is undocumented. Why the clients must each have their own `VariableSet` is unclear. So far I haven't found a case where it doesn't collect a bunch of useless copies of invariant information for a particular optimizer client.
This seems necessary because the optimizer has to get everything it knows about each parameter out of this flat structure. Ultimately there is a great deal of distant branching controlled by what is put in here on a parameter-by-parameter basis, which is neither easy to follow nor well structured.
If `QMCOptimizerBase` handed a method pointer to `TrialWaveFunction`, then the direction of the call to set up the state in the optimizer for each wfc would be more straightforward. Additionally, propagating the call and callback through the WFCs and their basis would allow building the correct mapping hierarchy immediately, rather than giving up on that or rebuilding it by text-processing the keys for each parameter value.
Each element under the twf that publishes parameters, and the optimization arguments for those parameters, gets a unique mapping id. An efficient number of identifiers, type maps, and optimization method tags would be stored, i.e. a significant memory reduction. More importantly, I think it would result in much clearer code that does not depend on runtime string matching.
Each mapping id should entitle the wfc (or whatever owns it) to get the address of contiguous memory to copy its next set of optimized parameters from. It should not be assumed that this is stored in one vector of `std::pair`, because that means either lots of copies or poor flexibility in the optimizer design.
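A minimal sketch of how the mapping-id idea could look (all names here — `ParamRegistry`, `publish`, `view` — are hypothetical illustrations, not existing QMCPACK APIs):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical registry: each optimizable component publishes a named
// block of parameters once and receives an integer mapping id.
class ParamRegistry
{
public:
  // Register a contiguous block of `size` parameters; returns the mapping id.
  // One identifier per block, not one string per parameter.
  int publish(const std::string& component_name, std::size_t size)
  {
    offsets_.push_back(storage_.size());
    storage_.resize(storage_.size() + size, 0.0);
    names_.push_back(component_name);
    return static_cast<int>(offsets_.size() - 1);
  }

  // After an optimizer step, the component uses its id to find the
  // contiguous memory holding its next set of optimized parameters,
  // with no runtime string matching.
  const double* view(int id, std::size_t& size) const
  {
    const std::size_t end =
        (id + 1 < static_cast<int>(offsets_.size())) ? offsets_[id + 1] : storage_.size();
    size = end - offsets_[id];
    return storage_.data() + offsets_[id];
  }

  double* data() { return storage_.data(); } // flat view for the optimizer

private:
  std::vector<double> storage_;      // one contiguous array owned by the optimizer
  std::vector<std::size_t> offsets_; // start offset of each component's block
  std::vector<std::string> names_;   // one name per block, not per parameter
};
```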
Some thoughts on a redesign of `VariableSet`.
The `insert` method should take a reference (or pointer) instead of a value, and a size (since it could refer to a group of values).
Storing a reference (or pointer) to the underlying value could result in some major changes to how the class is used.
The prototype would look something like this:

```cpp
insert(const std::string& vname, value_type& v, size_t size, bool enable = true, int type = OTHER_P);
```

Local and intermediate copies of `VariableSet` would no longer be necessary.
The calling sequence could also be simplified. The call to `checkInVariables` that accumulates all the parameters would need to stay. The call to `checkOutVariables` would no longer be necessary. The call to `resetParameters` would still be needed, because some components have to perform computation on the parameters to create the values actually used in evaluation, but it would no longer need to pass the updated values.
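A minimal sketch of this direction, assuming pointer-based entries (the names `ParamRef`, `assign`, and `VariableSetSketch` are illustrative, not the current `VariableSet` interface):

```cpp
#include <cstddef>
#include <string>
#include <vector>

using value_type = double;

// Sketch: each entry refers to parameter memory owned by the component.
struct ParamRef
{
  std::string name;  // group name, e.g. a Jastrow coefficient set
  value_type* data;  // points into the owning component's storage
  std::size_t size;  // number of values in the group
  bool enabled;
  int type;
};

class VariableSetSketch
{
public:
  // insert() stores a reference plus a size, not a copy of the values.
  void insert(const std::string& vname, value_type& v, std::size_t size,
              bool enable = true, int type = 0)
  {
    refs_.push_back({vname, &v, size, enable, type});
  }

  // The optimizer writes new values straight through the references,
  // so no checkOutVariables / copy-back step is needed.
  void assign(const std::vector<value_type>& flat)
  {
    std::size_t offset = 0;
    for (auto& r : refs_)
    {
      if (!r.enabled)
        continue;
      for (std::size_t i = 0; i < r.size; ++i)
        r.data[i] = flat[offset + i];
      offset += r.size;
    }
  }

private:
  std::vector<ParamRef> refs_;
};
```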
I'm still not sure how to handle the hierarchy. The conversion to a linear index for the optimizer could happen entirely inside `VariableSet`.
Having `VariableSet` serve both WFC-local use and TWF-global use is just not necessary. I feel bad about `checkIn`, `checkOut`, and `resetParam`. Having `checkOut` update each WFC with its offset into the global array is also unnecessary. We can just have the check-in return a vector of references to the variable-parameter data structure in each component. We can reset parameters without `resetParam`, but we may need an API to update internal state of a WFC after its variable parameters change.
I finished typing the above without reading what @markdewing wrote, but it looks like largely the same idea.
I was actually thinking of keeping a `VariableSet` stored locally in each WFC as the data structure that interacts with the opt driver. It doesn't need to copy values, just keep a list of name, pointer, and size.
If the eigenvalue problem is solved with an iterative solver (https://github.com/QMCPACK/qmcpack/issues/3788#issuecomment-2037803419), then the `fillOverlapHamiltonianMatrices` function becomes the performance bottleneck when increasing the number of parameters.
It scales as $N_s N_p^2$, where $N_s$ is the number of samples and $N_p$ is the number of parameters. It is structurally similar to a matrix multiply in terms of loops and memory accesses. (I haven't looked to see if the operations can be rearranged to be actual BLAS operations.)
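To make the structural similarity concrete, here is a sketch of the loop shape (my illustration, not the actual QMCPACK code): with $D$ the $N_s \times N_p$ matrix of per-sample parameter derivatives and $w$ the per-sample weights, the accumulation is roughly $S = D^T \, \mathrm{diag}(w) \, D$, so scaling each row of $D$ by $\sqrt{w_s}$ would in principle turn it into a single GEMM.

```cpp
#include <cstddef>
#include <vector>

// Sketch of the loop structure: accumulate S(i,j) += w[s] * D(s,i) * D(s,j)
// over samples s. This is the same triple loop as C += A^T * diag(w) * A,
// so it could in principle be handled by a BLAS dgemm on a row-scaled copy
// of D instead of hand-written loops.
void fill_overlap_sketch(std::size_t Ns, std::size_t Np,
                         const std::vector<double>& D, // Ns x Np, row-major
                         const std::vector<double>& w, // Ns weights
                         std::vector<double>& S)       // Np x Np, row-major
{
  S.assign(Np * Np, 0.0);
  for (std::size_t s = 0; s < Ns; ++s)
    for (std::size_t i = 0; i < Np; ++i)
    {
      const double wi = w[s] * D[s * Np + i];
      for (std::size_t j = 0; j < Np; ++j)
        S[i * Np + j] += wi * D[s * Np + j]; // O(Ns * Np^2) work total
    }
}
```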
The `rot_det_speedup` branch contains some prototypes for offloaded versions of `fillOverlapHamiltonianMatrices`. The function has also been extracted to qmc_kernels as the `fill_matrices` kernel.
We would like the linear optimizer to scale to a large number of parameters. Generally the more parameters there are in the optimized wavefunction, the more accurate the wavefunction. This issue records some investigation into memory and time scaling with respect to the number of parameters.
Sources of memory usage (Np = number of parameters, Ns = number of samples, Ne = number of electrons), in more detail:
1. Parameters in `VariableSet`. Some discussion in #3625.
There are often multiple copies of `VariableSet`, one at each wavefunction component level. At the top level of the optimizer there are two copies, for dealing with constraints.
I instrumented the code and ran on a hydrocarbon molecule with 1-body and 2-body Jastrows and CI coefficients. The number of CI coefficients was varied to scan the range of total variational parameters. There were 26 instantiations of `VariableSet`, most of them zero-sized or small (Jastrows). There were 4 large instantiations (CI coefficients). The space usage is about 124 bytes/parameter for an individual variable set, and 500 bytes/parameter considering all the variable sets.
(Technical aside: many of the data structures in `VariableSet` store the parameter name in a `std::string`. The parameter names were all short enough (<16 characters) to use the short-string optimization, where the string fits inside the `std::string` object itself with no additional dynamic allocation. If the names were longer than 15 characters, we would need to measure the allocated string space as well.)
2. Sample-related storage. I have run into this memory limit with small systems (fast to generate samples) and large sample counts (trying to be accurate for testing).
There are two variables in `QMCCostFunctionBatched` of size Ns x Np: `DerivRecords` and `HDerivRecords`.
The solution will be to accumulate the derivative quantities during the run. This eases the memory pressure (storage would be O(Np)). For correlated sampling, the samples still need to be saved, O(Ns*Ne), along with psi and the energy per sample, O(Ns).
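A hedged sketch of what per-sample accumulation might look like (hypothetical names; the Np x Np matrix terms would be folded in the same way, the point being that nothing of size Ns x Np survives the run):

```cpp
#include <cstddef>
#include <vector>

// Sketch: instead of storing DerivRecords (Ns x Np), fold each sample's
// derivative vector into running sums as soon as it is produced.
struct DerivAccumulator
{
  std::vector<double> sum_d;  // sum over samples of dlogpsi_i        (size Np)
  std::vector<double> sum_ed; // sum over samples of E_L * dlogpsi_i  (size Np)
  double sum_w = 0.0;         // total weight
  std::size_t nsamples = 0;

  explicit DerivAccumulator(std::size_t Np) : sum_d(Np, 0.0), sum_ed(Np, 0.0) {}

  // Called once per sample; dlogpsi can be discarded afterward, so memory
  // no longer grows with the number of samples.
  void add_sample(const std::vector<double>& dlogpsi, double local_energy, double weight)
  {
    for (std::size_t i = 0; i < dlogpsi.size(); ++i)
    {
      sum_d[i] += weight * dlogpsi[i];
      sum_ed[i] += weight * local_energy * dlogpsi[i];
    }
    sum_w += weight;
    ++nsamples;
  }
};
```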
3. Overlap and Hamiltonian matrices. In `one_shift_run` there are four matrices of size Np x Np: `ovlMat`, `hamMat`, `invMat`, and `prdMat`.
Just to make some concrete numbers for a sense of scale, assume the number of samples is 10000 and vary the number of parameters.
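As a worked example (my arithmetic, assuming 8-byte doubles): one Ns x Np array such as `DerivRecords` takes Ns * Np * 8 bytes, and each Np x Np matrix takes Np^2 * 8 bytes.

| Np      | one Ns x Np array (Ns = 10000) | one Np x Np matrix |
|---------|--------------------------------|--------------------|
| 1,000   | 80 MB                          | 8 MB               |
| 10,000  | 800 MB                         | 800 MB             |
| 100,000 | 8 GB                           | 80 GB              |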
Scaling in time
As for scaling in time, solving the generalized eigenvalue problem is O(Np^3), and will grow to dominate the time fairly quickly. I have created a kernel for this for further exploration at https://github.com/markdewing/qmc_kernels/tree/master/kernels/generalized_eigensolve
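For reference, a minimal standalone version of that O(Np^3) step, using Eigen's dense generalized eigensolver purely for illustration (Eigen is my choice here, not necessarily what QMCPACK or the kernel uses):

```cpp
#include <iostream>
#include <Eigen/Dense>

int main()
{
  const int Np = 4; // tiny stand-in for the parameter count
  // Stand-ins for the Hamiltonian and overlap matrices; the overlap is made
  // symmetric and diagonally dominant so it is well conditioned.
  Eigen::MatrixXd ham = Eigen::MatrixXd::Random(Np, Np);
  Eigen::MatrixXd ovl = Eigen::MatrixXd::Random(Np, Np);
  ovl = 0.5 * (ovl + ovl.transpose()) + Np * Eigen::MatrixXd::Identity(Np, Np);

  // Solve ham * x = lambda * ovl * x. The dense QZ algorithm used here is
  // O(Np^3) in time and O(Np^2) in memory, which is why this step comes to
  // dominate as the parameter count grows.
  Eigen::GeneralizedEigenSolver<Eigen::MatrixXd> ges(ham, ovl);
  std::cout << "eigenvalues: " << ges.eigenvalues().transpose() << "\n";
  return 0;
}
```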