QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org

Scaling of optimizer with number of parameters #3633

Open markdewing opened 3 years ago

markdewing commented 3 years ago

We would like the linear optimizer to scale to a large number of parameters. Generally the more parameters there are in the optimized wavefunction, the more accurate the wavefunction. This issue records some investigation into memory and time scaling with respect to the number of parameters.

Sources of memory usage (Np - number of parameters, Ns - number of samples, Ne - number of electrons)

  1. Parameters and associated info in VariableSet - O(Np)
  2. Values of psi, and E for each sample - O(Ns), electron positions per sample O(NsNe), derivatives per sample O(NsNp)
  3. Overlap and Hamiltonian matrices used in solving linear method eigenproblem - O(Np * Np)

The sources of memory usage in more detail:

1. Parameters in VariableSet

Some discussion in #3625.

There are often multiple copies of VariableSet, one at each wavefunction component level. At the top level of the optimizer, there are two additional copies for dealing with constraints.

I instrumented the code, and ran on a hydrocarbon molecule with 1-body and 2-body Jastrows, and CI coefficients. The number of coefficients was varied to get the range of total variational parameters. There were 26 instantiations of VariableSet, most of them zero-sized or small (Jastrows). There were 4 large instantiations (CI coeff). The space usage is about 124 bytes/parameter for an individual variable set, and 500 bytes/parameter considering all the variable sets.

(Technical aside: Many of the data structures in VariableSet store the parameter name in a std::string. The parameter names were all short enough (<16 characters) to use the short-string optimization: the short string fits in the std::string object itself, and there is no additional dynamic allocation. If the names were longer than 15 characters, we would need to measure the allocated string space as well.)
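For reference, one quick way to check whether a given name stays inline is to test whether the string's character buffer lies within the string object itself. This is a sketch only, since the inline capacity is implementation-defined (typically 15 characters for libstdc++ and MSVC, 22 for libc++):

```cpp
#include <string>

// Rough check (implementation-defined, but holds for libstdc++, libc++
// and MSVC): a short name's character buffer lives inside the
// std::string object itself, so no separate heap allocation is made.
inline bool uses_sso(const std::string& s) {
    const char* buf = s.data();
    const char* obj = reinterpret_cast<const char*>(&s);
    return buf >= obj && buf < obj + sizeof(std::string);
}
// On common implementations, uses_sso(std::string("uu_0")) is true,
// while uses_sso(std::string(64, 'x')) is false (heap-allocated buffer).
```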

2. Sample-related storage

I have run into this memory limit with small systems (fast to generate samples) and large sample counts (trying to be accurate for testing).

There are two variables in QMCCostFunctionBatched of size Ns x Np: DerivRecords and HDerivRecords

The solution will be to accumulate the derivative quantities during the run. This eases the memory pressure (the storage would be O(Np)). For correlated sampling, the samples still need to be saved - O(Ns*Ne) - along with psi and the energy per sample - O(Ns).
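A sketch of what accumulating during the run could look like (the names here are hypothetical, not the actual QMCCostFunctionBatched API): keep O(Np) running sums per parameter instead of the full Ns x Np records.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: instead of storing DerivRecords (Ns x Np), keep
// running sums of the per-parameter quantities, so this part of memory
// stays O(Np) regardless of the number of samples.
struct DerivAccumulator {
    std::vector<double> sum_dpsi;    // sum over samples of d(log psi)/dp_i
    std::vector<double> sum_dpsi_e;  // sum of d(log psi)/dp_i * E_local
    double sum_e = 0.0;
    long nsamples = 0;

    explicit DerivAccumulator(std::size_t np)
        : sum_dpsi(np, 0.0), sum_dpsi_e(np, 0.0) {}

    // Called once per sample; dpsi has length Np.
    void accumulate(const std::vector<double>& dpsi, double e_local) {
        for (std::size_t i = 0; i < dpsi.size(); ++i) {
            sum_dpsi[i]   += dpsi[i];
            sum_dpsi_e[i] += dpsi[i] * e_local;
        }
        sum_e += e_local;
        ++nsamples;
    }

    double avg_dpsi(std::size_t i) const { return sum_dpsi[i] / nsamples; }
};
```

The Np x Np overlap/Hamiltonian accumulators of item 3 would be updated sample-by-sample in the same style (a rank-1 update per sample) rather than from stored records.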

3. Overlap and Hamiltonian matrices

In one_shift_run there are four matrices of size Np x Np: ovlMat, hamMat, invMat and prdMat.

To make some concrete numbers for a sense of scale, assume the number of samples is 10,000 and vary the number of parameters:

| Np   | VariableSet | sample-related storage | linear method matrices |
|------|-------------|------------------------|------------------------|
| 10k  | 0.005 GB    | 1.6 GB                 | 3.2 GB                 |
| 50k  | 0.025 GB    | 8.0 GB                 | 80 GB                  |
| 100k | 0.05 GB     | 16 GB                  | 320 GB                 |
| 1m   | 0.5 GB      | 160 GB                 | 32000 GB               |

Scaling in time

Solving the generalized eigenvalue problem is O(Np^3) and will quickly grow to dominate the run time. I have created a kernel for this for further exploration at https://github.com/markdewing/qmc_kernels/tree/master/kernels/generalized_eigensolve
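For illustration, the reduction that one_shift_run effectively performs - solving H c = lambda S c by forming S^-1 H (the invMat/prdMat step) - can be sketched on a hard-coded 2x2 case. The helper names are hypothetical, and a real iterative solver would use inverse or shifted iteration to target the eigenvector of interest rather than the dominant one; power iteration is used here only to exercise the reduction:

```cpp
#include <array>
#include <cmath>

using Mat2 = std::array<std::array<double, 2>, 2>;

// Explicit 2x2 inverse; for Np x Np this step is the O(Np^3) cost.
Mat2 invert2(const Mat2& m) {
    double det = m[0][0] * m[1][1] - m[0][1] * m[1][0];
    return {{{ m[1][1] / det, -m[0][1] / det},
             {-m[1][0] / det,  m[0][0] / det}}};
}

Mat2 mul2(const Mat2& a, const Mat2& b) {
    Mat2 c{};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Dominant eigenvalue of S^-1 H by power iteration (illustrative only;
// assumes the dominant eigenvalue is positive).
double dominant_eig(const Mat2& h, const Mat2& s, int iters = 200) {
    Mat2 p = mul2(invert2(s), h);
    double x0 = 1.0, x1 = 1.0, lambda = 0.0;
    for (int it = 0; it < iters; ++it) {
        double y0 = p[0][0] * x0 + p[0][1] * x1;
        double y1 = p[1][0] * x0 + p[1][1] * x1;
        lambda = std::sqrt(y0 * y0 + y1 * y1);
        x0 = y0 / lambda;
        x1 = y1 / lambda;
    }
    return lambda;
}
```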

ye-luo commented 3 years ago
  1. The DerivRecords and HDerivRecords intermediate storage needs to go away. Only statistical averages can be stored in memory. Apart from collecting electron configuration samples (x,y,z), no other memory should grow with the number of samples.
  2. This is more of an algorithmic/implementation choice. The straightforward implementation does need O(Np^2) memory, which is not surprising. This can be optimized via subspace methods, distributed computation, or gradient methods.
markdewing commented 3 years ago

D'oh! I got the memory needs of correlated sampling wrong. The main text has been updated.

PDoakORNL commented 2 years ago

More important than memory optimization: VariableSet is bad design. It creates a strong and largely useless coupling between the wavefunction components (and their subelements) and the optimizer module. The number of different pair vectors is quite a code smell, and on top of that they are not documented. Why the clients must have their own VariableSet is strange. So far I haven't found anywhere it doesn't collect a bunch of useless copies of invariant information for a particular optimizer client.

This seems necessary because the optimizer has to get everything it knows about each parameter out of this flat structure. Ultimately there is a great deal of distant branching controlled by what is put in here on a parameter-by-parameter basis, which is not easy to follow or well structured.

If QMCOptimizerBase handed a method pointer to TrialWaveFunction, then the direction of the call to set up the optimizer state for each wavefunction component would be more straightforward. Additionally, propagating the call and callback through the WFCs and their bases would allow building the correct mapping hierarchy immediately, rather than giving up on that or rebuilding it by text-processing the keys for each parameter value.

Each element under the trial wavefunction that publishes parameters, together with the optimization arguments for those parameters, would get a unique mapping id. An efficient number of identifiers, type maps, and optimization method tags would be stored, i.e. a significant memory reduction. More importantly, I think it would result in much clearer code that does not depend on runtime string matching.

The mapping id should entitle the WFC (or whatever it is) to get the address of contiguous memory to copy its next set of optimized parameters from. It should not be assumed that this is stored in one vector of std::pair, because that means either lots of copies or poor flexibility in the optimizer design.

markdewing commented 2 years ago

Some thoughts on a redesign of VariableSet.

The insert method should take a reference (or pointer) instead of a value. And a size (since it could refer to a group of values). Storing a reference (or pointer) to the underlying value could result in some major changes to how the class is used.

The prototype would look something like this: `insert(const std::string &vname, value_type& v, size_t size, bool enable=true, int type=OTHER_P);`
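A minimal sketch of how such a pointer-based VariableSet might look (the class and member names here are hypothetical, and value_type is taken to be double for simplicity):

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch of the proposed interface: record a name, a
// pointer into the component's own storage, and a group size, instead
// of copying values. Writing through the entry updates the owning
// wavefunction component directly, so no local copies are needed.
enum ParamType { OTHER_P, LOGLINEAR_P };

struct ParamEntry {
    std::string name;
    double* data;      // points at storage owned by the component
    std::size_t size;  // number of values in this group
    bool enabled;
    int type;
};

class VariableSetSketch {
public:
    void insert(const std::string& vname, double& v, std::size_t size,
                bool enable = true, int type = OTHER_P) {
        entries_.push_back({vname, &v, size, enable, type});
    }
    // Scatter a flat optimizer vector back into the components.
    void setFromFlat(const std::vector<double>& flat) {
        std::size_t off = 0;
        for (auto& e : entries_)
            for (std::size_t i = 0; i < e.size; ++i)
                e.data[i] = flat[off++];
    }
private:
    std::vector<ParamEntry> entries_;
};
```

Registering a group of CI coefficients once and then scattering the optimizer's flat update vector through the stored pointers is what makes the intermediate copies unnecessary.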

Local and intermediate copies of VariableSet would no longer be necessary.

The calling sequence could also be simplified. The call to checkInVariables that accumulates all the parameters would need to stay. The call to checkOutVariables would no longer be necessary. The call to resetParameters would still be needed because some components have to perform some computation on the parameters to create values actually used in evaluation. It would no longer need to pass the updated values.

I'm still not sure how to handle the hierarchy. The conversion to a linear index for the optimizer could happen entirely internal to VariableSet.

ye-luo commented 2 years ago

Having VariableSet serve both WFC-local use and TWF-global use is just not necessary. I feel bad about checkIn, checkOut and resetParam. checkOut updating each WFC with its offset into the global array is also unnecessary. We could just have a check that returns a vector of references to the variable-parameter data structure in each component. We can reset parameters without resetParam, but we may need an API to update internal state of the WFC after the variable parameters change.

I finished typing the above without reading what @markdewing wrote but it looks largely the same idea.

I was actually thinking of keeping VariableSet stored locally in each WFC as the data structure that interacts with the Opt driver. It doesn't need to copy values, just keep a list of name, pointer and size.

markdewing commented 7 months ago

If the eigenvalue problem is solved with an iterative solver ( https://github.com/QMCPACK/qmcpack/issues/3788#issuecomment-2037803419 ) , then the fillOverlapHamiltonianMatrices function becomes the performance bottleneck when increasing the number of parameters. It scales as $N_s N_p^2$, where $N_s$ is the number of samples and $N_p$ is the number of parameters. It is structurally similar to a matrix multiply in terms of loops and memory accesses. (I haven't looked to see if the operations can be rearranged to be actual BLAS operations)
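To illustrate the structural similarity, a stripped-down stand-in for the overlap part of the fill (hypothetical, much simplified relative to the actual fillOverlapHamiltonianMatrices) is just S = (1/Ns) D^T D, where D is the Ns x Np matrix of derivative records:

```cpp
#include <cstddef>
#include <vector>

// Naive O(Ns * Np^2) triple loop filling an overlap-like matrix from
// per-sample parameter derivatives, stored row-major as d[k*np + i].
// The loop structure is that of a matrix multiply S = (1/Ns) D^T D,
// which is why a BLAS dsyrk/dgemm formulation looks plausible.
std::vector<double> fill_overlap(const std::vector<double>& d,
                                 std::size_t ns, std::size_t np) {
    std::vector<double> s(np * np, 0.0);
    for (std::size_t k = 0; k < ns; ++k)          // samples
        for (std::size_t i = 0; i < np; ++i)      // parameter i
            for (std::size_t j = 0; j < np; ++j)  // parameter j
                s[i * np + j] += d[k * np + i] * d[k * np + j] / double(ns);
    return s;
}
```

The real function also subtracts the mean derivatives and builds the Hamiltonian matrix, but the sample loop wrapping a parameter-pair loop is the same access pattern, so the GPU-offloaded prototypes can reuse matrix-multiply tiling.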

The rot_det_speedup branch contains some prototypes for offloaded versions of fillOverlapHamiltonianMatrices.

The function has also been extracted to qmc_kernels as the fill_matrices kernel.