QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org

opt_crowd_size default value #3849

Open ye-luo opened 2 years ago

ye-luo commented 2 years ago

Is your feature request related to a problem? Please describe. The default is currently 1, which is a poor choice because every batch then runs with batch size 1. Preferably, if the user doesn't specify a value, align with the VMC choice.

The value also needs to be printed. I only figured out what it was by reading timer call counts.

ye-luo commented 2 years ago

This input parameter is painful to control. Compare with walkers: we don't control walkers per crowd, only walkers_per_rank.

ye-luo commented 2 years ago

walkers_per_rank=1792, 7 threads (7 crowds) in VMC opt-NiO-fcc-S4.batch1.txt

<opt stage="setup">
  <log>
   Reading configurations from h5FileRoot
  Using Nonlocal PP in Opt: no
  VMC Eavg = -1.4151e+03
  VMC Evar = 4.2034e+02
  Total weights = 1.7203e+06
  Execution time = 3.3972e+02
  </log>
</opt>

After adding opt_crowd_size=256, there is a >10x difference in Execution time: opt-NiO-fcc-S4.batch256.txt

<opt stage="setup">
  <log>
   Reading configurations from h5FileRoot
  Using Nonlocal PP in Opt: no
  VMC Eavg = -1.4151e+03
  VMC Evar = 4.2034e+02
  Total weights = 1.7203e+06
  Execution time = 3.1634e+01
  </log>
</opt>

ye-luo commented 2 years ago

On second thought: if we directly borrow the crowds from VMC, we need neither to care about the input nor to worry about additional memory usage by WFOpt.

prckent commented 2 years ago

Option 3:

Perhaps the crowd count could be an output of the drivers as well as an input. The first QMC driver executed (whatever it is) gets to pick a default or take a value from input. This value is passed out. The next driver gets this as input. If the XML for the driver's QMC section has a different value, that overrides.
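To make the hand-off in Option 3 concrete, here is a minimal sketch. `crowdCountsForDrivers` and its parameters are hypothetical names, not QMCPACK code, and it assumes that each driver passes out whatever value it actually used, so an XML override is also carried to later drivers.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical helper (not QMCPACK code). Each entry of xml_values is the crowd
// count given in one driver's QMC section, or empty if the section has none.
// Assumption: a driver passes out the value it used, overrides included.
inline std::vector<int> crowdCountsForDrivers(const std::vector<std::optional<int>>& xml_values,
                                              int first_driver_default)
{
  assert(first_driver_default > 0);
  std::vector<int> used;
  used.reserve(xml_values.size());
  int carried = first_driver_default; // what the first driver picks absent any input
  for (const auto& xml : xml_values)
  {
    carried = xml.value_or(carried); // the XML value overrides the inherited one
    used.push_back(carried);         // this value is passed out to the next driver
  }
  return used;
}
```

With this rule, a run of four drivers where only the third sets a value ends up with the third and fourth sharing that value, which may or may not be what a user expects.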

Option 4:

The application has a single central function to go to for default values of these parallelization settings. This means the logic is not replicated (or even different) on a per driver basis unless there is a very good reason. It also avoids needing to pass state around.

Discussion:

This raises the question of whether (and why) the optimizer will routinely need different crowd sizes from VMC/DMC.

We are going to need some logic to automagically pick walker counts and other physics/simulation settings as well as parallelization settings, so what we learn about crowd settings can factor into how we treat these other settings. For example, if drivers need very different values, we'll definitely need to set them automagically ourselves, since users will not be able to reasonably choose optimal values.

markdewing commented 2 years ago

The optimal number of walkers per crowd could be different for the optimizer cost function and VMC, since the operations are different. However, I suspect the VMC value will work pretty well as a default (better than the current default of 1).

The values for the optimizer parameters (opt_num_crowds, opt_crowd_size) could be set as:

  1. Read from input parameter (optional)
  2. Use VMC values if parameters not explicitly specified
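The two-step rule above could be sketched as follows. `CrowdSettings`, `OptCrowdInput` and `resolveOptCrowdSettings` are hypothetical stand-ins for illustration, not the actual QMCPACK types or functions.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for the values VMC setup settles on (not a QMCPACK type).
struct CrowdSettings
{
  int num_crowds;
  int crowd_size; // walkers per crowd
};

// Hypothetical holder for the optional opt_num_crowds / opt_crowd_size inputs.
struct OptCrowdInput
{
  std::optional<int> opt_num_crowds;
  std::optional<int> opt_crowd_size;
};

// Step 1: use the input parameter when it is present.
// Step 2: otherwise fall back to the value VMC chose.
inline CrowdSettings resolveOptCrowdSettings(const OptCrowdInput& input, const CrowdSettings& vmc)
{
  assert(vmc.num_crowds > 0 && vmc.crowd_size > 0);
  return CrowdSettings{input.opt_num_crowds.value_or(vmc.num_crowds),
                       input.opt_crowd_size.value_or(vmc.crowd_size)};
}
```

For the NiO run earlier in this thread (walkers_per_rank=1792, 7 crowds), the VMC fallback would give a crowd size of 1792 / 7 = 256, the same value that was set by hand.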

The function that computes all these parameters is QMCDriverNew::adjustGlobalWalkerCount. It is a static function. It does print warning messages, however, so it should not be called repeatedly. (The current batched linear optimizer ends up calling it twice, and I would like to change that.)

One of the plumbing problems with getting the chosen parameters from the driver (AdjustedWalkerCounts) is that they only exist in VMCBatched::process. That function calls QMCDriverNew::adjustGlobalWalkerCount, then calls QMCDriverNew::startup, and the AdjustedWalkerCounts are not saved (except as they affect settings from the call to startup).

One possible solution is to make a version of VMCBatched::process that returns AdjustedWalkerCounts. The optimizer setup code calls VMCBatched::process, so it would be an easy substitution there. That would make the information from VMC setup available to the optimizer setup.
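A rough sketch of that plumbing change, using mock types only: `WalkerCountsSketch`, `adjustCountsSketch` and `VMCDriverSketch` are invented for illustration and do not reproduce the real AdjustedWalkerCounts, adjustGlobalWalkerCount or VMCBatched interfaces. The even split of walkers over crowds is also an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mock of the adjusted counts (the real struct's fields are not reproduced here).
struct WalkerCountsSketch
{
  int walkers_per_rank;
  std::vector<int> walkers_per_crowd;
};

// Mock of the static adjustment step: split walkers over crowds as evenly as possible.
inline WalkerCountsSketch adjustCountsSketch(int walkers_per_rank, int num_crowds)
{
  assert(walkers_per_rank > 0 && num_crowds > 0);
  WalkerCountsSketch counts{walkers_per_rank,
                            std::vector<int>(num_crowds, walkers_per_rank / num_crowds)};
  for (int i = 0; i < walkers_per_rank % num_crowds; ++i)
    ++counts.walkers_per_crowd[i]; // hand out the remainder one walker at a time
  return counts;
}

class VMCDriverSketch
{
public:
  VMCDriverSketch(int walkers_per_rank, int num_crowds)
      : walkers_per_rank_(walkers_per_rank), num_crowds_(num_crowds)
  {}

  // Variant of process() that returns the counts instead of discarding them,
  // so the caller (the optimizer setup) can reuse them without a second adjustment.
  WalkerCountsSketch processAndReturnCounts()
  {
    WalkerCountsSketch counts = adjustCountsSketch(walkers_per_rank_, num_crowds_);
    startup(counts);
    return counts;
  }

  std::size_t startedCrowds() const { return started_crowds_; }

private:
  void startup(const WalkerCountsSketch& counts) { started_crowds_ = counts.walkers_per_crowd.size(); }

  int walkers_per_rank_;
  int num_crowds_;
  std::size_t started_crowds_ = 0;
};
```

Returning the counts by value keeps the adjustment to a single call per setup, which also sidesteps the repeated warning messages mentioned above.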

ye-luo commented 2 years ago

In the current implementation, the WFOpt batched driver owns its own set of walker objects (particle set, TWF, Ham), which all cost memory. They are not shared with the walker objects owned by VMC. For this reason, the number of walkers used to process samples needs to be carefully selected so as not to hit the memory wall. Thus (opt_num_crowds, opt_crowd_size) need to be independently controlled for the WFOpt driver.

However, if we just borrow walker objects from the VMC driver, then there is no redundant walker memory. In a memory-tight scenario, we don't need to wonder where to reduce the walker count to save memory: VMC, WFOpt, or both. There is only one walker count input to be adjusted. This is what the current classic driver does: there is only one set of walker objects for both VMC and WFOpt. It seems to work well.

markdewing commented 2 years ago

Does some of the extra memory come from QMCDriverNew? (It gets allocated when QMCDriverNew::startup is called.) I don't think much of that is used by the optimizer. Avoiding the call to startup might help, but that call probably sets up some values that are necessary.

ye-luo commented 2 years ago

In CostFunctionCrowdData

  // List of objects for use in flex_* calls
  UPtrVector<TrialWaveFunction> wf_ptr_list_;
  UPtrVector<ParticleSet> p_ptr_list_;
  UPtrVector<QMCHamiltonian> h_ptr_list_;
  UPtrVector<QMCHamiltonian> h0_ptr_list_;

markdewing commented 2 years ago

One solution for memory is to delete the vmcEngine after samples are generated. That seems like it would free up the memory from the crowds created by that driver. The vmcEngine gets deleted and created every iteration of the optimizer, so deleting it sooner doesn't create extra allocations or deallocations. Trying to reuse memory between Crowd and CostFunctionCrowdData seems complicated: either they need to be combined as classes, or the walkers need to get moved back and forth.

ye-luo commented 2 years ago

> Trying to reuse memory between Crowd and CostFunctionCrowdData seems complicated: either they need to be combined as classes, or the walkers need to get moved back and forth.

If we just reference the Crowd of VMC from CostFunctionCrowdData there is nothing being moved around.
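A minimal sketch of what "just reference" could look like, with invented types: `WalkerObjectsSketch`, `CrowdSketch` and `CostFunctionCrowdViewSketch` are illustrative only and do not mirror the real Crowd or CostFunctionCrowdData classes. The cost-function side holds non-owning references, so nothing is copied or moved.

```cpp
#include <cstddef>
#include <functional>
#include <memory>
#include <vector>

// Stands in for one walker's particle set, trial wavefunction and Hamiltonian.
struct WalkerObjectsSketch
{
  int id;
};

// Mock of a VMC crowd that owns its walker objects.
struct CrowdSketch
{
  std::vector<std::unique_ptr<WalkerObjectsSketch>> walkers;
};

// Mock of the cost-function crowd data: it only refers to the VMC-owned objects.
// Assumption: the crowd outlives this view, otherwise the references dangle.
class CostFunctionCrowdViewSketch
{
public:
  explicit CostFunctionCrowdViewSketch(CrowdSketch& crowd)
  {
    walkers_.reserve(crowd.walkers.size());
    for (auto& w : crowd.walkers)
      walkers_.push_back(std::ref(*w)); // non-owning reference, no allocation of walker objects
  }

  std::size_t size() const { return walkers_.size(); }
  WalkerObjectsSketch& walker(std::size_t i) const { return walkers_[i].get(); }

private:
  std::vector<std::reference_wrapper<WalkerObjectsSketch>> walkers_;
};
```

The cost of this design is a lifetime constraint: the VMC crowd has to stay alive for as long as the optimizer uses the view, which is in tension with deleting the vmcEngine right after sample generation.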

ye-luo commented 2 years ago

In addition, you also need to delete all the walker objects in CostFunctionCrowdData before creating the VMC object to keep the memory minimal.
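The ordering argument can be checked with a toy model. Everything below is hypothetical (`LiveCounter`, `WalkerSetSketch`, `peakWalkersOverIterations`); it only counts live walker objects and assumes VMC and the cost function each hold the same number of walkers.

```cpp
#include <algorithm>
#include <memory>

// Tracks how many walker objects are alive and the highest count seen.
struct LiveCounter
{
  int live = 0;
  int peak = 0;
  void add(int n)
  {
    live += n;
    peak = std::max(peak, live);
  }
  void remove(int n) { live -= n; }
};

// Stands in for a set of walker objects owned by either VMC or the cost function.
class WalkerSetSketch
{
public:
  WalkerSetSketch(LiveCounter& counter, int n) : counter_(counter), n_(n) { counter_.add(n_); }
  ~WalkerSetSketch() { counter_.remove(n_); }
  WalkerSetSketch(const WalkerSetSketch&)            = delete;
  WalkerSetSketch& operator=(const WalkerSetSketch&) = delete;

private:
  LiveCounter& counter_;
  int n_;
};

// Returns the peak number of live walker objects over the optimizer iterations.
inline int peakWalkersOverIterations(int iterations, int walkers, bool release_early)
{
  LiveCounter counter;
  std::unique_ptr<WalkerSetSketch> vmc_engine, cost_function_data;
  for (int it = 0; it < iterations; ++it)
  {
    if (release_early)
      cost_function_data.reset(); // free optimizer walkers before VMC is created
    vmc_engine.reset();           // the engine is recreated every iteration anyway
    vmc_engine = std::make_unique<WalkerSetSketch>(counter, walkers);
    // ... samples would be generated here ...
    if (release_early)
      vmc_engine.reset(); // free VMC walkers once samples exist
    cost_function_data.reset();
    cost_function_data = std::make_unique<WalkerSetSketch>(counter, walkers);
    // ... the cost function would be evaluated here ...
  }
  return counter.peak;
}
```

In this model, releasing each side before the other is created keeps the peak at one set of walkers, while leaving both alive doubles it.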