QMCPACK / qmcpack

Main repository for QMCPACK, an open-source, production-level, many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids, with full performance-portable GPU support.
http://www.qmcpack.org

Support "samples" input tag in batched drivers #4229

ye-luo opened this issue 2 years ago

ye-luo commented 2 years ago

We used to have a samples input tag in the legacy drivers. I'd like to add it to the batched drivers.

A reminder first: steps_between_samples_ and samples_per_thread_ are not supported in batched drivers.

samples counts samples across all MPI ranks. The samples input is a request to generate at least this number of samples. The actual number of generated samples, meaning all the samples in a single VMC or DMC run that contribute to scalar.dat, can end up higher than the request.

In batched drivers, we have a well-defined walker population, so the interaction is among samples, walker population, steps, and blocks. As in the legacy drivers, I consider blocks fixed by the input handling.

When the tags interact, the rules are:

  1. When both samples and steps are provided, the input requires samples <= walker population * steps * blocks.
  2. When only steps is provided, samples = walker population * steps * blocks.
  3. When only samples is provided, steps = (samples + walker population * blocks - 1) / (walker population * blocks).
  4. When neither samples nor steps is provided, we treat the run as if steps=1 were provided.

This is what I implemented in https://github.com/QMCPACK/qmcpack/pull/4224, as close as possible to the behavior of the legacy drivers.
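
For concreteness, here is a minimal standalone sketch of the derivation in case 3 above. The variable names and numbers are illustrative only, not the actual QMCPACK code:

```c++
// Illustrative sketch (not QMCPACK code): derive steps from a requested number of
// samples, a fixed global walker population, and a fixed number of blocks.
#include <cstdio>

int main()
{
  const int walker_population = 512;   // hypothetical total walkers across all MPI ranks
  const int blocks            = 100;   // fixed from the input
  const int samples           = 60000; // requested minimum number of samples

  // Ceiling division: the smallest steps such that walker_population * steps * blocks >= samples.
  const int steps = (samples + walker_population * blocks - 1) / (walker_population * blocks);

  // The actual number of generated samples can be higher than the request.
  const int generated = walker_population * steps * blocks;
  std::printf("steps = %d, generated samples = %d (requested %d)\n", steps, generated, samples);
  return 0;
}
```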

A common use case: when I run VMC/WFOpt, I'd like to request any number of compute nodes based on availability without having to adjust my input files.

I don't intend to add more automation here, but adding control of samples seems a necessary step toward later adding automation based on accuracy.

ye-luo commented 2 years ago

After writing this, I feel more comfortable making samples interact with blocks rather than steps.

lshulen commented 2 years ago

I like the goal of being able to change parallelization settings and getting at least equivalent results without having to change the input file.
I think for the case of VMC/WFOpt, one more input variable might be needed. Specifically, I worry that if we keep the samples fixed and adjust the ranks, this can change the number of VMC steps between each sample. In the case that the number of steps between samples remains greater than the decorrelation time for all parallelization choices, there is no harm. However, if it decreases below the decorrelation time, then runs with large numbers of ranks will result in a larger error bar / poorer optimization. This is perhaps a long-winded way of saying that I prefer having the option for the user to provide both samples and steps_between_samples.

ye-luo commented 2 years ago

@lshulen from my perspective, substeps are a much cheaper way to decorrelate samples, and this feature is kept in the batched VMC/WFOpt drivers. That was the motivation for removing steps_between_samples. Do you still think steps_between_samples is necessary?

lshulen commented 2 years ago

Let me make sure I understand the feature and its intent then. A user should set substeps so that all steps are decorrelated? If that works and is unaffected by any of the math concerning samples, then I think steps_between_samples is unnecessary.

Assuming again that I'm correct, we might consider changing the name of substeps to make its intent clearer. Perhaps adding the word decorrelation?

ye-luo commented 2 years ago

Here is substeps from the manual https://qmcpack.readthedocs.io/en/develop/methods.html#batched-vmc-driver-experimental

substeps For each substep, an attempt is made to move each of the electrons once only by either particle-by-particle or an all-electron move. Because the local energy is evaluated only at each full step and not each substep, substeps are computationally cheaper and can be used to reduce the correlation between property measurements at a lower cost.

@lshulen users could increase substeps to get sufficient decorrelation. There is no automation, though. substeps is hopefully well defined as in the manual. Decorrelation is one way of using substeps. Another use of substeps is fast equilibration, because we don't need those samples and their local energy evaluation can be skipped. So I think it is better to keep the current name.
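
A schematic sketch of the loop structure the manual describes, with placeholder functions rather than actual QMCPACK APIs; the point is only that electrons are moved in every substep while the local energy is evaluated once per full step:

```c++
// Schematic only: substeps move electrons without evaluating the local energy,
// which is why they are a cheap way to decorrelate samples.
#include <cstdio>

void attempt_move_all_electrons() { /* particle-by-particle or all-electron move */ }
double evaluate_local_energy() { return 0.0; /* expensive; done once per full step */ }

int main()
{
  const int blocks = 2, steps = 3, substeps = 4;
  for (int b = 0; b < blocks; ++b)
    for (int s = 0; s < steps; ++s)
    {
      for (int sub = 0; sub < substeps; ++sub)
        attempt_move_all_electrons(); // cheap: no local-energy evaluation here
      const double e_local = evaluate_local_energy(); // measured once per full step
      std::printf("block %d step %d  E_local = %g\n", b, s, e_local);
    }
  return 0;
}
```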

In my experience, changing the name of an existing input tag is extremely hard, involves a lot of work updating existing input files, and unfortunately the return is usually not very visible.

Right now, steps interacts with samples. @lshulen, would you feel better about making blocks interact with samples?

prckent commented 2 years ago

For consistency with the not-yet-discussed-or-decided tags for VMC, I suspect we'll want "total_samples" or similar and to enforce that the user not specify blocks/steps settings that conflict with it. As with future VMC inputs, where it should be possible to write "total_steps", it should also be possible to write "total_samples" (only) and have the code pick a sensible set of defaults for the steps and blocks settings.

Regarding steps_between_samples and substeps, I think it would be a good simplification to just use substeps. Is there a concrete case not covered if we do this?

lshulen commented 2 years ago

The final state I would like to get to is having a user specify the tolerance with which a given thing is known and a bare minimum about the sampling process in a way that is physically motivated. Those things would be respected and accomplished in the most efficient way possible given the resources available.

The closest we get in the current implementation to specifying the tolerance is samples (or total_samples). Increase this and your optimizer suffers less from noise, you know the given observables better, etc. The reason for bringing up the decorrelation time is that I wouldn't want increasing parallelization to change the meaning of samples in this regard (by, for instance, forcing samples too close together).

The next input would be related to the sampling. Here, knowing the decorrelation time is useful. It's also potentially not tied to the timestep, so its intent would be usable in the case of an adaptive-timestep type of algorithm. Finally, there is the issue of how blocking affects the ability to analyze the data. If you have too few blocks, then given the way we average the data, you can't assess the precision of your result.

So, my desired keywords and behavior at the moment come to this:

  1. total_samples
  2. substeps
  3. blocks (although defaulting to a usable minimum is reasonable)
  4. timestep

I think users should be discouraged from specifying steps, perhaps outside of some edge-case benchmarking type of workflow, but there I think total_samples is not really a meaningful input.

Longer term, this would head towards:

  1. sampling accuracy
  2. decorrelation time
  3. minimum blocks

In the case of DMC, the timestep would be specified, but it would be left to the algorithm for VMC.

ye-luo commented 2 years ago

I consider "samples", "total_samples" interchangeable words in this discussion. It is physically related to sampling accuracy and can be directly translated from the latter if we switch to an input tag for accuracy. So I consider it a necessary input.

We also have dedicated total population control; it is related to the actual hardware ("walkers_per_rank" or "total_walkers"). The remaining part is how to control "total_samples"/"total_walkers". It is currently broken up into "steps" (steps per block) and "blocks". The requirement is total_samples = total_walkers * blocks * steps. Any one of these parameters can be derived from the other three. However, I won't accept making total_walkers the derived one. In the legacy drivers and #4224, blocks is a required/predetermined input, and either of total_samples and steps can be derived from the other. However, I consider "steps" a direct control of the I/O frequency and prefer "blocks" to interact with "total_samples".

Technically, there is no ambiguity when two of the three ("total_samples", "blocks", "steps") are provided. The difficult case is when only one or none of the three is provided. I think as long as we don't choose overly bad default values and document the behavior, users should be responsible for the rest.

I don't consider "steps" a good way of controlling how statistics are measured. Right now it controls both the I/O frequency and the block-average granularity (scalar.dat). In the long run, I prefer a way of storing data per step rather than keeping per-block data in scalar.dat.

The "decorrelation time" can be handled by substeps in VMC and WFOpt.

lshulen commented 2 years ago

My way of looking at this had the I/O frequency and hence the averaging granularity determined by the number of blocks. You appear to prefer letting it be governed by the number of total steps. I think that is likely the right solution in the long run. However, some issues will persist while the frequency of I/O for scalar.dat is set the way it currently is.

As to your list of defaults, I would suggest the following:

I'd also encourage you to revisit the decision that users should be encouraged to specify walker_per_rank or total_walkers in the future. I see the point of this now, but it's entirely an algorithmic question related to the efficiency of hardware and should not affect the final result. The code should strive to set those itself either via sensible defaults for a given parallelization strategy or through making some measurements. My answer is a bit different for DMC because of population bias, but again it would be better to just measure / control population bias internally if possible.

prckent commented 2 years ago

Thanks Luke. I totally agree that we have to have an automatic parallelization strategy soon, and sensible defaults. The code should do something pretty sensible, and no user should have to specify anything on a per-rank / per-thread basis unless they particularly want to. It is impossible for a casual user to, e.g., achieve good GPU performance otherwise, and based on feedback received, this is a significant productivity tax today.

ye-luo commented 2 years ago

Probably I gave the wrong impression. What I meant was that walker_per_rank or total_walkers are predetermined before handling the samples input, and they should not be part of the interaction with "total_samples", "blocks", and "steps"; it is a choice of separation of concerns. Automating walker_per_rank or total_walkers is a separate topic that is better disentangled.

Regarding the case when only one of "total_samples", "blocks", or "steps" is provided: I'm wondering why we need to struggle with setting defaults. How about asking users to provide at least two and aborting otherwise?

lshulen commented 2 years ago

I agree with aborting if only blocks or steps is specified. My preference would be to allow only total_samples to be specified though.

PDoakORNL commented 1 year ago

TL;DR: steps per block is the fundamental control of global synchronization in the code. However we decide to take the input, it should never end up defaulting to steps=1 unless the user explicitly asks for that.

Detail: Each block is a significant global synchronization event in our algorithms. Global synchronization on the sort of hybrid architectures we are on is to be avoided if at all possible. The full utilization of the GPU in all but the most trivial cases requires multiple desynchronized sequences of kernel launches and transfers.

Right now DMC basically has such a sync every step, but hopefully that can be relaxed with an algorithm change. I don't believe there will be any significant gain for constant-population DMC at all if steps=1.

Steps=1 is a worst case for performance, and our design should never necessitate it. Right now, if we want to write per step, it would be much better to have an independent, permanent I/O thread, but that's a tall order as long as we're limited by OpenMP's model. What I already have in the MS_6_9 branch for estimators using per-step, per-particle values is quick writes to lockless buffers that can be written out at the end of a block when we return to a single-threaded context. Unless we want a file per rank, this seems the most sensible way. We could probably abstract the buffers so they actually write to a file handle per thread on local SSD.
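
A minimal sketch of the buffering idea described above, assuming nothing beyond standard C++ and OpenMP; this is only an illustration, not the MS_6_9 branch implementation:

```c++
// Sketch: each thread appends per-step values to its own buffer (no locking needed
// because the buffers are thread-private), and everything is flushed to a single file
// at the end of the block, back in a single-threaded context.
#include <cstdio>
#include <omp.h>
#include <vector>

int main()
{
  const int num_threads     = omp_get_max_threads();
  const int steps_per_block = 10;
  std::vector<std::vector<double>> per_thread_buffers(num_threads);

  // Parallel region: each thread records its per-step values in its private buffer.
  #pragma omp parallel
  {
    auto& buffer = per_thread_buffers[omp_get_thread_num()];
    for (int step = 0; step < steps_per_block; ++step)
      buffer.push_back(step * 0.1); // stand-in for a per-step measurement
  }

  // End of block: single-threaded context, flush all buffers to one file.
  if (std::FILE* out = std::fopen("per_step_values.dat", "w"))
  {
    for (const auto& buffer : per_thread_buffers)
      for (double value : buffer)
        std::fprintf(out, "%g\n", value);
    std::fclose(out);
  }
  return 0;
}
```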

lshulen commented 1 year ago

I think the crux of the problem is that the synchronization is tied to the writing in the current model. We need to ensure that the code always writes with enough detail and/or frequency that the data can be analyzed. I'm not fully understanding your lockless-buffers comment, but it sounds like that strategy could break the connection between write frequency and blocks (I don't mind at all if many lines are written at the same time to improve computational efficiency), at which point I would no longer care what steps is set to. Until then, I would just comment that computational efficiency is meaningless without the ability to analyze the data.