QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org

Rapid population explosion in batched GPU DMC on Summit #3082

Closed. jtkrogel closed this 2 years ago.

jtkrogel commented 3 years ago

Describe the bug

In a select few cases, the batched DMC drivers crash hard on Summit. The observed cases constitute 5 out of 2100 runs performed with VMC and DMC drivers. The same cases complete successfully with the cpu version of the batched drivers on CADES (all other 2095 cases also complete successfully there). All 5 cases run rapidly on a single Summit node (6 mpi tasks per node, 7 threads per task). The failures are reproducible and all inputs use some version of T-moves.

The cases at hand are summarized by the following keys:

diamond_s14_tg231_sp1_comp_J3_ob_dmc_w66_tm0_batched
diamond_s14_tg111_sp1_real_J3_ob_dmc_w66_tm0_batched
diamond_s14_tg111_sp2_comp_J3_ob_dmc_w66_tm3_batched
diamond_s14_tg111_sp2_real_J3_ob_dmc_w66_tm0_dla_batched
graphene_s206_tg231_sp2_comp_J0_ob_dmc_w66_tm0_batched

The random seed for each run is determined by a hash of each respective key.

To decode the meaning of these inputs:

diamond_s14_tg111_sp2_real_J3_ob_dmc_w66_tm0_dla_batched

system           = diamond
size             = 14 atoms
twist_grid       = 1x1x1
nspin            = 2
qmcpack_build    = real
jastrow          = J3
lr_handler       = optimized breakup
qmc              = dmc
walkers_per_rank = 6
nonlocalmoves    = v0
dla              = yes
driver           = batched

To Reproduce

Steps to reproduce the behavior:

  1. Copy the files from the following path on Summit: /gpfs/alpine/proj-shared/mat151/jtkrogel/ecp_roadtest/test_runs/bug_reproduction/set01
  2. Remove the old qmcpack outputs from each directory
  3. In each directory, replace the orbital href in each input file to point at the local copy of pwscf.pwscf.h5
  4. In each directory, update the path to the qmcpack executable in dmc.bsub.in and submit

QMCPACK version: git hash dcf2dc9 (current on 31 Mar 2021).

Expected behavior

The runs should complete successfully w/o hard failures.

jtkrogel commented 3 years ago

@ye-luo has requested a rerun of the full set with an updated version of LLVM. I will post those results here as well.

ye-luo commented 3 years ago

I reran all 5 cases with qmcpack built with a newer clang. Only one case hits the GPU memory limit (allocation failure); all the others are fine. The one consuming more memory than expected needs some investigation, but it should be straightforward. No random crashes on my try. @jaron is repeating all the runs with my build of qmcpack; let us wait and see what happens.

jtkrogel commented 3 years ago

The reruns are complete (all 2100). Only one failure this time. It is the case that @ye-luo refers to above.

Following a fix of this case, I will rerun again, but with shuffled seeds and slightly longer runs. For reference, each set of these runs takes a couple of days to complete but only consumes about 100 node hours in total.

ye-luo commented 3 years ago

I don’t have an ETA right now for fixing that failure case because my debug-friendly GPU is gone. I would recommend starting the longer runs without a fix.

jtkrogel commented 3 years ago

Update: rerunning with 4x samples and shuffled seeds uncovered more failures, but all of the same kind (CUDA error: out of memory in a DMC section). Previously, 1 such case was encountered. Increased sampling produces 82 such cases.

The failures are basically evenly spread for diamond and graphene and only occur at larger system sizes. For each structure, a handful of failures occur for 134 C atoms; many failures occur for 206 C atoms.

Once a fix is in, it would be good to increase the system size even further to flush this out in even more depth.

jtkrogel commented 3 years ago

All of the failures are for no Jastrow, so likely the OOM is related to walker branching.

ye-luo commented 3 years ago

@jtkrogel Just to get more clues: is it always the same system having this memory issue, or do other systems that did not have memory issues in the initial 2100 runs also start to fail with OOM?

jtkrogel commented 3 years ago

The original system was a graphene 206 atom system, so it overlaps. The systems with failures are graphene or diamond with 134 or 206 atoms.

jtkrogel commented 3 years ago

The problem is due to unbounded growth in the walker population. The underlying problem is the same w/ or w/o a Jastrow.

For reference, target walkers is reported as 396 in the log output for each of the cases below.

graphene 206 w/o a Jastrow (OOM): dmc_walkers_J0

graphene 206 w/ J2 (no OOM): dmc_walkers_J2

graphene 206 w/ J3 (no OOM): dmc_walkers_J3

jtkrogel commented 3 years ago

QMCPACK inputs:

   <qmc method="vmc_batch" move="pbyp">
      <parameter name="walkers_per_rank"    >    66              </parameter>
      <parameter name="warmupSteps"         >    2               </parameter>
      <parameter name="blocks"              >    5               </parameter>
      <parameter name="steps"               >    2               </parameter>
      <parameter name="subSteps"            >    2               </parameter>
      <parameter name="timestep"            >    0.3             </parameter>
      <parameter name="useDrift"            >    no              </parameter>
   </qmc>
   <qmc method="dmc_batch" move="pbyp">
      <parameter name="walkers_per_rank"    >    66              </parameter>
      <parameter name="warmupSteps"         >    2               </parameter>
      <parameter name="blocks"              >    5               </parameter>
      <parameter name="steps"               >    2               </parameter>
      <parameter name="timestep"            >    0.02            </parameter>
      <parameter name="nonlocalmoves"       >    no              </parameter>
   </qmc>
   <qmc method="dmc_batch" move="pbyp">
      <parameter name="walkers_per_rank"    >    66              </parameter>
      <parameter name="warmupSteps"         >    2               </parameter>
      <parameter name="blocks"              >    40              </parameter>
      <parameter name="steps"               >    2               </parameter>
      <parameter name="timestep"            >    0.01            </parameter>
      <parameter name="nonlocalmoves"       >    no              </parameter>
   </qmc>

Job submission inputs:

#!/bin/bash
#BSUB -P MAT151
#BSUB -J dmc
#BSUB -o dmc.out
#BSUB -e dmc.err
#BSUB -W 01:00
#BSUB -nnodes 1
#BSUB -alloc_flags "smt1"

module load gcc/8.1.1
module load spectrum-mpi
module load cuda
module load essl
module load netlib-lapack
module load hdf5/1.10.4
# private module until OLCF provides a new llvm build
module use /ccs/proj/mat151/opt/modules
module load llvm/main-20210112

export OMP_NUM_THREADS=7
jsrun -b rs -c 7 -g 1 -d packed -n 6 -r 6 -a 1 /ccs/proj/mat151/opt/qmcpack/develop-20210327/build_summit_Clang_offload_cuda_real/bin/qmcpack dmc.in.xml --enable-timers=fine

jtkrogel commented 3 years ago

The legacy cpu and batched cpu also appear to have this problem. Here is diamond 206 dmc tm0 with cpu legacy v3.10.0 laid alongside cpu and gpu batched drivers (4x shorter runs on series 2 than above):

dw_cpu_gpu_tm0

The lightest colored curve is legacy cpu v3.10.0.

prckent commented 3 years ago

Thanks for unearthing this Jaron. Intriguing. So something is wrong & has been wrong for some time. With t-moves a 0.01 timestep DMC in carbon diamond should be completely robust.

Perhaps check that VMC gives the same result in all versions as an extra sanity check, plus check the magical svn version ~6K we have used as a reference before? After that, checking the walker ages and energy distribution could rule out things like a stuck / not-moved walker. It really does look like the population control is using the wrong trial energy or the wrong walker count and not renormalizing the population correctly.

jtkrogel commented 3 years ago

I think it is the population control algorithm. Specifically, I suspect the strange role "warmupsteps" plays in it. I am going to rerun this case cpu only w/ varied warmupsteps and see what shakes out. If this turns out to be the knob determining good and bad behavior, I would request that we ensure small, or even zero, warmupsteps still results in stable runs.

prckent commented 3 years ago

Warmupsteps should have zero effect at medium and long times. If that is the problem we have a bad algorithm in need of replacement.

jtkrogel commented 3 years ago

Here is the result of varying warmupsteps in the legacy and batched cpu drivers. The test case is 206 atom diamond, dmc w/ T-moves and the same random seed regardless of inputted warmupsteps.

The plots below correspond to 10 steps w/ timestep of 0.02 followed by 80 steps w/ timestep 0.01. The same value of warmupsteps was used in both dmc series 1 and 2. The values of warmupsteps used were [0,1,2,3,5,7,11,17,25,38,57,86,129], consistent with the legend. The target population is 396.

Walker population, legacy cpu drivers

The population explosion in series 1 depends inversely on warmupsteps. Even for largish warmupsteps (e.g. 25) the population eventually begins to diverge. The walker ceiling of 2x the target pop is visible. NumOfWalkers_qmcpack_210331_dcf2dc9_legacy

Walker population, batched cpu drivers

Similar behavior w/ batched drivers, though there is no population ceiling. For even larger warmupsteps (57), the population still eventually starts to diverge. NumOfWalkers_qmcpack_210331_dcf2dc9_batched

Trial energy, legacy cpu drivers

Legacy drivers show a strong discontinuity in trial energy between series 1 and 2; it is continuous only for warmupsteps=0. Unless very large warmupsteps are used, the trial energy is essentially non-stochastic (compare smooth and rough curves). Only for large warmupsteps, w/ the trial energy dominated by stochastic effects, is the population size controlled. TrialEnergy_qmcpack_210331_dcf2dc9_legacy

Trial energy, batched cpu drivers (full scale top, zoomed bottom)

Batched drivers. When warmupsteps is 0, the trial energy is also 0 (!!!) in the first step or so. TrialEnergy_qmcpack_210331_dcf2dc9_batched

For warmupsteps>0, batched drivers have smaller discontinuity than legacy, but still visible. Again, only for the largest warmupsteps is the trial energy stochastically dominated and the population stable. TrialEnergy_qmcpack_210331_dcf2dc9_batched_zoom

Overall, these results demonstrate big problems with the population control algorithm in the legacy and batched drivers, particularly w.r.t. the odd behavior relating to warmupsteps.

We need to use algorithms without long memories and/or strong starting point effects. I don't expect it will be that hard to fix this.

ye-luo commented 3 years ago

After looking at the run, the population blow-up doesn't seem to be a bug to me.

During warmup, Etrial is updated based on Enow because Enow is still exponentially decaying toward a steady state. It is bad to collect and use EnergyHist (the running average of Enow) at this point; using EnergyHist.mean() potentially causes a huge delay in the population control.

The main stage assumes Enow fluctuates around Eref, so collecting EnergyHist and using EnergyHist.mean() for Eref makes sense.

So in these problematic runs, warmupsteps=2 is too small and EnergyHist kicks in too early, so the population control cannot respond quickly enough to prevent the population blow up.

The only fix I can imagine is introducing a robust scheme to automatically determine warmupsteps.
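
For concreteness, a minimal sketch of the two-stage update described above, assuming the standard logarithmic feedback form Etrial = Eref + feedback*ln(Ntarget/N); the function and variable names are illustrative, not the actual QMCPACK implementation:

import math

def update_trial_energy(e_now, energy_hist, step, warmupsteps,
                        n_walkers, n_target, feedback, tau):
    if step < warmupsteps:
        # Warmup stage: follow the instantaneous population average Enow
        # and use the strongest feedback (1/tau).
        e_ref = e_now
        fb = 1.0 / tau
    else:
        # Main stage: Eref is the running average of Enow, collected only
        # after warmup has ended.
        energy_hist.append(e_now)
        e_ref = sum(energy_hist) / len(energy_hist)
        fb = feedback
    # Logarithmic feedback pushes the population back toward its target.
    return e_ref + fb * math.log(n_target / n_walkers)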

jtkrogel commented 3 years ago

In some of these cases, warmupsteps of 57 is too small. We need to do some fundamental rethinking here. I'm really not a fan of the "non-robustness is not a bug" philosophy.

ye-luo commented 3 years ago

@jtkrogel what you have is a feature request for self-determined warmupsteps. For a given warmupsteps, the current implementation respects the user input even when it doesn't seem like a good choice. We have been bitten too many times when trying to improve things by ignoring input values. The intended behavior should be automatically determined warmupsteps when the input tag is not given; otherwise, respect the input and do exactly what it asks.

The automation needs to handle

prckent commented 3 years ago

The requirement here is to have a basic algorithm that is correct and safe. This is a minimal requirement for a production science code and competing codes get this right.

Fancy automation later.

ye-luo commented 3 years ago

@prckent the challenge here is what that "basic algorithm" is. To be correct and safe, it seems we need a "fancy algorithm" handling a certain degree of automation. The specs listed are actually required for "correct and safe". Any suggestions for schemes?

jtkrogel commented 3 years ago

I think the main problem relates to how the trial energy is set and updated. From the last graph shown above, it is clear that in the problem cases, the trial energy updates far too slowly (I suspect due to retaining long time memory of higher prior energies via time averaging from the beginning of time), and has a long-lived bias relative to the asymptotic mean of the true unbiased local energy.

It would help to write down what the current algorithm is (math) and how warmupsteps features in that algorithm.

ye-luo commented 3 years ago

@jtkrogel the current algorithm, written in words above, confirms what you suspected. The running average used to update Etrial starts after the warmupsteps.

jtkrogel commented 3 years ago

The large population limit is unbiased. The correct time dependent trial energy is the large population limit of the mean local energy at the current point in time.

At finite population, we can't estimate this limit very well, but instead use information from prior times to reduce the variance of the estimated infinite population limit. Basic time averaging introduces biases in the estimated trial energy that are long lived (like 1/t while the actual trial energy varies more like exp(-a*t)). The only place where long time averaging and the unbiased trial energy are consistent is in the long time limit (i.e. after the ~exponential equilibration period is over). So long time averaging is unbiased (and also guaranteed to be stable) only after the equilibration period, which is not known beforehand, and so we end up with bias and instability in cases like these with this type of algorithm.

I believe other approaches have reset the time average occasionally, or used a box-car like windowing approach. We could try these, but they trade lower bias for higher variance.
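
For illustration, a box-car Eref could look something like this minimal sketch; the window length is a free parameter and the names here are hypothetical, not an existing QMCPACK option:

from collections import deque

def make_boxcar_eref(window=100):
    # Eref from only the last `window` local-energy averages rather than the
    # full post-warmup history: less memory of equilibration (lower bias),
    # fewer samples in the average (higher variance).
    history = deque(maxlen=window)
    def update(e_now):
        history.append(e_now)
        return sum(history) / len(history)
    return update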

An alternative that I think could be effective is to instead use the full time behavior to fit a simple exp(-a*t) model of the trial energy that is updated at each step as new data arrives. In a least-squares fit, the asymptote would quickly track the long-time mean (automatically excluding equilibration), and the transition into the asymptotic region (equilibration) would get an estimate of the large-population limit lower in variance than single-time or box-car time averaging and lower in bias than full-time averaging. I think it would cover both regimes well automatically and is simple.

ye-luo commented 3 years ago

> The large population limit is unbiased. The correct time dependent trial energy is the large population limit of the mean local energy at the current point in time.

> At finite population, we can't estimate this limit very well, but instead use information from prior times to reduce the variance of the estimated infinite population limit. Basic time averaging introduces biases in the estimated trial energy that are long lived (like 1/t while the actual trial energy varies more like exp(-a*t)). The only place where long time averaging and the unbiased trial energy are consistent is in the long time limit (i.e. after the ~exponential equilibration period is over). So long time averaging is unbiased (and also guaranteed to be stable) only after the equilibration period, which is not known beforehand, and so we end up with bias and instability in cases like these with this type of algorithm.

I thought about the same thing and came to the same conclusion: one cannot predict equilibration based on a quantity that can only be accessed reliably after equilibrium is reached.

> I believe other approaches have reset the time average occasionally, or used a box-car like windowing approach. We could try these, but they trade lower bias for higher variance.

It is less favorable for the main stage, which prefers to accumulate as long an average as possible to calculate Etrial. The windowing approach raises another question: how large should the window be? I prefer simpler things if we can find them.

> An alternative that I think could be effective is to instead use the full time behavior to fit a simple exp(-a*t) model of the trial energy that is updated at each step as new data arrives. In a least-squares fit, the asymptote would quickly track the long-time mean (automatically excluding equilibration), and the transition into the asymptotic region (equilibration) would get an estimate of the large-population limit lower in variance than single-time or box-car time averaging and lower in bias than full-time averaging. I think it would cover both regimes well automatically and is simple.

I also feel fitting something like exp() is better here, although I'm not sure it is a true exp() since we are adjusting Etrial at the same time. A fit only works with sufficient data points, but Etrial must be updated in a timely way during warmup, so I would propose the following.

In the warmup stage, adjust Etrial based on Enow as we do now.

  1. Have an initial warmupsteps (maybe 1/tau).
  2. Once it is reached, make a fit of all previous steps.
  3. Check whether the fitted asymptotic value and the current population average are within error of each other.
  4. If 3 is no, increase warmupsteps and repeat 2; if yes, move to the main stage.

In the main stage, use the running average Eref for calculating Etrial as we do now.

In this way, QMCPACK prints out how many warmupsteps were actually used, and the postprocessing can consume this info when handling scalar.dat.
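
A minimal sketch of the check in steps 2-4, written as a Python prototype rather than anything in the QMCPACK source; the use of SciPy's curve_fit, the initial guesses, and the error criterion are all illustrative assumptions:

import numpy as np
from scipy.optimize import curve_fit

def warmup_converged(energies, tau):
    # Fit E(t) = c + b*exp(-a*t) to the local-energy trace collected so far
    # and test whether the fitted asymptote agrees with the recent average.
    model = lambda t, c, b, a: c + b * np.exp(-a * t)
    t = np.arange(len(energies), dtype=float)
    try:
        popt, pcov = curve_fit(model, t, energies,
                               p0=(energies[-1], energies[0] - energies[-1], tau))
    except RuntimeError:
        return False                      # fit failed: keep warming up
    c_fit, c_err = popt[0], np.sqrt(pcov[0, 0])
    recent = np.asarray(energies[-max(len(energies) // 4, 2):])
    recent_err = recent.std() / np.sqrt(len(recent))
    # If the asymptote and the current average agree within error,
    # end the warmup and move to the main stage.
    return abs(c_fit - recent.mean()) < c_err + recent_err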

jtkrogel commented 3 years ago

I like (1) and (2) as they will remove some of the bias from sub-leading exponential decay. I don't think (3) and (4) are needed.

If E_T(t) = c + b*exp(-a*t), fit E_T using data from t0 to t and then set Etrial = E_T(t) (use the current, not the asymptotic value). Redo the fit and the Etrial update at each step (the model gives a rapid update with decent fidelity during the remaining equilibration after the selected warmupsteps are complete). That's it.

Doing the approach with 3 and 4 is fine I think (but a little more complex). I think it could work well either way, but maybe try the simpler one first.
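
A rough sketch of this per-step update, reusing the same exponential model as in the earlier sketch (again a Python prototype; the min_points fallback and the initial guesses are assumptions, not part of the proposal):

import numpy as np
from scipy.optimize import curve_fit

def etrial_from_fit(energies, min_points=5):
    # Refit E_T(t) = c + b*exp(-a*t) to all data collected so far and evaluate
    # the model at the current step (the current value, not the asymptote c).
    if len(energies) < min_points:
        return energies[-1]               # too few points: just follow Enow
    model = lambda t, c, b, a: c + b * np.exp(-a * t)
    t = np.arange(len(energies), dtype=float)
    popt, _ = curve_fit(model, t, energies,
                        p0=(energies[-1], energies[0] - energies[-1], 0.1))
    return model(t[-1], *popt)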

jtkrogel commented 3 years ago

After a video chat, @ye-luo and I have agreed on a good approach to try first involving a warmup period (w/o fitting) followed by the main phase (w/ exp fitting).

Thoughts welcome.

prckent commented 3 years ago

I think it would be helpful to write down the requirements -- even the most basic ones -- before developing a new scheme. There may be various opinions on what the requirements actually are. Unfortunately I am terribly tied up the next 48h.

Some questions/comments:

At what point do we require the population be <2xtargetpop? Currently we abort if this is reached. Maybe we should be lenient during the equilibration and allow any population.

I consider it a hard requirement that, provided population limits aren't breached, the trial energy updating will allow the ground state to be reached at long time, whatever the success of the equilibration phase. i.e. the trial energy should forget any equilibration settings in time and update fast enough that (say) carbon diamond and NiO work solidly. This also suggests a test of the algorithms.

A default equilibration period of 200 would work for ~ all runs. But if we have multiple DMC sections, how are we enabling/disabling equilibration in the later sections?

If "smart" functionality will be used to adaptively change the equilibration period, the needed variables should be written to files (along with the trial energy etc.) to enable perfect restarts in the future. Or impose that the equilibration has to fit in 1 block, etc. Otherwise we cannot restart DMC during equilibration and obtain the same results as without a restart.

Consider applying smart minimums, e.g. eqm steps * timestep must be >= 1 a.u., to increase robustness (a minimal example of such a check is sketched below).

Don't implement anything that would preclude varying the population or timestep for a potentially faster equilibration in future.
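
As an illustration of the "smart minimum" point above, the check is essentially a one-liner (the 1 a.u. threshold comes from the suggestion; the function name is hypothetical):

import math

def min_warmup_steps(timestep, min_projection_time=1.0):
    # Require the warmup to cover at least ~1 a.u. of imaginary time,
    # e.g. timestep 0.01 implies at least 100 warmup steps.
    return math.ceil(min_projection_time / timestep)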

prckent commented 3 years ago

Maybe not user friendly enough, but perhaps DMC equilibration should be its own section?

jtkrogel commented 3 years ago

I think the requirements are mostly basic and point to what is needed to solve the problems we are seeing.

The only hard requirements I see are that the algorithm is stable (limited population fluctuation) and unbiased at intermediate to long times.

The current instability and bias we are seeing is due to a poor estimate of the time-dependent trial energy. A departure between the poorly estimated and ideal (large population) trial energy leads directly to both the large population growth and bias in the estimates.

The closer we can get the estimated and ideal time dependent trial energy functions, the less population fluctuation and bias we will encounter. The hard limit on the fluctuations beyond that is mediated only by the energy variance (trial wavefunction).

The proposal above is about the simplest model that captures well the time dependent form of the trial energy. We could try others, but the basic problem is infidelity in form. Including variable timestep in this or other approaches basically means that we need to track t_i as well as E_i, but that is about it.

jtkrogel commented 3 years ago

@ye-luo any thoughts on how we could start putting in the exponential fitting algorithm? Should we test the stability of using Enow always? (This would not be a solution, but it would approximately gauge the success of any possible solution, as it trivially follows the local energy.)

ye-luo commented 3 years ago

I prefer not to add "smart" handling of multiple consecutive DMC blocks, at least as an initial implementation, and hopefully it will not be needed at all.

Without "smart" warmup, the warmupsteps input should be respected; warmupsteps=0 should shut down any warmup treatment. With "smart" warmup, a restart during an equilibration phase should be treated as a new DMC calculation. The overall energy trajectory should not diverge from the one without a restart.

@jtkrogel Using Enow always (setting warmupsteps large) should give me a large enough dataset to try fitting.

ye-luo commented 3 years ago

@jtkrogel could you rerun these tests on Summit with develop-20211012 QMCPACK to see if the bugs which were fixed recently are the source of population explosion?

jtkrogel commented 3 years ago

Updates for most recent CPU code (batched drivers).

The divergence in the trial energy is gone: TrialEnergy_qmcpack_211110_2706d26_batched

But the divergences in walker population remain: NumOfWalkers_qmcpack_211110_2706d26_batched

prckent commented 3 years ago

Can we retitle this one as "Rapid population explosion in batched GPU DMC on Summit"?

Fixing this one seems best to do ahead of #3608. It has the easiest reproducer since the run immediately goes wrong.

jtkrogel commented 2 years ago

Reproducer files for the batched cpu code behavior shown above can be found at /ccs/proj/mat151/issue_3082 on Summit.

ye-luo commented 2 years ago

I tried the reproducer with both the batched and classic drivers. The population explosion issue is related to the population control; it is not specific to the batched driver. wu25 means using 25 warmup steps, which is included in our dmc.dat, hence the figure. classic-batch

jtkrogel commented 2 years ago

Agreed. I have a sequence of runs going over the weekend to try to focus more closely on long term stability.

ye-luo commented 2 years ago

This long run shows better what happens. In this calculation the population is small (~400 walkers) and the wave function is DFT without a Jastrow, so the variance is large and the energy fluctuations are large. long

This long run uses 160 warmup steps. During warmup, the trial energy follows the population average of the local energy. Once the warmup ends, the trial energy is based on the time average of that population average from then on; hence, the trial energy lags behind the local energy evolution. Because the local energy converges from above, the trial energy is generally higher than its infinite-time value, and the effect is a growing population. By default, the "classic" branching scheme and feedback 1 are weak controls. For this reason, the population grows more than it decreases, and hence it keeps exploding.
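
To put a rough number on that mechanism (standard DMC branching, not a quote from the QMCPACK source): each step multiplies the population by roughly

N(t+tau)/N(t) ~ exp(tau*(Etrial - <E_L>))

so a trial energy sitting, say, 0.5 Ha above the settled local-energy average at tau = 0.01 grows the population by about exp(0.005), i.e. ~0.5% per step, which compounds to roughly 150x over a thousand steps unless the feedback term pulls Etrial back down.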

The good thing is that there is no energy collapse. Note: a fixed population turns a fluctuating population into fluctuating weights, so the issue also needs to be treated properly there, albeit in a slightly different form.

There are a few points to be addressed:

  1. Warmup: how to automate its value.
  2. How to control the population: this depends on the population size, the variance, and the feedback control. When the wave function is good and the population is large, the issue can be negligible; when the WF quality is poor and the population is small, the problem becomes very visible.

jtkrogel commented 2 years ago

Thanks Ye, I have other variations on runs like this that may provide additional information. I too suspect that the population control itself can be too weak. As we noted in our conversation, the energy variance can drive the instability. This case has a variance that is similar to 64 atom NiO, so developing approaches with good control here will likely help to stabilize other runs.

prckent commented 2 years ago

First, I'll assume we don't have any bugs here, which is still a bit of an assumption.

Second, if we are going to cook up a new scheme, how about one where we change to a dynamic feedback "constant" based on a running average of the population, only updating when outside a particular window, say +-50% of target population. This in turn would raise the question of whether we can have too strong a population control if the equilibration (warmup) is not set sufficiently long. However I would suggest fixing that if/when demonstrated to be a real problem. A run with high variance/poor starting energy likely needs strong control...
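
A minimal sketch of that kind of guardrail, simplified to use the instantaneous population rather than a running average; the +-50% window and the weak/strong feedback values are placeholders:

def choose_feedback(n_walkers, n_target, tau, weak=1.0, window=0.5):
    # Keep the weak default feedback while the population stays within
    # +-window of the target; switch to strong (1/tau) control outside
    # that band until the population comes back.
    if abs(n_walkers - n_target) > window * n_target:
        return 1.0 / tau
    return weak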

jtkrogel commented 2 years ago

This run is high variance, but not higher than 64 atom NiO with a Jastrow. I will post in a bit some data for this system when the warmup exceeds the DMC equilibration time.

jtkrogel commented 2 years ago

Below are results for a run with 1000 steps of warmup, well longer than the DMC equilibration period. During the "warmup" period, the trial energy follows the local energy closely and the run is stable. At the end of the warmup period (when trial energy averaging begins with no history), the trial energy is close to the mean local energy by chance and still the population diverges.

wu1000_ts0.01_energy

wu1000_ts0.01_population

Other runs that start (by chance) with the trial energy further from the local energy mean at the end of the warmup are even less stable. One possibility is to average the trial energy for a while before starting to use it. Still, the overall fluctuations in energy here outstrip the population control algorithm's ability to contain the divergences.

Since this likely can happen in real runs due to sudden local energy spikes, it is likely worth looking into population control algorithms with dynamic/responsive control strengths as Paul suggests.

I will perform and post other runs with stronger population control and somewhat larger population sizes.

ye-luo commented 2 years ago

Note that during warmup, the strictest population control is used with feedback = 1/tau. https://github.com/QMCPACK/qmcpack/blob/28b8ba78fc7f4d195b2d6821659456c4faa436b2/src/QMCDrivers/SimpleFixedNodeBranch.cpp#L340 This is the reason that population during warmup is very stable around the initial size.

jtkrogel commented 2 years ago

Do we have any evidence that stricter control (as in 1/tau) leads to noticeable population control bias? It seems like having a default that at least does not vary with tau is asking for it.

I think for this reason, David's codes formulated the feedback as 1/(g*tau), where g was the number of generations (steps) over which you would expect the population to return to its target, on average, with g being exposed as an input parameter.
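
For reference, combining the logarithmic form Etrial = Eref + F*ln(Ntarget/N) with the per-step growth factor sketched earlier (and assuming <E_L> ~ Eref) gives

ln(N_{k+1}/Ntarget) ~ (1 - tau*F) * ln(N_k/Ntarget)

so F = 1/(g*tau) shrinks the deviation from the target by a factor (1 - 1/g) each step, i.e. the population relaxes back over roughly g generations, consistent with the description of g above. This is a back-of-envelope relation, not something taken from the code.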

prckent commented 2 years ago

We have orders of magnitude more electrons than in the past, and likely higher variance. Still, if you know of any older publications we can look at, we should look at them.

Are we fully convinced that the problems are entirely consistent with population control failures? e.g. that we don't have stuck walkers or neighbor-list bugs.

ye-luo commented 2 years ago

There is a correlation among population size, time step, and variance. Imagine the population size is 2, tau is 0.1, and the variance is 100: the bias is clearly very visible. For a given variance, the population size must be large enough to make its bias smaller than the time step error.

If the feedback is chosen as 1/(g*tau), then the population control bias depends on g. The number of steps to decorrelation depends on tau.

@prckent 1) we are running CPU only; 2) the behavior is consistent between the classic and batched drivers; 3) stuck walkers or bugs usually cause a segfault or energy collapse, which doesn't show here.

jtkrogel commented 2 years ago

I think the observed behavior is consistent with population control failures. In the event of e.g. stuck walkers, if we managed to stabilize the population, we would observe the local energy sliding down into a new lower equilibrium. The run here can be stabilized with strong enough control and no new lower minimum is apparent. The "event" that drives the instability here is a sudden switch to loose population control.

Still, this does raise the classic issue of stability (tight population control) vs accuracy (population control bias). I like the idea of introducing "guardrails", or population levels where tighter control kicks in until the population returns to its target. To avoid bias, these likely short sections of tighter control could then be excised from the production data (in post-processing, or via other schemes as desired).

jtkrogel commented 2 years ago

Using a feedback of 4 stabilizes the population (barely). The behavior is not very good though:

(Timestep is 0.02, warmup is for 350 steps, ~400 walkers)

wu350_ts0.02_fb4_nw400_energy

wu350_ts0.02_fb4_nw400_population

jtkrogel commented 2 years ago

Here is the same calculation but with a feedback of 50 (1/tau = 1/0.02 = 50). It looks to me like the statistical behavior of the population is distinctly different before and after step 350, but it shouldn't be, based on Ye's description of the warmup population control:

wu350_ts0.02_fb50_nw400_energy

wu350_ts0.02_fb50_nw400_population

ye-luo commented 2 years ago

The population control is also affected by TrialEnergy, which is calculated differently during warmup and after.