QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org

Strange behavior of the DMC trial energy #789

Open jtkrogel opened 6 years ago

jtkrogel commented 6 years ago

During a production Mira run with the most recent SoA version of the code, I noticed large spikes/re-equilibration when identical DMC sections were run in series (no change in timestep). This behavior appears to relate to how the trial energy is reset, and it is undesirable because it leads to unnecessary equilibration and wasted data. I don't recall observing similar behavior in earlier versions of the AoS code (e.g. v3.1.1).

The last four sections of DMC data shown in the figures below use the same timestep (0.005/Ha) and each includes 10 warmup steps.

Behavior of the local energy. Spikes are visible at the onset of each DMC section. This is not visible for individual twists, but it is for the twist average. [Figure: local_energy]

Behavior of the trial energy. The trial energy is apparently reset in a way that lacks continuity between the identical DMC sections. [Figure: trial_energy]

jtkrogel commented 6 years ago

Additionally, the trial energy above appears to undershoot the local energy in the first visible DMC section (timestep=0.02/Ha), resulting in growth of the walker population beyond the target. [Figure: num_walkers]

prckent commented 6 years ago

Agreed that this is a bug. Is this the development version or the latest release?

ye-luo commented 6 years ago

Is your number of walkers much larger than the number of threads? I mentioned a bug to you during the ECP meeting: some initialization operation in the DMC driver will destroy the equilibrated population.

jtkrogel commented 6 years ago

It is a development version, I believe. Executable location on Mira: /soft/applications/qmcpack/current/build_Clang++11_cplx_SoA/bin/qmcpack. Date run: April 12.

Self-reported data from QMCPACK log output:
Git branch: develop
Last git commit: 30eca5db445fa42925f8e9d9b03a5a85af0aeab0
Last commit date: Fri Feb 9 15:42:32 2018 -0500

@ye-luo samplesperthread=1.7

ye-luo commented 6 years ago

Try a run with samples/walkers per thread <= 1

jtkrogel commented 6 years ago

What will this show? The runs are not totally cheap, ~1.5 hrs on 8192 Mira nodes.

ye-luo commented 6 years ago

It will probably show no bumps and confirm that you are affected by the walkers-per-thread > 1 bug.

jtkrogel commented 6 years ago

Very well. If there are no objections I will submit a new job tomorrow.

ye-luo commented 6 years ago

Are you using reconfiguration=no?

jtkrogel commented 6 years ago

I do not set reconfiguration either way.

ye-luo commented 6 years ago

Here in the first section, it takes about 100-150 steps to converge, and the later sections need 50 steps. It is necessary to set warmupSteps close to that. During warmup, the trial energy is the population-averaged local energy of the current iteration; after warmup, it is the history average of those population-averaged values. So the trial energy of your first section is strongly biased by the short warmup.

From the second to the last block, there is always a bump. I guess that is affected by the nwalker>nthread issue.
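The rule described above can be sketched in a few lines. This is only an illustration of the description in this comment, not QMCPACK's implementation: the function name and the sample energies are hypothetical, and the feedback of the trial energy on the walker population is left out.

```python
from statistics import fmean


def trial_energy_series(pop_avg_energies, warmup_steps):
    """Trial energy after each step under the simplified rule.

    During warmup: the current population-averaged local energy.
    After warmup: the mean of all post-warmup population averages.
    """
    trial = []
    history = []  # population averages collected after warmup ends
    for step, e_pop in enumerate(pop_avg_energies):
        if step < warmup_steps:
            trial.append(e_pop)
        else:
            history.append(e_pop)
            trial.append(fmean(history))
    return trial


# Hypothetical energies (Ha) that are still relaxing when warmup ends.
energies = [-9.0, -9.5, -9.8, -10.0, -10.0, -10.0]
print(trial_energy_series(energies, warmup_steps=2))
```

With a warmup shorter than the relaxation, the post-warmup history keeps the unequilibrated values, so the trial energy stays above the converged value for the rest of the section.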

jtkrogel commented 6 years ago

The trial energy should be determined in such a way that an identical run can follow an already equilibrated run with no warmup steps (anything more is wasteful and easy to get wrong). More broadly, the behavior of the follow-on identical run should not depend on warmupsteps.

Somewhere (Overleaf?) let's write out the formulas used to determine the trial energy and then consider issues of continuity. The procedure that gets identical follow-on runs right should also give short re-equilibration times for small changes to the timestep in subsequent DMC sections.

I've run DMC in the way set out in this particular run for years without issue.

prckent commented 6 years ago

Other codes manage to get this right.

ye-luo commented 6 years ago

If the first section is equilibrated, then the following ones should take zero warmup. I'm not convinced from the plot that the first one is equilibrated.

ye-luo commented 6 years ago

It seems that the first section has a different time step than the later blocks. So the second section needs some warmup steps to further equilibrate the system at the smaller time step. From the third section on, zero warmup should be possible. It seems to me the current bad bumps are due to the nwalker>nthread problem.

jtkrogel commented 6 years ago

The first two sections are run with a timestep of 0.02/Ha (25 and 400 total steps respectively), the last four use 0.005/Ha (all with 500 total steps).

The first very short one is not equilibrated upon completion but should be continuous with the second (it is not), and certainly 400 steps at 0.02 timestep is enough to equilibrate (I've run VO2 many times now with a supercell of this size).

Regarding warmupsteps: a run with 100 warmup steps and 300 non-warmup steps should behave identically to one with 0 warmup steps and 400 non-warmup steps (perhaps apart from what data is written to scalar.dat). Otherwise we always introduce a discontinuity at the end of the warmup period, as observed here, because warmup and non-warmup steps are treated differently from the point of view of the trial energy.
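To make the dependence on the warmup split concrete, here is a toy comparison. It assumes the simplified rule stated earlier in the thread (after warmup, the trial energy is the mean of all post-warmup population averages) and a hypothetical exponential relaxation; the decay constant, amplitude, and function name are invented for illustration, and the energies are not fed back through the population as they would be in a real run.

```python
import math
from statistics import fmean


def final_trial_energy(energies, warmup_steps):
    # Under the simplified rule, the trial energy at the end of the run
    # is the mean of every population average recorded after warmup.
    return fmean(energies[warmup_steps:])


# Hypothetical relaxation toward -10.0 Ha over 400 steps.
energies = [-10.0 + 0.5 * math.exp(-k / 30.0) for k in range(400)]

split_a = final_trial_energy(energies, warmup_steps=100)  # 100 warmup + 300
split_b = final_trial_energy(energies, warmup_steps=0)    # 0 warmup + 400
print(f"100+300: {split_a:.4f} Ha")
print(f"  0+400: {split_b:.4f} Ha")
print(f"difference: {split_b - split_a:.4f} Ha")
```

The same 400 steps give two different final trial energies, purely because of where the warmup boundary was placed.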

ye-luo commented 6 years ago

"a run with 100 warmup steps and 300 non-warmup steps should behave identically to one with 0 warmupsteps and 400 non-warmup steps (perhaps apart from what data is written to scalar.dat)"

I don't agree. The DMC warmup and VMC warmup are different.

ye-luo commented 6 years ago

Anyway, clearer documentation is needed in the manual.

jtkrogel commented 6 years ago

This is an algorithmic problem, not a fundamental one. What you are saying is that the user needs to know when the equilibration period will end prior to running, or else he/she will get biased results. This points out at least one major problem with the current algorithm (since the first is generally not knowable, the second is practically guaranteed).

The only way we can document how to properly run this algorithm is to instruct the user to run once, throw away the results, and then run again correctly once the equilibration is known.

An algorithmic change that does not produce discontinuities or require user prescience (e.g. a histogram that "forgets") is warranted.
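One possible shape for an estimator that "forgets" is an exponentially weighted running average. This is a stand-in chosen for illustration, not the histogram mentioned above and not anything QMCPACK implements; `forgetting_average`, `memory_steps`, and `initial` are hypothetical names.

```python
def forgetting_average(values, memory_steps, initial=None):
    """Exponentially weighted running average.

    Older values are down-weighted with a memory of roughly
    `memory_steps` steps, so an early transient fades without the
    user having to say when equilibration ends. Passing `initial`
    carries the estimate over from a previous section.
    """
    if memory_steps < 1:
        raise ValueError("memory_steps must be >= 1")
    if not values:
        return []
    alpha = 1.0 / memory_steps
    estimate = values[0] if initial is None else initial
    out = []
    for v in values:
        estimate += alpha * (v - estimate)
        out.append(estimate)
    return out


# Two back-to-back sections with hypothetical energies (Ha).
section_1 = [-9.0, -9.6, -9.9, -10.0, -10.0]
section_2 = [-10.0, -10.0, -10.0]

first = forgetting_average(section_1, memory_steps=2)
second = forgetting_average(section_2, memory_steps=2, initial=first[-1])
print(first + second)
```

Seeding the second section with the last estimate of the first reproduces a single uninterrupted run, so an estimator of this kind has no discontinuity at the section boundary and no separate warmup rule.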

jtkrogel commented 6 years ago

Job resubmitted on Mira now with samplesperthread=0.7 and otherwise identical.

jtkrogel commented 6 years ago

Following PR #797, discontinuities are no longer visible in the local energy. Significant discontinuities remain in the trial energy, though it now recovers faster at the start of a new DMC section leading to shorter, but still present, re-equilibration times. Trial energy discontinuities also remain visible as discontinuities in the walker population.

See below for plots of these quantities performed as a rerun of the above but with a version of the code following PR #797.

Local energy: [Figure: local_energy2]

Trial energy: [Figure: trial_energy2]

Walker population: [Figure: num_walkers2]

ye-luo commented 6 years ago

The behaviour of TrialEnergy is expected. The input adds 5 warmupsteps to the later sections. This triggers the warmup infrastructure, which assumes the previous section can be anything (VMC, DMC with a different time step...).

jtkrogel commented 6 years ago

So is the recommendation to set warmupsteps explicitly to zero in these cases? Would this eliminate the discontinuities in the trial energy?

It would still be nice if some info from the prior section was forwarded to the following one so that more appropriate action could be taken by the warmup machinery.

ye-luo commented 6 years ago

I agree. It needs to be smarter. If the parameters are fully identical between sections, warmup should go to zero automatically. The current behaviour of warmupSteps = 0 is unknown to me.
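A minimal sketch of that idea, assuming a section can be summarized by a small set of parameters. `SectionParams`, its fields, and `effective_warmup` are hypothetical and do not correspond to QMCPACK's driver classes or input schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SectionParams:
    """Hypothetical summary of the parameters that define a QMC section."""
    method: str          # e.g. "vmc" or "dmc"
    timestep: float      # in 1/Ha
    target_walkers: int


def effective_warmup(previous: Optional[SectionParams],
                     current: SectionParams,
                     requested_warmup: int) -> int:
    """Skip warmup when the previous section used identical parameters."""
    if previous is not None and previous == current:
        return 0
    return requested_warmup


dmc_a = SectionParams("dmc", 0.02, 8192)
dmc_b = SectionParams("dmc", 0.005, 8192)

print(effective_warmup(None, dmc_a, 10))   # first section: keep warmup
print(effective_warmup(dmc_a, dmc_b, 10))  # timestep changed: keep warmup
print(effective_warmup(dmc_b, dmc_b, 10))  # identical section: no warmup
```

Comparing whole parameter sets keeps the decision conservative: any difference from the previous section falls back to the requested warmup.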

jtkrogel commented 3 years ago

Issues relating to warmup and weight discontinuities remain.