Open jtkrogel opened 3 years ago
Files from the run. The crash occurs for twist 008. varianceExpl.zip
Behavior of the kinetic energy:
Location of the trial energy drop in the first DMC series (zoomed in, see the downward slanting line):
Has an underconverged basis set been considered? For example, is the problem reproducible with doubled grids, or have the same hybrid parameters been used successfully with a similar electronic structure?
Dan is trying a run with a larger meshfactor. The same hybrid rep parameters are used for each of the nine twists. The variance/energy ratio indicates respectable quality for all of the twists, so I don't think the problem is broadly based (i.e. it is not due to the general mesh, but more likely confined to a small region of phase space):
>qmca -q ev *s000*scalar*
                      LocalEnergy                 Variance                  ratio
dmc.g000  series 0   -2176.465364 +/- 0.025357    55.713636 +/- 1.058126    0.0256
dmc.g001  series 0   -2176.454010 +/- 0.019833    54.858046 +/- 0.310613    0.0252
dmc.g002  series 0   -2176.429570 +/- 0.015292    54.685930 +/- 0.332021    0.0251
dmc.g003  series 0   -2176.446316 +/- 0.015674    54.932212 +/- 0.490942    0.0252
dmc.g004  series 0   -2176.445535 +/- 0.021688    53.899520 +/- 0.279747    0.0248
dmc.g005  series 0   -2176.419633 +/- 0.019214    54.384172 +/- 0.246866    0.0250
dmc.g006  series 0   -2176.464741 +/- 0.020904    55.483628 +/- 0.592998    0.0255
dmc.g007  series 0   -2176.434707 +/- 0.017068    54.000493 +/- 0.287997    0.0248
dmc.g008  series 0   -2176.393043 +/- 0.016223    54.034666 +/- 0.341088    0.0248
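For readers checking the numbers, the ratio column appears consistent with the variance divided by the absolute value of the local energy. The sketch below recomputes it for two rows copied from the table; the function name `variance_energy_ratio` is made up for illustration and is not part of qmca.

```python
# Energies and variances copied from the dmc.g000 and dmc.g008 rows above.
rows = {
    "dmc.g000": (-2176.465364, 55.713636),
    "dmc.g008": (-2176.393043, 54.034666),
}

def variance_energy_ratio(energy, variance):
    """Return variance / |energy| (assumed definition of the 'ratio' column)."""
    return variance / abs(energy)

# Round to four decimals to match the precision shown in the table.
ratios = {
    name: round(variance_energy_ratio(energy, variance), 4)
    for name, (energy, variance) in rows.items()
}

for name, ratio in ratios.items():
    print(name, ratio)
```

Under that assumed definition, both rows reproduce the tabulated values (0.0256 and 0.0248).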
Thankfully, we have the coordinates of walkers that have sampled this portion of the phase space. With these, it should be possible to isolate the bug rather quickly on a workstation.
I think the smoothing scheme used between the atomic and interstitial regions is not robust. Has the cutoff_radius been tuned?
Please note that the full fileset for the run (including wavefunction and checkpoint files) is now available at OLCF, see issue header.
What process do you mean when you say "tuned"?
Describe the bug
A DMC population explosion was reported to me by Dan Staros. The typical culprit for these is a "stuck walker" (constant rejection in an area of low potential energy within the core region of a non-local pseudopotential). In this case, however, the potential energy remains nearly constant. Instead, the kinetic energy of a single walker suddenly falls from about 1000 Ha to 200 Ha, resulting in a population explosion in a small part of the phase space.
This points to issues in the calculation of the kinetic energy of the trial wavefunction. To my knowledge, this has not been seen before for pure B-spline based Slater-Jastrow wavefunctions, so the most likely culprit is the less commonly used hybrid atomic orbital code that was used in this case.
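A sudden fall of this kind can be located in a kinetic energy trace with a simple running-median check. The following is only a sketch: the function `find_sudden_drops`, its `window` and `fraction` parameters, and the trace values are all made up for illustration (the trace is synthetic, chosen to mimic the roughly 1000 Ha to 200 Ha fall described above, and is not data from the run).

```python
from statistics import median

def find_sudden_drops(values, window=5, fraction=0.5):
    """Return indices whose value is below `fraction` times the median
    of the preceding `window` values."""
    drops = []
    for i in range(window, len(values)):
        baseline = median(values[i - window:i])
        if values[i] < fraction * baseline:
            drops.append(i)
    return drops

# Synthetic kinetic energy trace in Ha (illustrative only, not run data).
trace = [1003.0, 998.0, 1001.0, 996.0, 1002.0, 999.0, 204.0, 198.0, 201.0]

drops = find_sudden_drops(trace)
print("first flagged step:", drops[0] if drops else None)
```

The first flagged index marks the onset of the drop. Later indices keep being flagged until the running median itself has moved to the lower level.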
To Reproduce
It is not yet known whether the behavior is easily reproduced. If it is, a checkpoint produced at the end of the first DMC series that samples this part of the phase space could be used to isolate a particular walker configuration in this region. The issue could then be debugged locally on a workstation by comparing the trial wavefunction Laplacian at this coordinate with and without the hybrid method.
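The comparison proposed here can be sketched with a finite-difference estimate of the local kinetic energy, -1/2 (∇²ψ)/ψ, evaluated at one coordinate for two implementations of the same wavefunction. Everything in this sketch is an assumption for illustration: `psi_reference` and `psi_alternate` are toy stand-ins (both exp(-|r|)) for the pure B-spline and hybrid evaluation paths, the point is arbitrary, and none of this is QMCPACK code.

```python
import math

def local_kinetic_energy(psi, r, h=1e-3):
    """Estimate -1/2 * (laplacian psi) / psi at point r by central differences."""
    center = psi(r)
    laplacian = 0.0
    for axis in range(len(r)):
        plus = list(r)
        minus = list(r)
        plus[axis] += h
        minus[axis] -= h
        laplacian += (psi(plus) - 2.0 * center + psi(minus)) / (h * h)
    return -0.5 * laplacian / center

def psi_reference(r):
    """Toy stand-in for the reference evaluation path: exp(-|r|)."""
    return math.exp(-math.sqrt(sum(x * x for x in r)))

def psi_alternate(r):
    """Toy stand-in for the alternative evaluation path of the same orbital."""
    return math.exp(-(sum(x * x for x in r) ** 0.5))

# Arbitrary test coordinate (hypothetical, not a walker from the run).
point = [0.3, -0.2, 0.5]
t_ref = local_kinetic_energy(psi_reference, point)
t_alt = local_kinetic_energy(psi_alternate, point)
discrepancy = abs(t_ref - t_alt)

print("reference:", t_ref, "alternate:", t_alt, "difference:", discrepancy)
```

For exp(-|r|) the exact local kinetic energy is 1/|r| - 1/2, which gives a check on the finite-difference step. In an actual debugging session the two callables would wrap the two evaluation paths at the walker coordinate taken from the checkpoint, and a large discrepancy would point to the faulty path.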
The full dataset (all inputs/outputs) from the run in question is available on Summit at:
Expected behavior
The local kinetic energy should remain in the vicinity of 1000 Ha rather than dropping to 200 Ha.
System: Cori at NERSC.