QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org

Diffusion Monte Carlo trace anomaly in batched QMCPACK code #4648

Open djstaros opened 1 year ago

djstaros commented 1 year ago

Describe the bug A Diffusion Monte Carlo calculation using the batched (real CPU build) version of QMCPACK 3.16.0 on ALCF's Polaris shows nearly periodic dips in the DMC energy trace on the order of 5 Ha. The problem does not appear in an otherwise identical calculation using the legacy (real CPU build) version of QMCPACK 3.16.9 on NERSC's Perlmutter.

To Reproduce

  1. If necessary, install QMCPACK 3.16.0 on Polaris from here
  2. Run QMCPACK 3.16.0 on ALCF's Polaris using the attached input files in the location bug_inputs/dmc_3160_batched_polaris and the orbitals specified in the NOTE below.
  3. Plot the resulting energy trace using the command: qmca -t -q e *.s001.scalar.dat
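
If qmca is not handy, the trace can also be inspected with a short script. Below is a minimal Python sketch, assuming the usual scalar.dat layout (a single '#' header line naming the columns followed by whitespace-separated numeric rows); the column names and the 5 Ha dip threshold are illustrative assumptions.

```python
# Minimal sketch for inspecting a DMC energy trace from a QMCPACK *.scalar.dat file.
# Assumes a '#' header line naming the columns, followed by numeric rows.
import numpy as np
import matplotlib.pyplot as plt

fname = "dmc.s001.scalar.dat"  # adjust to the actual file name

with open(fname) as f:
    header = f.readline().lstrip("#").split()
data = np.loadtxt(fname)  # loadtxt skips the '#' header automatically

energy = data[:, header.index("LocalEnergy")]
blocks = np.arange(len(energy))

# Flag blocks that dip far below the median (5 Ha is an illustrative threshold).
dips = np.where(energy < np.median(energy) - 5.0)[0]
print("blocks with >5 Ha dips:", dips)

plt.plot(blocks, energy)
plt.xlabel("block")
plt.ylabel("LocalEnergy (Ha)")
plt.savefig("trace.png")
```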

NOTE: The QP2QMCPACK.h5 file containing the orbitals is provided through a public Google Drive link here due to its size (98 MB).

Description of the system An 8-atom primitive cell of ferromagnetic monolayer CrI3 with more than 20 Å of vacuum in the z-direction, containing 38 spin-up and 32 spin-down electrons. The calculation used ppn boundary conditions and 4,4,1 periodic images. The orbitals are a single determinant of natural orbitals (NOs) obtained from three iterative CIPSI -> NO cycles and represented in a GTO basis set.

Expected behavior Reproducing the run should show the same trace problem as in the attached image bug_inputs/trace_dmc_3160_batched_polaris.png. Compare this to the regular legacy behavior on Perlmutter (bug_inputs/trace_dmc_3169_legacy_perlmutter.png).

Inputs and photos of issue bug_inputs.zip

ye-luo commented 1 year ago

Could you run the batched driver with the CPU build on Polaris and report how it compares to your Perlmutter CPU legacy-driver runs?

prckent commented 1 year ago

Please can you add a description of the system here.

djstaros commented 1 year ago

@ye-luo Do you mean batched driver with CPU build on Perlmutter?

djstaros commented 1 year ago

@prckent My apologies for missing this. The issue has been updated with the system details.

ye-luo commented 1 year ago

> @ye-luo Do you mean batched driver with CPU build on Perlmutter?

I thought you were using the offload build on Polaris, but actually you were not. So please run the batched driver with the CPU build you have on Perlmutter.

ye-luo commented 1 year ago

@djstaros the energy drops are very suspicious. Could you test two runs with the batched driver in the CPU build using the latest QMCPACK develop branch on Perlmutter? In one of the runs, add <parameter name="crowd_serialize_walkers"> yes </parameter>
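
For reference, one way to add that parameter to every DMC block of an existing input file is sketched below using Python's standard ElementTree; the file names are placeholders, and it assumes the DMC driver appears as a <qmc method="dmc" ...> element, as in typical QMCPACK inputs.

```python
# Sketch: insert crowd_serialize_walkers=yes into each DMC <qmc> block of an input file.
# File names are placeholders; assumes the DMC driver is a <qmc method="dmc" ...> element.
import xml.etree.ElementTree as ET

tree = ET.parse("dmc.in.xml")
for qmc in tree.getroot().iter("qmc"):
    if qmc.get("method", "").startswith("dmc"):
        param = ET.SubElement(qmc, "parameter", {"name": "crowd_serialize_walkers"})
        param.text = " yes "
tree.write("dmc.serialized.in.xml")
```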

djstaros commented 1 year ago

@ye-luo Sure, I will submit a run adding that flag, with the most updated develop branch.

In the meantime, here is the result of using the batched driver with the real CPU code on Perlmutter...

[Energy trace plot: eTrace_s001]

...the energy drops appear again. For reference, this run had about 8,400 total walkers, whereas the original Polaris run had about 21,000 and the legacy Perlmutter run had about 15,000.

prckent commented 1 year ago

Thanks Dan for the system description and the updates. The CPU result is very revealing: clearly there is a real and generic problem that can't be blamed on offloading to GPUs here. Since you didn't mention it, can we assume that everything looked OK on the VMC side? i.e. similar energies and variances.

jtkrogel commented 1 year ago

Hi Dan, can you please also upload the scalar.dat and dmc.dat files for the two runs here? This will help to diagnose the issue further.

djstaros commented 1 year ago

@prckent Yes, the VMC trace behaves normally and the energies/variances were as expected for both the legacy and batched real CPU codes.

djstaros commented 1 year ago

@jtkrogel Good idea. I'm uploading here a zip with the scalar.dat and dmc.dat files from a.) the batched Polaris run, b.) the legacy Perlmutter run, and c.) the batched Perlmutter run.

Outputs bug_outputs.zip

djstaros commented 1 year ago

@ye-luo No final verdict yet, but your walker-serialization suggestion on Perlmutter looks well behaved so far, with over 120 blocks ("samples" on the trace plots) already completed and a running error bar of 1 mHa. Without this flag, the batched Perlmutter energies had already dipped by that number of blocks, so this is encouraging. I can upload the final results in about 1.5 hours (I'll update this comment with the new information).

jtkrogel commented 1 year ago

The issue is reproduced in the batched code on Perlmutter. See below:

[Plot: Batched Polaris (dmc_polaris_batched)]

[Plot: Legacy Perlmutter (dmc_perlmutter_legacy)]

[Plot: Batched Perlmutter (dmc_perlmutter_batched)]

jtkrogel commented 1 year ago

It's most likely that the issue relates to the population control algorithm (which differs inexplicably between the legacy and batched codes).

It is also possible, but less likely, that the local energy differs between the two. @djstaros, to rule this out, would you run VMC on Perlmutter with the number of walkers and blocks matching the DMC above? From this we can check whether any of the energy components behave differently.
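
For that check, here is a minimal sketch of the kind of column-by-column comparison that can be run on two scalar.dat files. It assumes the usual '#' header naming the columns and uses a naive standard error (no equilibration cut, no autocorrelation correction), so qmca should still provide the final numbers; the file names are placeholders.

```python
# Sketch: compare per-column means (LocalEnergy, Kinetic, LocalPotential, ...) of two
# scalar.dat files. Naive standard error; no equilibration cut or autocorrelation handling.
import numpy as np

def column_stats(fname):
    with open(fname) as f:
        names = f.readline().lstrip("#").split()
    data = np.loadtxt(fname)
    return {n: (data[:, i].mean(), data[:, i].std(ddof=1) / np.sqrt(len(data)))
            for i, n in enumerate(names)}

legacy = column_stats("legacy.s000.scalar.dat")    # placeholder file names
batched = column_stats("batched.s000.scalar.dat")

for name in legacy:
    if name in batched:
        (m1, e1), (m2, e2) = legacy[name], batched[name]
        nsigma = abs(m1 - m2) / np.hypot(e1, e2) if (e1 or e2) else 0.0
        print(f"{name:20s} {m1:14.6f} vs {m2:14.6f}  ({nsigma:.1f} sigma)")
```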

djstaros commented 1 year ago

@jtkrogel The VMC/DMC energies from the legacy Perlmutter run and the batched Perlmutter run using @ye-luo's suggested fix in the DMC block (crowd_serialize_walkers = yes) are as follows:

Legacy Perlmutter, real CPU build:

                        LocalEnergy                  Variance                 ratio
QP2QMCPACK  series 0    -242.017522 +/- 0.003111     3.367914 +/- 0.036474    0.0139
QP2QMCPACK  series 1    -242.305246 +/- 0.000605     3.496860 +/- 0.014910    0.0144

Batched Perlmutter, real CPU build, with serialized walkers turned on for DMC only:

                        LocalEnergy                  Variance                 ratio
QP2QMCPACK  series 0    -242.022616 +/- 0.000735     3.394342 +/- 0.014140    0.0140
QP2QMCPACK  series 1    -242.306972 +/- 0.000530     3.486293 +/- 0.005918    0.0144

As both the VMC and DMC energies are within error bars of one another, it seems that the local energy does not differ between the legacy and the batched/serialized code.

jtkrogel commented 1 year ago

Thanks @djstaros. Please also post the scalar.dat and dmc.dat files for the runs described above.

djstaros commented 1 year ago

@jtkrogel The legacy Perlmutter dmc.dat and scalar.dat are already posted above in Outputs/b_dmc_perl_legacy. I am attaching the "serialized" batched Perlmutter dmc.dat and scalar.dat files here.

Third output d_dmc_perl_batch_ser.zip

jtkrogel commented 1 year ago

Running with crowd_serialize_walkers=yes appears to fix the problem. It also introduces no slowdown in the CPU run.

@ye-luo what is the impact of this flag internally? Is there any reason "yes" is not the default for a CPU run?

ye-luo commented 1 year ago

crowd_serialize_walkers=yes forces the batched driver to call the single-walker APIs internally. On CPU, the performance difference should be minimal. My feeling is that there is a bug in the specialized multi-walker implementation.
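
Schematically, and leaving QMCPACK's actual C++ classes aside, the difference between the two paths is roughly the following toy Python sketch (none of these names are real QMCPACK APIs):

```python
# Schematic illustration only -- not QMCPACK's real classes or function names.
# crowd_serialize_walkers=yes makes each crowd loop over its walkers through the
# single-walker API; =no uses one fused multi-walker call per crowd, which is the
# specialized code path suspected of harboring the bug.

class ToyWavefunction:
    def evaluate(self, walker):          # single-walker API
        return sum(walker)
    def mw_evaluate(self, walkers):      # multi-walker API, fused over the crowd
        return [sum(w) for w in walkers]

def advance_crowd(psi, walkers, serialize_walkers):
    if serialize_walkers:
        return [psi.evaluate(w) for w in walkers]   # serialized path
    return psi.mw_evaluate(walkers)                 # batched path

print(advance_crowd(ToyWavefunction(), [[0.1, 0.2], [0.3, 0.4]], serialize_walkers=True))
```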

jtkrogel commented 1 year ago

Yes, agreed. The multi-walker implementation needs looking into. Beyond that, I propose we need a PR to set the default to "yes" for CPU runs.

If there is no performance gain, then a gain in safety is clearly a win that is well worth it. People will run into this problem by default if we don't change it.

ye-luo commented 1 year ago

> Yes, agreed. The multi-walker implementation needs looking into. Beyond that, I propose we need a PR to set the default to "yes" for CPU runs.
>
> If there is no performance gain, then a gain in safety is clearly a win that is well worth it. People will run into this problem by default if we don't change it.

There are actual benefits to not serializing walkers, although they depend on the actual simulation. I don't feel good about forcing walker serialization by default.

jtkrogel commented 1 year ago

How about we make it the default at least until the other code path is fixed?

I see no disadvantage to this at all, and there is currently real research time being wasted due to this issue.

prckent commented 1 year ago

Happily, experience says that bugs like this don't take long to fix once they are diagnosed.

Something to think about is what aspect of these runs is not captured by our current testing, e.g. we do have serialized vs. non-serialized DMC tests that were designed to catch this very issue.

prckent commented 1 year ago

The population growth around the time of the anomaly indicates a stuck walker, or at least that we are able to generate an anomalously low-energy and slow-moving walker. The runs nearly recover, but clearly should not be used.

One reassuring result is the agreement of the legacy and batched but serialized codes in DMC.

If we believe that the crowd_serialize_walkers=yes and =no runs use exactly the same move-generation algorithm, then either the =no (multi-walker API) case has bad numerics that can generate an unlikely configuration, or there is an occasional bug in the energy calculations in the multi-walker case.

Any disagreement or alternatives to this logic?
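
As a quick cross-check of the stuck-walker picture, here is a minimal sketch that lines up the walker population against the energy in dmc.dat; it assumes the usual '#' header line, and the column names used (e.g. "NumOfWalkers") should be verified against the actual file.

```python
# Sketch: look for population changes coinciding with energy dips in a QMCPACK dmc.dat file.
# Column names are assumptions; check the '#' header of the actual dmc.dat.
import numpy as np

fname = "dmc.s001.dmc.dat"  # placeholder name
with open(fname) as f:
    names = f.readline().lstrip("#").split()
data = np.loadtxt(fname)

energy = data[:, names.index("LocalEnergy")]
pop = data[:, names.index("NumOfWalkers")]

# Steps where the energy falls well below its median, and the population there.
suspect = np.where(energy < np.median(energy) - 1.0)[0]  # 1 Ha threshold, illustrative
for i in suspect:
    print(f"step {i}: E = {energy[i]:.3f} Ha, walkers = {pop[i]:.0f}")
```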

prckent commented 2 months ago

Commenting to remind ourselves that this was never fully solved. Keeping crowd_serialize_walkers on is not a long-term strategy (for performance reasons) and could just be hiding the problem, given the limited statistics here...

ye-luo commented 2 months ago

It seems to me this was caused by the broken T-move fixed in https://github.com/QMCPACK/qmcpack/pull/4902. @djstaros, will you be able to rerun the reproducer using /soft/applications/qmcpack/develop-20240425 on Polaris?