QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org
Other
285 stars 135 forks source link

Driver-portable walker logging #5019

Closed jtkrogel closed 1 month ago

jtkrogel commented 1 month ago

Proposed changes

This PR introduces a simplified version of the legacy TraceManager that focuses solely on writing per-walker data to HDF5 throughout the run (no streaming/listener functionality). The new implementation, WalkerLogManager, is compatible with and implemented in both the legacy and batched drivers.

This implementation extends the original to include the ability to write full data for the highest, lowest, and median energy walkers on each rank to disk at every step. The intention of writing this abbreviated data is to enable inexpensive access to data from problematic walkers (e.g. population explosions) in production runs.

Capabilities

Below is an example generated from a short DMC run of diamond on 6 MPI ranks showing the energies of all walkers (blue), highest energy walkers (green), lowest energy walkers (red), median energy walkers (black), vs the mean energy from dmc.dat (magenta line):

walker_traces

Design

Tests

What type(s) of changes does this code introduce?

Does this introduce a breaking change?

What systems has this change been tested on?

Laptop

Checklist

Path out of WIP

Tasks for review round 3

Elements for reviewers to review

jtkrogel commented 1 month ago

NB: I will fix the failing unit test (complex estimators) in 30 min.

jtkrogel commented 1 month ago

Hold on merge for naming discussion for now. Will commit a naming change if the decision is made today.

PDoakORNL commented 1 month ago

There is a great deal to cover here. I'm just starting on the review. Please introduce one piece at a time from the most detailed up. I'd start with trace buffer and its unit tests. It needs a review on its own.

jtkrogel commented 1 month ago

Renaming applied. Ready to proceed.

PDoakORNL commented 1 month ago

That would be a big help to me as well. I have suggestions/changes needed for the lower level classes which would be better to address and test at that level. I.e. use of lower level hdf5 functions when there is already a way to easily pass containers.

Other issues relate the integration with existing higher level classes which are largely separate.

jtkrogel commented 1 month ago

See #5035

ye-luo commented 1 month ago

I have merged in develop and fixed cmake issues. Please pull before making further changes.

jtkrogel commented 1 month ago

I've addressed Peter's comment from #5035 requesting simplification of WalkerLogState.

jtkrogel commented 1 month ago

I've also tried Peter's suggestion to use f.writeEntry(buffer, "data") in WalkerLogBuffer::writeHDF. Making that replacement resulted in NaN's being written to the HDF files (picked up by integration test). Therefore I've kept the original implementation.

jtkrogel commented 1 month ago

Comments addressed. Ready to proceed.

ye-luo commented 1 month ago

I've also tried Peter's suggestion to use f.writeEntry(buffer, "data") in WalkerLogBuffer::writeHDF. Making that replacement resulted in NaN's being written to the HDF files (picked up by integration test). Therefore I've kept the original implementation.

You need h5d_append and we don't have a high-level API in hdf_archive right now. added issue #5038. WalkerLogBuffer::appendHDF can be a better name to avoid confusion.

ye-luo commented 1 month ago

Test this please

jtkrogel commented 1 month ago

Hi guys. I had thought this was going in given Ye's approval.

If there really is a short list of "must have" items still remaining for this PR after all these passes, please make a short, final list below and I will address them.

I need to get past this point and get back to other work, some of which depends on this code being in place.

PDoakORNL commented 1 month ago

Reverting the whitespace OperatorBase.cpp change to reduce the number files is the only must have.

ye-luo commented 1 month ago

Test this please