QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org

Better restart capability for interrupted runs #5232

Open camelto2 opened 4 hours ago

camelto2 commented 4 hours ago

Is your feature request related to a problem? Please describe.

It looks like the code already has some batched restart capability, enabled by the .cont.xml file and the .config.h5 files. The cont.xml file is basically a copy of the original input file, except that it now includes an mcwalkerset for restart. The cont.xml is also written only for the last series, and only if the run actually finishes.

As currently written, the restarts and cont.xml look designed for adding more statistics to a fully completed run. If a run hits the wallclock limit, the cont.xml file is never written, so restarting and continuing from the current series isn't straightforward.

Describe the solution you'd like

Instead of only writing cont.xml at the end, and also having the cont.xml file include an exact copy of all the VMC/DMC runs, it would be nice if each series wrote its own cont.xml at the beginning and only included the driver that the series corresponded to. That would enable both adding more data to each series if you need it for better statistics, and it would enable restarting if the run is interrupted by wallclock limits.

For example, I tend to have:

< vmc > (s000)
< dmc tstep1 > (s001)
< dmc tstep2 > (s002)
< dmc tstep3 > (s003)
< dmc tstep4 > (s004)

where vmc is a fully converged VMC run, tstep1 is a large timestep for equilibration, and tsteps 2-4 are subsequently smaller timesteps used for extrapolation.

At the start of each driver, it could write a cont.xml for that series: s000.cont.xml would have the corresponding < mcwalkerset fileroot="s000" > and ONLY the < vmc > section in it. The s001.cont.xml would be written once we start the first DMC, and it would have the < mcwalkerset fileroot="s001" > and only the < dmc tstep1 > driver in it. And so on and so forth.

This way each series would have a *.cont.xml file which only continues its own driver from its current walkerset. My current issue is that I had a run that finished the VMC and series 001, 002, and 003, but s004 only got through 2-3 blocks before hitting the wallclock limit. If the s004.cont.xml had been written appropriately, I would have a file to continue just that series from. As it currently stands, I had to do a lot of scripting to get what I want.

Also, if we have a run where all of them finished successfully, but we need to add more statistics, you could just restart each of them and they would continue on from their own respective walkersets.
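The per-series continuation files described above could be sketched as follows. This is only an illustration, not how QMCPACK writes cont.xml: the function name make_series_cont is hypothetical, it assumes each driver appears in the input as a top-level < qmc > element, and it uses the shorthand fileroot from this issue ("s000", "s001", ...) rather than whatever file root a real run would produce.

```python
import copy
import xml.etree.ElementTree as ET


def make_series_cont(input_xml: str, series: int) -> str:
    """Return a continuation input holding only the driver for `series`.

    Assumption: drivers are top-level <qmc> elements, in series order.
    Assumption: the walkerset fileroot is the shorthand "sNNN" from the issue.
    """
    root = ET.fromstring(input_xml)
    drivers = [el for el in root if el.tag == "qmc"]
    keep = drivers[series]  # IndexError if the series does not exist

    out = ET.Element(root.tag, root.attrib)
    for el in root:
        if el.tag != "qmc":
            # Non-driver sections (project, system, wavefunction, ...) are copied as-is.
            out.append(copy.deepcopy(el))
        elif el is keep:
            # Place the walkerset directly before the one driver being continued.
            ET.SubElement(out, "mcwalkerset", {"fileroot": f"s{series:03d}"})
            out.append(copy.deepcopy(el))
        # All other drivers are dropped.
    return ET.tostring(out, encoding="unicode")
```

Calling make_series_cont(text, 4) on the five-driver input above would yield a file containing the shared sections, one mcwalkerset with fileroot="s004", and only the < dmc tstep4 > driver.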

Describe alternatives you've considered

Maybe Nexus could do something like this as well.


ye-luo commented 3 hours ago

I like this direction. Some cleanup is definitely needed. We need to define what purposes the continuation file serves: 1) running more statistics; 2) continuing a QMCPACK run by rerunning the incomplete series. I feel the current cont.xml serves more like this failure recovery mode.

I see one issue in the proposed scheme:

The s001.cont.xml would be written once we start the first DMC, and it would have the < mcwalkerset fileroot="s001" > and only the < dmc tstep1 > driver in it. And so on and so forth.

If the DMC run gets killed, we won't have any RNG seed file or configuration file, so there is no way to continue. Thus cont.xml should be written when a series completes, not at the beginning.

camelto2 commented 3 hours ago

The case I currently care about is if one of the series gets killed by wallclock time. If I'm understanding you correctly, you need both the random.h5 and the config.h5 to properly continue a run. So if s004 got killed by wallclock, there isn't a clean way to pick up where I left off on that series?

jtkrogel commented 2 hours ago

While we are wishing, I would like for it to be even simpler: just modify the original input file by setting a single parameter

<parameter name="restart_at_series"> 2 </parameter>

QMCPACK would simply know which files to look for based on this request.

This is similar in spirit to the ease of use offered by Quantum Espresso, where one just states restart = .true..
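As a rough sketch of what "knowing which files to look for" might involve, the helper below lists the checkpoint files for a given series and reports whether a restart from them looks possible. The restart_at_series parameter is only a proposal in this thread, the function names are hypothetical, and the naming pattern <project id>.sNNN.config.h5 / .random.h5 is an assumption based on the file types mentioned above.

```python
from pathlib import Path


def restart_files(run_dir, project_id: str, series: int) -> dict:
    """Candidate checkpoint files for one series (assumed naming pattern)."""
    stem = f"{project_id}.s{series:03d}"
    base = Path(run_dir)
    return {
        "config": base / f"{stem}.config.h5",  # walker configurations
        "random": base / f"{stem}.random.h5",  # RNG state
    }


def can_restart(run_dir, project_id: str, series: int) -> bool:
    """True only if both the walker and RNG files exist for this series."""
    return all(p.is_file() for p in restart_files(run_dir, project_id, series).values())
```

A series that was killed before writing either file would report False here, which matches the concern raised above about series interrupted by the wallclock limit.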