deepmodeling / dpgen

The deep potential generator to generate a deep-learning based model of interatomic potential energy and force field
https://docs.deepmodeling.com/projects/dpgen/
GNU Lesser General Public License v3.0

I am using dpgen to train an ensemble of 4 models for 3,000,000 steps, but the HPC cluster queue only gives me 2 days per job. How can I restart the training from the last step, 1,000,000? #1645

Open marcog2020460 opened 5 days ago

marcog2020460 commented 5 days ago

Summary

I am using dpgen to train an ensemble of 4 models for 3,000,000 steps ("stop_batch": 3000000), but the HPC cluster queues have a time limit of two days per run. How can I restart the training from the last completed step, 1,000,000, in order to finish the remaining 2,000,000 steps? I do not want the training to start from zero again.

-------------------------iter.000000 task 03-------------------------- : -------------------------iter.000000 task 04--

Please help me. I searched the internet for answers before submitting this request. How should I modify the param.json file?

DP-GEN Version

v0.12.0

Platform, Python Version, etc

Slurm HPC cluster

Details

"training": {
    "_set_prefix": "set",
    "stop_batch": 3000000,
    "_batch_size": "auto",
    "disp_file": "lcurve.out",
    "disp_freq": 1000,
    "numb_test": "5%",
    "save_freq": 1000,
    "save_ckpt": "model.ckpt",
    "disp_training": true,
    "time_training": true,
    "profiling": false,
    "profiling_file": "timeline.json",
    "_comment": "that's all"
}
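To check how far a run got before the queue killed it, the learning-curve file named by "disp_file" can be inspected. The following is a minimal sketch, assuming that lcurve.out is whitespace-separated, that header lines start with `#`, and that the first column is the step number. The helper names `last_logged_step` and `remaining_steps` are hypothetical and not part of dpgen or DeePMD-kit.

```python
from pathlib import Path


def last_logged_step(lcurve_path):
    """Return the step on the last data line of the file, or None if there is none.

    Assumption: comment/header lines start with '#', and the first
    whitespace-separated column of every data line is the training step.
    """
    last = None
    for line in Path(lcurve_path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        last = int(float(line.split()[0]))
    return last


def remaining_steps(lcurve_path, stop_batch):
    """Return how many steps are left to reach stop_batch (never negative)."""
    done = last_logged_step(lcurve_path) or 0
    return max(stop_batch - done, 0)
```

With "disp_freq": 1000, the logged step is only accurate to within 1,000 steps. For example, `remaining_steps("lcurve.out", 3000000)` could be called inside a training task directory to see what is left.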

njzjz commented 4 days ago

Restarting is supported by default.
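As an illustration of what a default restart can look like, the sketch below picks a training command based on whether a checkpoint already exists in the task directory. It is not DP-GEN's actual code. It assumes that the checkpoint prefix is the "save_ckpt" value from the issue ("model.ckpt"), that a saved checkpoint leaves a `model.ckpt.index` file, and that the training input is called `input.json`. The function name `training_command` is hypothetical. `dp train --restart` is the DeePMD-kit option for continuing from a checkpoint.

```python
from pathlib import Path


def training_command(work_dir, input_file="input.json", ckpt_prefix="model.ckpt"):
    """Build a 'dp train' command that restarts if a checkpoint is present.

    Assumption: a saved checkpoint leaves '<ckpt_prefix>.index' in work_dir.
    """
    cmd = ["dp", "train", input_file]
    if (Path(work_dir) / f"{ckpt_prefix}.index").exists():
        # A checkpoint exists, so continue from it instead of starting at step 0.
        cmd += ["--restart", ckpt_prefix]
    return cmd
```

Under these assumptions, a job resubmitted after the two-day limit would continue from the most recent checkpoint (written every "save_freq" steps) with no change to param.json.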