How can I restart dpgen training from the last saved step (1,000,000 of 3,000,000) when the HPC cluster queue limits each run to 2 days? #1645
Summary
I am using dpgen to train an ensemble of 4 models for 3,000,000 steps ("stop_batch": 3000000), but the queues on my HPC cluster limit every run to two days. How can I restart the training from the last completed step (1,000,000) so that it finishes the remaining 2,000,000 steps? I do not want the training to start from zero again.
-------------------------iter.000000 task 03-------------------------- : -------------------------iter.000000 task 04--
Please help. I searched the internet for answers before submitting this request. How should I modify the param.json file?
DP-GEN Version
v0.12.0
Platform, Python Version, etc
Slurm HPC cluster
Details
"training": {
    "_set_prefix": "set",
    "stop_batch": 3000000,
    "_batch_size": "auto",
    "disp_file": "lcurve.out",
    "disp_freq": 1000,
    "numb_test": "5%",
    "save_freq": 1000,
    "save_ckpt": "model.ckpt",
    "disp_training": true,
    "time_training": true,
    "profiling": false,
    "profiling_file": "timeline.json",
    "_comment": "that's all"
}
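
To make the question concrete, here is a minimal sketch of what I am hoping to do in each training task directory. It is not taken from dpgen itself. It assumes that the first column of `lcurve.out` is the step number, that the checkpoint named by `save_ckpt` leaves a `model.ckpt.index` file next to it, that the training input file is called `input.json`, and that the installed DeePMD-kit accepts `dp train input.json --restart model.ckpt`. The helper names `last_logged_step` and `build_train_command` are hypothetical.

```python
import os
import tempfile


def last_logged_step(lcurve_path):
    """Return the step number on the last data line of lcurve.out, or None."""
    if not os.path.isfile(lcurve_path):
        return None
    last = None
    with open(lcurve_path) as fh:
        for line in fh:
            line = line.strip()
            # Skip blank lines and '#' header lines.
            if not line or line.startswith("#"):
                continue
            last = line
    if last is None:
        return None
    # Assumption: the first whitespace-separated column is the step number.
    return int(last.split()[0])


def build_train_command(task_dir, input_file="input.json", ckpt="model.ckpt"):
    """Build the training command; add --restart only if a checkpoint exists."""
    cmd = ["dp", "train", input_file]
    # Assumption: a saved checkpoint leaves '<ckpt>.index' in the task directory.
    if os.path.isfile(os.path.join(task_dir, ckpt + ".index")):
        cmd += ["--restart", ckpt]
    return cmd


if __name__ == "__main__":
    # Demo in a throwaway directory that stands in for one training task.
    task_dir = tempfile.mkdtemp()
    with open(os.path.join(task_dir, "lcurve.out"), "w") as fh:
        fh.write("# step  loss\n      0  2.0e+01\n1000000  1.5e-02\n")
    open(os.path.join(task_dir, "model.ckpt.index"), "w").close()

    stop_batch = 3000000  # value from the "training" block above
    done = last_logged_step(os.path.join(task_dir, "lcurve.out"))
    print("steps done:", done, "remaining:", stop_batch - done)
    print(" ".join(build_train_command(task_dir)))
```

If a check like this ran at the start of every two-day job, the first job would train from scratch and later jobs would continue from the saved checkpoint. What I do not know is whether dpgen v0.12.0 can do this by itself through param.json, or whether I have to resubmit the training tasks by hand.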