deepmodeling / dpgen

The deep potential generator to generate a deep-learning based model of interatomic potential energy and force field
https://docs.deepmodeling.com/projects/dpgen/
GNU Lesser General Public License v3.0
304 stars, 174 forks

undefined key `load_ckpt` is not allowed in strict mode or can't start the MD step #701

Closed Joey-zhangcy closed 2 years ago

Joey-zhangcy commented 2 years ago

Dear DP users, I'm new to deep modeling. When I followed the CH4 case on the website, I ran into some problems. When the keyword load_ckpt is in my param.json, dp train stops and reports "undefined key **load_ckpt is not allowed in strict mode**". When I delete load_ckpt and restart the training, the first dp training step finishes, but no model_devi file is produced. I guess LAMMPS can't load the potential from the first dp train step, but I can't figure out why. Could someone help?

Any suggestions or comments will be much appreciated.

Thanks a lot.

All the best, Joey train (2).log

njzjz commented 2 years ago

The model_devi file should be generated from the second step.

Joey-zhangcy commented 2 years ago

Dear njzjz: Many thanks for your kind reply. It was negligent of me not to describe the problem carefully. I ran dpgen run rather than dp train input.json, so the model_devi file should be generated. Here is my param.json file. Could you please give me some advice? I would much appreciate it.

Thanks a lot.

All the best, Joey param.zip

taipinghu commented 2 years ago

"but there is no such model_devi file." This is because the definition of "sys_configs" is not correct. You should change the JSON file like this: "sys_configs_prefix": "/public/home/zhangchengyi/lammps-practice/tutorials/tutorials-master/EXAMPLES/dpgen_cloudserver/CH4.POSCAR.01x01x01/01.scale_pert/sys-0004-0001/scale-1.000/", "sys_configs": [ [ "00000/POSCAR" ], [ "00001/POSCAR" ] ],

Joey-zhangcy commented 2 years ago

Dear taipinghu: Many thanks for your advice, but that change doesn't seem to work. I still appreciate your help.

Thanks a lot.

All the best, Joey

taipinghu commented 2 years ago

I think it would be more efficient to upload all your input files so the error can be tracked down.

Joey-zhangcy commented 2 years ago

Dear taipinghu: Many thanks for your help. Here are all my input files; all the POSCARs are taken from the examples in dpgen. I would appreciate it if you could take the time to check them.

Thanks a lot.

All the best, Joey all.zip

taipinghu commented 2 years ago

What is the error in dpgen run?

Joey-zhangcy commented 2 years ago

Dear taipinghu: Here are my output files; it seems that I can't upload the folder itself. I set the number of training steps to 2000 in param.json. Judging by lcurve.out, the first training step seems to finish, and then the run stops. Thanks a lot.

All the best, Joey output.zip

taipinghu commented 2 years ago

Is the error caused by dpdispatcher? You can change to the work directory and check whether it works normally.

Joey-zhangcy commented 2 years ago

Dear taipinghu: The dpdispatcher files seem to be generated by dpgen. I followed your advice and moved the files to another folder, but dpgen still doesn't work. Thanks a lot.

All the best, Joey

Joey-zhangcy commented 2 years ago

Dear taipinghu: I found something interesting. I retyped dpgen run param.json machine.json in the terminal. The code reran and the model_devi directory appeared, but an error is reported. Could you give me some advice? err.txt Thanks a lot.

All the best, Joey

taipinghu commented 2 years ago

I think you should first check whether the path (sys_configs_prefix and sys_configs in the parameter JSON file) is correct. You can go to the 01.model_devi directory to see whether directories like task.000.00000 exist.

Joey-zhangcy commented 2 years ago

Dear taipinghu: I walked through the path (sys_configs in the parameter JSON file) with the cd command, and nothing was wrong. The 01.model_devi folder contains only four .pb files, a cur_job.json, and an empty folder named confs. There is no directory like task.000.00000. Thanks a lot.

All the best, Joey

taipinghu commented 2 years ago

Please check carefully again; I still think the path in sys_configs is incorrect. You can write a simple script to read the param.json file and print sys_configs.
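A minimal sketch of such a check, in Python. It assumes param.json is in the current directory and that each sys_configs entry is a path or glob pattern relative to sys_configs_prefix; the helper names resolve_sys_configs and report are made up for illustration and are not part of dpgen.

```python
import glob
import json
import os


def resolve_sys_configs(param):
    """Return, for each system, the files matched by its sys_configs patterns."""
    prefix = param.get("sys_configs_prefix", "")
    resolved = []
    for patterns in param["sys_configs"]:
        matches = []
        for pattern in patterns:
            # Note: os.path.join drops the prefix if the pattern is an absolute path.
            matches.extend(sorted(glob.glob(os.path.join(prefix, pattern))))
        resolved.append(matches)
    return resolved


def report(param_path="param.json"):
    """Print every resolved configuration file and flag systems with no match."""
    with open(param_path) as f:
        param = json.load(f)
    for idx, matches in enumerate(resolve_sys_configs(param)):
        status = "OK" if matches else "NO FILES FOUND"
        print(f"system {idx}: {len(matches)} file(s) [{status}]")
        for path in matches:
            print("   ", path)


if __name__ == "__main__":
    # Only run the report when a param.json is actually present.
    if os.path.exists("param.json"):
        report("param.json")
```

A system reported with no files would be consistent with an empty confs folder in 01.model_devi.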

Joey-zhangcy commented 2 years ago

Dear taipinghu: Thanks for your advice. There is a data.init folder generated automatically in iter.000000, and all the sys_configs files are listed inside it. I thought that if the path in sys_configs were incorrect, dpgen could not find these files, so they couldn't be listed there?

Thanks a lot.

All the best, Joey

taipinghu commented 2 years ago

(1) The data.init folder comes from init_data_prefix and init_data_sys in the param file, not from sys_configs_prefix and sys_configs. (2) As mentioned above, you found an empty confs folder in 01.model_devi. This confs directory stores the LAMMPS lmp-format files, which are converted from the POSCAR files located at os.path.join(sys_configs_prefix, sys_configs).

Joey-zhangcy commented 2 years ago

Dear taipinghu: I really appreciate your help. No matter how I changed the path style for the original files, it didn't work. When I downloaded the sys_configs input files from the Internet and changed the path, it worked. By the way, may I ask one more question? Every time I run nohup dpgen run param.json machine.json, it stops after one step, and I need to retype the command in the terminal before the next step runs. Do you have any idea about this problem? Thanks a lot.

All the best, Joey

taipinghu commented 2 years ago

As for your question, you should know that the dpgen workflow contains three stages, i.e. 00.train, 01.model_devi, and 02.fp. Each stage in turn contains three steps, e.g. make_train, run_train, and post_train. You can read the record.dpgen file to find the current step, which will help you track down the error.

dpgen runs the above steps automatically unless the machine.json file is written incorrectly (the correct settings depend on your scheduling system).
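A rough sketch of reading record.dpgen, in Python. It assumes each line holds two integers, the iteration index and the index of the last finished task, and that the tasks are numbered 0 to 8 in the order of the three stages named above. The task names beyond make_train, run_train, and post_train, and the helper last_finished, are assumptions for illustration and should be checked against your dpgen version.

```python
import os

# Assumed order of the nine tasks within one iteration (three per stage).
TASK_NAMES = [
    "make_train", "run_train", "post_train",
    "make_model_devi", "run_model_devi", "post_model_devi",
    "make_fp", "run_fp", "post_fp",
]


def last_finished(record_text):
    """Return (iteration, task_name) from the last non-empty line, or None."""
    rows = [line.split() for line in record_text.splitlines() if line.strip()]
    if not rows:
        return None
    iter_idx, task_idx = (int(x) for x in rows[-1][:2])
    return iter_idx, TASK_NAMES[task_idx]


if __name__ == "__main__":
    # Only read the record when it exists in the current directory.
    if os.path.exists("record.dpgen"):
        with open("record.dpgen") as f:
            print(last_finished(f.read()))
```

If the last recorded task is post_train and nothing follows it, the run stopped before make_model_devi, which is the symptom described in this thread.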

Joey-zhangcy commented 2 years ago

Dear taipinghu: Thank you very much for your help. I will adjust the parameters carefully. Thanks a lot.

All the best, Joey

AnguseZhang commented 2 years ago

It seems that this problem has been solved, so I'll close this issue. If you still have questions, you can reopen this issue or create a new one.