Closed hongmoxian closed 3 years ago
Updated by myself a few moments later...
I get the error "~/.conda/envs/deepmd-1.2.2/bin/dp_train: No such file or directory" in the train.log file when I run "dpgen run *.json", and I can't find the related setting in the machine.json file. Only the "dp" command exists in this path. Thanks!
I've found a workaround. You have probably followed the example in the GitHub documentation, but that documentation is not well maintained, so the example is out of date. A recent issue may help you fix this problem.
BTW, here is my machine.json, which worked. The problem turned out to be a missing "command" keyword:
`{
  "train": [
    {
      "machine": {
        "batch": "slurm",
        "work_path": "/public/home/hfcas_user39/cjding/test-dpgen/dpgen_out/train"
      },
      "resources": {
        "numb_gpu": 0,
        "numb_node": 1,
        "task_per_node": 4,
        "name": "dp",
        "partition": "cpu",
        "exclude_list": [],
        "source_list": [],
        "module_list": [],
        "time_limit": "23:0:0"
      },
      "command": "dp",
      "group_size": 2
    }
  ],
  "model_devi": [
    {
      "machine": {
        "batch": "slurm",
        "work_path": "/public/home/hfcas_user39/cjding/test-dpgen/dpgen_out/model_devi"
      },
      "resources": {
        "numb_gpu": 0,
        "task_per_node": 4,
        "partition": "cpu",
        "name": "lmp",
        "exclude_list": [],
        "source_list": [],
        "module_list": [],
        "time_limit": "23:0:0"
      },
      "command": "lmp",
      "group_size": 2
    }
  ],
  "fp": [
    {
      "machine": {
        "batch": "slurm",
        "work_path": "/public/home/hfcas_user39/cjding/test-dpgen/dpgen_out/model_devi"
      },
      "resources": {
        "numb_gpu": 0,
        "task_per_node": 4,
        "numb_node": 1,
        "with_mpi": false,
        "name": "fp",
        "exclude_list": [],
        "source_list": ["module purge"],
        "module_list": [
          "compiler/intel/2017.5.239",
          "mpi/hpcx/2.4.1/intel-2017.5.239"
        ],
        "time_limit": "12:00:0",
        "partition": "cpu",
        "_comment": "that's All"
      },
      "command": "srun --mpi=pmix_v3 /public/software/apps/vtst/5.4.4/hpcx-2.4.1-intel2017/vasp_std",
      "group_size": 1
    }
  ]
}`
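Since the missing "command" keyword was the cause here, a quick check before launching may save a failed run. The following is only a minimal sketch, not part of dpgen: the function name `find_missing_command` is hypothetical, and it assumes the file uses the three list-valued stages shown above ("train", "model_devi", "fp").

```python
import json

# Stage names taken from the machine.json shown above.
STAGES = ("train", "model_devi", "fp")


def find_missing_command(machine: dict) -> list:
    """Return labels such as 'train[0]' for stage entries that have no "command"."""
    missing = []
    for stage in STAGES:
        for index, entry in enumerate(machine.get(stage, [])):
            # An absent or empty "command" is treated as missing.
            if not entry.get("command"):
                missing.append(f"{stage}[{index}]")
    return missing


# Shortened illustrative input: the "train" entry lacks "command".
example = json.loads("""
{
  "train": [{"machine": {"batch": "slurm"}, "group_size": 2}],
  "model_devi": [{"machine": {"batch": "slurm"}, "command": "lmp", "group_size": 2}],
  "fp": [{"machine": {"batch": "slurm"}, "command": "vasp_std", "group_size": 1}]
}
""")

print(find_missing_command(example))  # ['train[0]']
```

To check a real file, load it with `json.load(open("machine.json"))` and pass the result to the function; an empty list means every stage entry names a command.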
Do not submit your task through a traditional SLURM script; run "dpgen run" directly instead.
You are using an old version of machine.json, which is incompatible. Delete "deepmd_path" from machine.json. You can refer to https://github.com/AnguseZhang/dpgen/blob/devel/examples/machine/DeePMD-kit-1.x/machine-slurm-qe.json and see the explanations there.
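If you would rather not edit the file by hand, the removal could be sketched roughly as below. This is an illustration under assumptions, not a dpgen utility: `strip_key` is a hypothetical helper, the path value is made up, and because the exact position of "deepmd_path" in an old machine.json is not shown in this thread, the helper simply removes the key at any nesting level.

```python
def strip_key(node, key="deepmd_path"):
    """Return a copy of a parsed JSON structure with every occurrence of `key` removed."""
    if isinstance(node, dict):
        # Drop the unwanted key and recurse into the remaining values.
        return {k: strip_key(v, key) for k, v in node.items() if k != key}
    if isinstance(node, list):
        return [strip_key(item, key) for item in node]
    return node  # strings, numbers, booleans and None are returned unchanged


# Hypothetical old-style entry; the path is a placeholder, not a real location.
old = {
    "train": [
        {
            "machine": {"batch": "slurm"},
            "deepmd_path": "/hypothetical/path/to/deepmd",
            "group_size": 2,
        }
    ]
}

new = strip_key(old)
print(new)  # {'train': [{'machine': {'batch': 'slurm'}, 'group_size': 2}]}
```

The helper returns a new structure and leaves the input untouched, so you can compare both before writing anything back with `json.dump`. Removing the key does not add the "command" entry; that still has to be set for each stage.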
You should also use a compatible param.json. Before following an example, check which version it targets. For DeePMD-kit >= 1.0, you can refer to https://github.com/AnguseZhang/dpgen/blob/devel/examples/run/dp1.x-lammps-vasp/CH4/param_CH4_deepmd-kit-1.1.0.json