hongmoxian commented 3 years ago

I get the error "~/.conda/envs/deepmd-1.2.2/bin/dp_train: No such file or directory" in the train.log file when I run "dpgen run *.json", and I can't find the related setting in the machine.json file. Only the command "dp" exists in this path. Thanks!
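One way to confirm which of the two entry points the active conda environment actually provides is to look them up on `PATH`. The snippet below is only a minimal sketch using the standard library; the helper name `locate_commands` is made up for illustration, and it should be run inside the same environment that dpgen uses.

```python
import shutil


def locate_commands(names):
    """Map each command name to its resolved path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


# "dp" and "dp_train" are the two executables mentioned in this report.
for name, path in locate_commands(["dp", "dp_train"]).items():
    print(f"{name}: {path or 'not found on PATH'}")
```

If only `dp` resolves, the environment does not ship a separate `dp_train` executable, which matches the error in train.log.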

DingChangjie commented 3 years ago

I also encountered this problem when using dpgen with deepmd-kit 1.3.3. However, the issue occurs only on HPC machines (Sugon Computing Cloud, 中科曙光计算云). With exactly the same installation, everything works on a single local compute node.

Update, a few moments later:


I've found a workaround. I think you followed the example in the GitHub documentation, but the documentation isn't well maintained, so that example is out of date. A recent issue may help you fix this problem. The cause turned out to be a missing "command" keyword. BTW, here is my machine.json, which worked:

```json
{
  "train": [
    {
      "machine": {
        "batch": "slurm",
        "work_path": "/public/home/hfcas_user39/cjding/test-dpgen/dpgen_out/train"
      },
      "resources": {
        "numb_gpu": 0,
        "numb_node": 1,
        "task_per_node": 4,
        "name": "dp",
        "partition": "cpu",
        "exclude_list": [],
        "source_list": [],
        "module_list": [],
        "time_limit": "23:0:0"
      },
      "command": "dp",
      "group_size": 2
    }
  ],
  "model_devi": [
    {
      "machine": {
        "batch": "slurm",
        "work_path": "/public/home/hfcas_user39/cjding/test-dpgen/dpgen_out/model_devi"
      },
      "resources": {
        "numb_gpu": 0,
        "task_per_node": 4,
        "partition": "cpu",
        "name": "lmp",
        "exclude_list": [],
        "source_list": [],
        "module_list": [],
        "time_limit": "23:0:0"
      },
      "command": "lmp",
      "group_size": 2
    }
  ],
  "fp": [
    {
      "machine": {
        "batch": "slurm",
        "work_path": "/public/home/hfcas_user39/cjding/test-dpgen/dpgen_out/model_devi"
      },
      "resources": {
        "numb_gpu": 0,
        "task_per_node": 4,
        "numb_node": 1,
        "with_mpi": false,
        "name": "fp",
        "exclude_list": [],
        "source_list": ["module purge"],
        "module_list": [
          "compiler/intel/2017.5.239",
          "mpi/hpcx/2.4.1/intel-2017.5.239"
        ],
        "time_limit": "12:00:0",
        "partition": "cpu",
        "_comment": "that's All"
      },
      "command": "srun --mpi=pmix_v3 /public/software/apps/vtst/5.4.4/hpcx-2.4.1-intel2017/vasp_std",
      "group_size": 1
    }
  ]
}
```

Don't submit your task with a traditional SLURM script; run "dpgen run" directly. dpgen will automatically create the scripts in your predefined $work_path, and you will then see the tasks in your HPC queue.

AnguseZhang commented 3 years ago

You are using an old version of machine.json, which is incompatible. Delete "deepmd_path" from machine.json. You can refer to https://github.com/AnguseZhang/dpgen/blob/devel/examples/machine/DeePMD-kit-1.x/machine-slurm-qe.json and see the explanations there.
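The two symptoms named in this thread (a leftover "deepmd_path" and a missing "command") can be spotted with a small script before launching a run. This is a rough sketch, not part of dpgen: the function name `check_machine_config` is hypothetical, it assumes the three top-level stages "train", "model_devi" and "fp" seen in the machine.json posted above, and it assumes both keys sit at the top level of each stage entry.

```python
# Stage names as they appear in the machine.json posted in this thread.
STAGES = ("train", "model_devi", "fp")


def check_machine_config(config):
    """Return a list of problems found in an already-parsed machine.json dict."""
    problems = []
    for stage in STAGES:
        entries = config.get(stage)
        if entries is None:
            problems.append(f"{stage}: stage is missing")
            continue
        if isinstance(entries, dict):
            # Tolerate a single entry that is not wrapped in a list.
            entries = [entries]
        for i, entry in enumerate(entries):
            if "deepmd_path" in entry:
                problems.append(f"{stage}[{i}]: contains old-style 'deepmd_path'")
            if not entry.get("command"):
                problems.append(f"{stage}[{i}]: missing 'command'")
    return problems


# Illustrative input only; the path below is a placeholder, not a real install.
old_style = {
    "train": [{"deepmd_path": "/hypothetical/deepmd/path", "group_size": 1}],
    "model_devi": [{"command": "lmp", "group_size": 2}],
    "fp": [{"command": "vasp_std", "group_size": 1}],
}
for problem in check_machine_config(old_style):
    print(problem)
```

For a real file, load it with `json.load` and pass the resulting dict to the function. An empty list means neither symptom was found; it does not prove that the file is otherwise valid for your dpgen version.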

AnguseZhang commented 3 years ago

You should also use a compatible param.json. Before following an example, check which version it targets. For DeePMD-kit >= 1.0, you can refer to https://github.com/AnguseZhang/dpgen/blob/devel/examples/run/dp1.x-lammps-vasp/CH4/param_CH4_deepmd-kit-1.1.0.json

deepmodeling / dpgen

Incompatible versions of examples. #372
