deepmodeling / dpdispatcher

generate HPC scheduler systems jobs input scripts and submit these scripts to HPC systems and poke until they finish
https://docs.deepmodeling.com/projects/dpdispatcher/
GNU Lesser General Public License v3.0

About machine.json using lsf #378

Open DM0815 opened 1 year ago

DM0815 commented 1 year ago

When I use the LSF queue system to run dpgen on the login node of a server cluster, the command fails after submission with "RuntimeError: Meet errors will handle unexpected submission state." and suggests that I look at remote_root. But there is no error information in the work dir. In the dp task dir the jobs are still running and train.log looks fine, and I can see the jobs in the queue system. I don't know what is wrong. Can you give me some hints? The machine.json and the error information are attached.

machine.json:

{
  "api_version": "1.0",
  "_deepmd_version": "2.1.0",
  "train": {
    "command": "dp",
    "machine": {
      "batch_type": "LSF",
      "context_type": "local",
      "local_root": "./",
      "remote_root": "/public/home/dmeng/DPGEN/0316testlsf/tmp"
    },
    "resources": {
      "number_node": 1,
      "cpu_per_node": 8,
      "gpu_per_node": 0,
      "queue_name": "normal",
      "group_size": 2,
      "_batch_type": "LSF",
      "_kwargs": {},
      "source_list": ["/public/home/dmeng/anaconda3/bin/activate deepmd"]
    }
  },
  "model_devi": {
    "command": "lmp -i input.lammps -v restart 0",
    "machine": {
      "batch_type": "LSF",
      "context_type": "local",
      "local_root": "./",
      "remote_root": "/public/home/dmeng/DPGEN/0316testlsf/tmp"
    },
    "resources": {
      "number_node": 1,
      "cpu_per_node": 8,
      "gpu_per_node": 0,
      "queue_name": "normal",
      "group_size": 100,
      "_batch_type": "LSF",
      "_kwargs": {},
      "source_list": ["/public/home/dmeng/anaconda3/bin/activate deepmd"]
    }
  },
  "fp": {
    "command": "ulimit -s unlimited && mpirun -n 8 /public/home/dmeng/softwares/vasp.5.4/bin/vasp_std",
    "machine": {
      "batch_type": "LSF",
      "context_type": "local",
      "local_root": "./",
      "remote_root": "/public/home/dmeng/DPGEN/0316testlsf/tmp"
    },
    "resources": {
      "number_node": 1,
      "cpu_per_node": 8,
      "gpu_per_node": 0,
      "queue_name": "normal",
      "group_size": 50,
      "_batch_type": "LSF",
      "_kwargs": {},
      "source_list": ["/public/softwares/intel/oneapi/setvars.sh"]
    }
  }
}
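Before looking at the scheduler, it may be worth ruling out a malformed config. The following is only a minimal sketch using the Python standard library. It does not call dpgen or dpdispatcher, and the function name `check_machine_config` and the list of checks are assumptions made for illustration, not the validation those tools actually perform.

```python
import json

# Stage and key names are taken from the machine.json posted in this issue.
STAGES = ("train", "model_devi", "fp")
REQUIRED_KEYS = ("command", "machine", "resources")


def check_machine_config(config):
    """Return a list of human-readable problems; empty if none were found.

    Hypothetical helper for illustration only, not part of dpgen or dpdispatcher.
    """
    problems = []
    for stage in STAGES:
        entry = config.get(stage)
        if entry is None:
            problems.append(f"missing stage: {stage}")
            continue
        # A stage may be a single dict (as in this issue) or a list of dicts.
        items = entry if isinstance(entry, list) else [entry]
        for item in items:
            for key in REQUIRED_KEYS:
                if key not in item:
                    problems.append(f"{stage}: missing key '{key}'")
            remote_root = item.get("machine", {}).get("remote_root", "")
            if not remote_root.startswith("/"):
                problems.append(f"{stage}: remote_root is not an absolute path")
    return problems


if __name__ == "__main__":
    with open("machine.json") as fh:
        for problem in check_machine_config(json.load(fh)):
            print(problem)
```

An empty result only means the file has the expected overall shape. It says nothing about whether the LSF submission itself behaves as expected.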

[4 attached images: 1678950833716, 1678951498610, 1678951517568, 1678951549987]

njzjz commented 1 year ago

[image]

Please provide the "above exception" mentioned in your error message. Thanks.

DM0815 commented 1 year ago

> [image]
>
> Please provide the "above exception" mentioned in your error message. Thanks.

I'm sorry for the late reply. [image]

12jscvb commented 11 months ago

Hi, I also encountered the same problem. Have you solved the error?

njzjz commented 10 months ago

> Hi, I also encountered the same problem. Have you solved the error?

At this time, we don't have access to any LSF node. If you have found what is wrong with dpdispatcher, feel free to contribute to the code.
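For anyone with access to an LSF cluster who wants to investigate, one thing worth checking is what `bjobs` actually prints for a submitted job, since any state check has to be derived from that output. Below is a rough sketch, assuming the default `bjobs` table layout with a `STAT` column. The helper names `parse_bjobs_stat` and `query_stat` are hypothetical, and this is not dpdispatcher's own code.

```python
import subprocess


def parse_bjobs_stat(bjobs_output):
    """Return the STAT field of the first job row, or None if it cannot be found.

    Assumes the default bjobs table: a header line containing STAT,
    followed by one row per job.
    """
    lines = [ln for ln in bjobs_output.splitlines() if ln.strip()]
    if len(lines) < 2:
        return None  # e.g. only a "job not found" message was printed
    header = lines[0].split()
    if "STAT" not in header:
        return None
    stat_index = header.index("STAT")
    fields = lines[1].split()
    if len(fields) <= stat_index:
        return None
    return fields[stat_index]


def query_stat(job_id):
    """Run `bjobs <job_id>` and return the parsed STAT (requires LSF on PATH)."""
    result = subprocess.run(
        ["bjobs", str(job_id)], capture_output=True, text=True, check=False
    )
    return parse_bjobs_stat(result.stdout)
```

If this returns `None` or a state other than the usual `PEND`, `RUN`, `DONE`, or `EXIT` for a job that is visibly running, attaching the raw `bjobs` output to this issue would probably help narrow things down.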