deepmodeling / dpdispatcher

generate HPC scheduler systems jobs input scripts and submit these scripts to HPC systems and poke until they finish
https://docs.deepmodeling.com/projects/dpdispatcher/
GNU Lesser General Public License v3.0

List index out of range when running pbs.py #445

Closed NKJunhongLi closed 5 months ago

NKJunhongLi commented 5 months ago

The PBS system version I am using is:

Version: 6.1.1.1 Commit: 22f28343b8ee83b1234479b20224353f6c2db317

I have already written a machine.json file for it (machine_own.json). When I run dpgen init_bulk init.json machine_own.json, I get the following error:

Traceback (most recent call last):
  File "/home/lijh/anaconda3/envs/deepmd/bin/dpgen", line 8, in <module>
    sys.exit(main())
  File "/home/lijh/anaconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/main.py", line 255, in main
    args.func(args)
  File "/home/lijh/anaconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/data/gen.py", line 1549, in gen_init_bulk
    run_abacus_relax(jdata, mdata)
  File "/home/lijh/anaconda3/envs/deepmd/lib/python3.10/site-packages/dpgen/data/gen.py", line 1327, in run_abacus_relax
    submission.run_submission()
  File "/home/lijh/anaconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 221, in run_submission
    self.update_submission_state()
  File "/home/lijh/anaconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 345, in update_submission_state
    job.get_job_state()
  File "/home/lijh/anaconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/submission.py", line 831, in get_job_state
    job_state = self.machine.check_status(self)
  File "/home/lijh/anaconda3/envs/deepmd/lib/python3.10/site-packages/dpdispatcher/machines/pbs.py", line 92, in check_status
    status_word = status_line.split()[-2]
IndexError: list index out of range

The output of qstat -x jobID on my system looks like this (taking another job, 506995.mu01, as an example):

<?xml version="1.0"?>

506995.mu01DeePMD_example_CH4lijh@mu01Rgold5120mu01u1711448228mu01:/home/lijh/DeePMD/CH4/01.train/intra14_inter2/$PBS_JOBID.errcu13/0-27nnna1711448229mu01:/home/lijh/DeePMD/CH4/01.train/intra14_inter2/$PBS_JOBID.log01711448228True1:ppn=281152:00:00199441PBS_O_QUEUE=gold5120,PBS_O_HOME=/home/lijh,PBS_O_LOGNAME=lijh,PBS_O_PATH=/home/lijh/python3/bin:/home/lijh/abacus/bin:/home/lijh/local/x86_64-pc-linux-gnu/bin:/home/lijh/local/bin:/home/lijh/anaconda3/envs/deepmd/bin:/home/lijh/python3/bin:/home/lijh/abacus/bin:/home/lijh/local/x86_64-pc-linux-gnu/bin:/home/lijh/local/bin:/home/lijh/intel/oneapi/vtune/2024.0/bin64:/home/lijh/intel/oneapi/mpi/2021.11/opt/mpi/libfabric/bin:/home/lijh/intel/oneapi/mpi/2021.11/bin:/home/lijh/intel/oneapi/mkl/2024.0/bin:/home/lijh/intel/oneapi/itac/2022.0/bin:/home/lijh/intel/oneapi/inspector/2024.0/bin64:/home/lijh/intel/oneapi/dpcpp-ct/2024.0/bin:/home/lijh/intel/oneapi/dev-utilities/2024.0/bin:/home/lijh/intel/oneapi/debugger/2024.0/opt/debugger/bin:/home/lijh/intel/oneapi/compiler/2024.0/opt/oclfpga/bin:/home/lijh/intel/oneapi/compiler/2024.0/bin:/home/lijh/intel/oneapi/advisor/2024.0/bin64:/home/lijh/anaconda3/condabin:/usr/lib64/qt-3.3/bin:/home/lijh/perl5/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/opt/maui-3.3.1/bin:/opt/torque-6.1.1.1/bin:/opt/bin:/home/lijh/.local/bin:/home/lijh/bin,PBS_O_MAIL=/var/spool/mail/lijh,PBS_O_SHELL=/bin/bash,PBS_O_LANG=en_US.UTF-8,PBS_O_WORKDIR=/home/lijh/DeePMD/CH4/01.train/intra14_inter2,PBS_O_HOST=mu01,PBS_O_SERVER=mu01lijhlijhE1711448228run.sh171144822941471881False0mu01/home/lijh/DeePMD/CH4/01.train/intra14_inter21

Clearly, the job state is not at word[-2] of line[-2] in this output. How should I adjust the code in pbs.py to handle the qstat -x output of my PBS system?
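To illustrate the mismatch described above, here is a minimal sketch in Python. It assumes that qstat -x on this system prints XML with a job_state element. The sample string, the element names (Data, Job, Job_Id, job_state, queue) and both helper functions are hypothetical stand-ins, not code from dpdispatcher. The first helper mimics the positional lookup from the traceback. The second reads the state by element name instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily shortened stand-in for XML-style `qstat -x` output.
# The element names used here are assumptions for illustration only.
SAMPLE = (
    '<?xml version="1.0"?>\n'
    "<Data><Job><Job_Id>506995.mu01</Job_Id>"
    "<job_state>R</job_state><queue>gold5120</queue></Job></Data>\n"
)


def positional_state(stdout: str) -> str:
    """Mimic the lookup from the traceback: word[-2] of line[-2]."""
    status_line = stdout.split("\n")[-2]
    # The XML line contains no whitespace, so split() yields a single token
    # and indexing [-2] raises IndexError.
    return status_line.split()[-2]


def xml_state(stdout: str) -> str:
    """Read the job state by element name instead of by position."""
    root = ET.fromstring(stdout)
    state = root.findtext(".//job_state")
    if state is None:
        raise ValueError("no job_state element found in qstat output")
    return state.strip()


if __name__ == "__main__":
    try:
        positional_state(SAMPLE)
    except IndexError as exc:
        print("positional lookup failed:", exc)
    print("state from XML:", xml_state(SAMPLE))
```

Looking the state up by name does not depend on how the scheduler lays out its lines, which is why the positional approach only works for table-style qstat output.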

njzjz commented 5 months ago

This commit is from Torque, not OpenPBS: https://github.com/adaptivecomputing/torque/commit/22f28343b8ee83b1234479b20224353f6c2db317
Use batch_type torque instead of pbs (pbs stands for OpenPBS).
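A minimal sketch of making that change programmatically, assuming the machine file is JSON with one or more batch_type entries nested somewhere inside it. The helper switch_batch_type and the example dictionary layout (the fp and machine keys, the context_type value) are hypothetical; only the batch_type key and the values pbs and torque come from this thread.

```python
import json


def switch_batch_type(node, old="pbs", new="torque"):
    """Return a copy of `node` with every batch_type equal to `old`
    (compared case-insensitively) replaced by `new`."""
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            if (
                key == "batch_type"
                and isinstance(value, str)
                and value.lower() == old.lower()
            ):
                result[key] = new
            else:
                result[key] = switch_batch_type(value, old, new)
        return result
    if isinstance(node, list):
        return [switch_batch_type(item, old, new) for item in node]
    return node


# Hypothetical fragment of a machine file; the real layout may differ.
machine = {"fp": [{"machine": {"batch_type": "PBS", "context_type": "local"}}]}
updated = switch_batch_type(machine)

if __name__ == "__main__":
    print(json.dumps(updated, indent=2))
```

For a single file, editing the batch_type value by hand is just as effective. The recursive walk is only useful when the same setting appears in several sections.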

njzjz commented 5 months ago

This turns out to be a duplicate of #84.