deepmodeling / dpdispatcher

generate HPC scheduler systems jobs input scripts and submit these scripts to HPC systems and poke until they finish
https://docs.deepmodeling.com/projects/dpdispatcher/
GNU Lesser General Public License v3.0

sed: can't read script: No such file or directory, when I submit the task using APEX with LocalContext and Slurm. #486

Closed rbjiawen closed 2 months ago

rbjiawen commented 2 months ago

░░░░░░█▐▓▓░████▄▄▄█▀▄▓▓▓▌█░░░░░░░░░░█▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀▀█░░░░░ ░░░░░▄█▌▀▄▓▓▄▄▄▄▀▀▀▄▓▓▓▓▓▌█░░░░░░░░░█░░░░░░░░▓░░▓░░░░░░░░█░░░░░ ░░░▄█▀▀▄▓█▓▓▓▓▓▓▓▓▓▓▓▓▀░▓▌█░░░░░░░░░█░░░▓░░░░░░░░░▄▄░▓░░░█░▄▄░░ ░░█▀▄▓▓▓███▓▓▓███▓▓▓▄░░▄▓▐██░░░▄▀▀▄▄█░░░░░░░▓░░░░█░░▀▄▄▄▄▄▀░░█░ ░█▌▓▓▓▀▀▓▓▓▓███▓▓▓▓▓▓▓▄▀▓▓▐█░░░█░░░░█░░░░░░░░░░░░█░░░░░░░░░░░█░ ▐█▐██▐░▄▓▓▓▓▓▀▄░▀▓▓▓▓▓▓▓▓▓▌█░░░░▀▀▄▄█░░░░░▓░░░▓░█░░░█▒░░░░█▒░░█ █▌███▓▓▓▓▓▓▓▓▐░░▄▓▓███▓▓▓▄▀▐█░░░░░░░█░░▓░░░░▓░░░█░░░░░░░▀░░░░░█ █▐█▓▀░░▀▓▓▓▓▓▓▓▓▓██████▓▓▓▓▐█░░░░░▄▄█░░░░▓░░░░░░░█░░█▄▄█▄▄█░░█░ ▌▓▄▌▀░▀░▐▀█▄▓▓██████████▓▓▓▌██░░░█░░░█▄▄▄▄▄▄▄▄▄▄█░█▄▄▄▄▄▄▄▄▄█░░ ▌▓▓▓▄▄▀▀▓▓▓▀▓▓▓▓▓▓▓▓█▓█▓█▓▓▌██░░░█▄▄█░░█▄▄█░░░░░░█▄▄█░░█▄▄█░░░░ █▐▓▓▓▓▓▓▄▄▄▓▓▓▓▓▓█▓█▓█▓█▓▓▓▐█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░

      AAAAA         PPPPPPPPP     EEEEEEEEEE  XXX       XXX
     AAA AAA        PPP     PPP   EEE           XXX   XXX
    AAA   AAA       PPP     PPP   EEE            XXX XXX
   AAAAAAAAAAA      PPPPPPPPP     EEEEEEEEE       XXXXX
  AAA       AAA     PPP           EEE            XXX XXX
 AAA         AAA    PPP           EEE           XXX   XXX
AAA           AAA   PPP           EEEEEEEEEE  XXX       XXX

==>> Alloy Property EXplorer using simulations (v1.2.8)
Please cite DOI: 10.48550/arXiv.2404.17330
Li et al, An extendable cloud-native alloy property explorer (2024).
See https://github.com/deepmodeling/APEX for more information.

Checking input files...
-------Submit Workflow Mode-------
Running APEX calculation via lammps
Submitting joint workflow...
INFO:root:Working on: /work/home/acwrhohb19/jwcao/apex_eval/lammps_demo/GAP_demo
Workflow is running locally (ID: gap-demo-joint-pz26l)
INFO:root:Step RelaxLAMMPS-Cal with item 0 starts in process 84151
2024-08-28 13:47:44,362 - INFO : info:check_all_finished: False
2024-08-28 13:47:44,477 - INFO : job: fc71a240632e7109f0639ce201ec76327c076e03 submit; job_id is 9399253
2024-08-28 14:13:57,054 - INFO : job: fc71a240632e7109f0639ce201ec76327c076e03 9399253 terminated; fail_cout is 1; resubmitting job
2024-08-28 14:13:57,131 - INFO : job:fc71a240632e7109f0639ce201ec76327c076e03 re-submit after terminated; new job_id is 9399441
2024-08-28 14:13:57,393 - INFO : job:fc71a240632e7109f0639ce201ec76327c076e03 job_id:9399441 after re-submitting; the state now is <JobStatus.waiting: 2>
2024-08-28 14:24:32,132 - INFO : job: fc71a240632e7109f0639ce201ec76327c076e03 9399441 terminated; fail_cout is 2; resubmitting job
2024-08-28 14:24:32,217 - INFO : job:fc71a240632e7109f0639ce201ec76327c076e03 re-submit after terminated; new job_id is 9399740
2024-08-28 14:24:32,477 - INFO : job:fc71a240632e7109f0639ce201ec76327c076e03 job_id:9399740 after re-submitting; the state now is <JobStatus.waiting: 2>
2024-08-28 15:16:57,962 - INFO : job: fc71a240632e7109f0639ce201ec76327c076e03 9399740 terminated; fail_cout is 3; resubmitting job
Traceback (most recent call last):
  File "/work/home/acwrhohb19/anaconda3/envs/apex_gap/lib/python3.8/site-packages/dpdispatcher/submission.py", line 358, in handle_unexpected_submission_state
    job.handle_unexpected_job_state()
  File "/work/home/acwrhohb19/anaconda3/envs/apex_gap/lib/python3.8/site-packages/dpdispatcher/submission.py", line 862, in handle_unexpected_job_state
    raise RuntimeError(err_msg)
RuntimeError: job:fc71a240632e7109f0639ce201ec76327c076e03 9399740 failed 3 times.
Possible remote error message: ==> /work/home/acwrhohb19/jwcao/apex_eval/lammps_demo/GAP_demo/remote/3468e40db6124ec6e8be9f458b3aa37ac14d6bc6/./log <==
sed: can't read script: No such file or directory
sed: can't read script: No such file or directory
sed: can't read script: No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/tmp/tmp1p82vt43/gap-demo-joint-pz26l/relaxflow--relax--run-0-lammps/script", line 95, in <module>
    submission.run_submission(clean=True)
……………………………………

---------------------------------------------------------Split line----------------------------------------------------------

Run command: apex submit -d param_joint.json -c global_hpc_local.json

---------------------------------------------------------Split line----------------------------------------------------------

My global_hpc.json file is as follows:

{
    "apex_image_name": "zhuoyli/apex_arm64",
    "run_image_name": "zhuoyli/apex_arm64",
    "run_command": "mpirun -np 8 lmp_mpi-in in.lammps",
    "group_size": 1,
    "batch_type": "Slurm",
    "context_type": "LocalContext",
    "machine": {
        "batch_type": "Slurm",
        "context_type": "LocalContext",
        "local_root": "./",
        "remote_root": "/work/home/acwrhohb19/jwcao/apex_eval/lammps_demo/GAP_demo/remote",
        "clean_asynchronously": false
    },
    "resources": {
        "number_node": 1,
        "cpu_per_node": 8,
        "gpu_per_node": 0,
        "queue_name": "xhacnormalb",
        "group_size": 1,
        "module_purge": true,
        "module_list": [
            "compiler/devtoolset/7.3.1",
            "mpi/hpcx/gcc-7.3.1",
            "mathlib/fftw/3.3.8-gcc-7.3.1-double",
            "compiler/cmake/3.23.3",
            "mathlib/lapack/3.8.0-gcc-7.3.1"
        ],
        "source_list": [
            "/work/home/acwrhohb19/jwcao/apex_eval/lammps_demo/GAP_demo/lammps-quip.sh"
        ],
        "custom_flags": [
            "#SBATCH --job-name=GAP_demo",
            "#SBATCH --partition=xhacnormalb",
            "#SBATCH --ntasks=8",
            "#SBATCH --time=1-00:00:00"
        ]
    }
}

---------------------------------------------------------Split line----------------------------------------------------------

In the tmp dir:

the *.sub.run file:

cd $REMOTE_ROOT
cd .
test $? -ne 0 && exit 1
if [ ! -f ec8c25739a7dfb1a7ddc1ef4cf92a0fc33d64998_task_tag_finished ] ;then
  ( sed -i "s#\$(pwd)#$(pwd)#g" script && python3 script ) 1>>log 2>>log
  if test $? -eq 0; then touch ec8c25739a7dfb1a7ddc1ef4cf92a0fc33d64998_task_tag_finished; else echo 1 > $REMOTE_ROOT/218928143f7f92d20fcd0442ff2c75eab569b52d_flag_if_job_task_fail;tail -v -c 1000 $REMOTE_ROOT/./log > $REMOTE_ROOT/218928143f7f92d20fcd0442ff2c75eab569b52d_last_err_file;fi
fi &
wait
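For readers unfamiliar with the sed step in this script: it replaces the literal text `$(pwd)` inside `script` with the directory the job is running in, and then runs the result with python3. The following is a rough Python sketch of that substitution, not code from dpdispatcher; the function name `substitute_pwd` is hypothetical. Like sed, it has to open `script` before it can rewrite anything.

```python
import os


def substitute_pwd(path, cwd=None):
    """Replace every literal "$(pwd)" in the file at `path` with `cwd`."""
    # Roughly what the sed call in the generated *.sub.run file does.
    # Assumption: the file is plain UTF-8 text.
    cwd = os.getcwd() if cwd is None else cwd
    # Opening the file raises FileNotFoundError if `path` does not exist
    # or is a symlink whose target is missing.
    with open(path, "r", encoding="utf-8") as fh:
        text = fh.read()
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(text.replace("$(pwd)", cwd))
```

One difference from sed: `sed -i` writes a new file in place of the old one, while this sketch writes through to the existing file.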

So why can't sed find the script? When I add:

ls $REMOTE_ROOT
ls -l script

echo "current workdir: $(pwd)"
cat $(readlink -f script)

to the source file, the slurm.out is as follows:

3468e40db6124ec6e8be9f458b3aa37ac14d6bc6.json
fc71a240632e7109f0639ce201ec76327c076e03_flag_if_job_task_fail
fc71a240632e7109f0639ce201ec76327c076e03_job_id
fc71a240632e7109f0639ce201ec76327c076e03_last_err_file
fc71a240632e7109f0639ce201ec76327c076e03.sub
fc71a240632e7109f0639ce201ec76327c076e03.sub.run
log
script
slurm-9399253.out
slurm-9399441.out
slurm-9399740.out
tmp

lrwxrwxr-x 1 acwrhohb19 acwrhohb19 87 Aug 28 13:47 script -> /tmp/tmp1p82vt43/gap-demo-joint-pz26l/relaxflow--relax--run-0-lammps/workdir/././script
current workdir: /work/home/acwrhohb19/jwcao/apex_eval/lammps_demo/GAP_demo/remote/3468e40db6124ec6e8be9f458b3aa37ac14d6bc6

The script seems to be a symbolic link.

How can I debug this?
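One way to narrow this down is to check, on the node where the job actually runs, whether the link's target exists there. The sketch below is a generic check using only the Python standard library; the helper names `describe_link` and `dangling_links` are hypothetical and not part of dpdispatcher or APEX.

```python
import os


def describe_link(path):
    """Return (is_symlink, target, target_exists) for `path`."""
    is_link = os.path.islink(path)
    target = os.readlink(path) if is_link else None
    # os.path.exists follows symlinks, so it is False for a dangling link.
    return is_link, target, os.path.exists(path)


def dangling_links(directory):
    """List entries in `directory` that are symlinks with a missing target."""
    return sorted(
        name
        for name in os.listdir(directory)
        if os.path.islink(os.path.join(directory, name))
        and not os.path.exists(os.path.join(directory, name))
    )
```

Calling `dangling_links(".")` from the task directory inside the Slurm job (for example via a one-line `python3 -c` call in the sourced file) would show whether `script` points at a path that the compute node cannot see.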

njzjz commented 2 months ago

On HPC clusters, each node usually has its own /tmp directory.
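If that is the cause here, the `script` symlink points into the /tmp of the node where the workflow was started, and that path does not exist on the compute node that runs the Slurm job. A possible workaround, assuming the `/tmp/tmp1p82vt43` directory was created with Python's `tempfile` module (the thread does not confirm this), is to point temporary directories at a filesystem mounted on every node. The sketch below only illustrates how `tempfile` reacts to `TMPDIR`; the function name `shared_tempdir` and its `shared_root` argument are hypothetical, and whether APEX honors this setting is untested.

```python
import os
import tempfile


def shared_tempdir(shared_root):
    """Create a temporary directory under `shared_root` via TMPDIR.

    `shared_root` is assumed to be on a filesystem visible to all nodes.
    """
    os.makedirs(shared_root, exist_ok=True)
    os.environ["TMPDIR"] = shared_root  # also inherited by child processes
    tempfile.tempdir = None  # drop the cached default so TMPDIR is re-read
    return tempfile.mkdtemp()
```

The shell equivalent would be to export `TMPDIR` to a shared directory before running `apex submit`, again under the assumption that the tool picks its working directory through `tempfile`.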