==>> Alloy Property EXplorer using simulations (v1.2.8)
Please cite DOI: 10.48550/arXiv.2404.17330
Li et al, An extendable cloud-native alloy property explorer (2024).
See https://github.com/deepmodeling/APEX for more information.
Checking input files...
-------Submit Workflow Mode-------
Running APEX calculation via lammps
Submitting joint workflow...
INFO:root:Working on: /work/home/acwrhohb19/jwcao/apex_eval/lammps_demo/GAP_demo
Workflow is running locally (ID: gap-demo-joint-pz26l)
INFO:root:Step RelaxLAMMPS-Cal with item 0 starts in process 84151
2024-08-28 13:47:44,362 - INFO : info:check_all_finished: False
2024-08-28 13:47:44,477 - INFO : job: fc71a240632e7109f0639ce201ec76327c076e03 submit; job_id is 9399253
2024-08-28 14:13:57,054 - INFO : job: fc71a240632e7109f0639ce201ec76327c076e03 9399253 terminated; fail_cout is 1; resubmitting job
2024-08-28 14:13:57,131 - INFO : job:fc71a240632e7109f0639ce201ec76327c076e03 re-submit after terminated; new job_id is 9399441
2024-08-28 14:13:57,393 - INFO : job:fc71a240632e7109f0639ce201ec76327c076e03 job_id:9399441 after re-submitting; the state now is <JobStatus.waiting: 2>
2024-08-28 14:24:32,132 - INFO : job: fc71a240632e7109f0639ce201ec76327c076e03 9399441 terminated; fail_cout is 2; resubmitting job
2024-08-28 14:24:32,217 - INFO : job:fc71a240632e7109f0639ce201ec76327c076e03 re-submit after terminated; new job_id is 9399740
2024-08-28 14:24:32,477 - INFO : job:fc71a240632e7109f0639ce201ec76327c076e03 job_id:9399740 after re-submitting; the state now is <JobStatus.waiting: 2>
2024-08-28 15:16:57,962 - INFO : job: fc71a240632e7109f0639ce201ec76327c076e03 9399740 terminated; fail_cout is 3; resubmitting job
Traceback (most recent call last):
  File "/work/home/acwrhohb19/anaconda3/envs/apex_gap/lib/python3.8/site-packages/dpdispatcher/submission.py", line 358, in handle_unexpected_submission_state
    job.handle_unexpected_job_state()
  File "/work/home/acwrhohb19/anaconda3/envs/apex_gap/lib/python3.8/site-packages/dpdispatcher/submission.py", line 862, in handle_unexpected_job_state
    raise RuntimeError(err_msg)
RuntimeError: job:fc71a240632e7109f0639ce201ec76327c076e03 9399740 failed 3 times.
Possible remote error message: ==> /work/home/acwrhohb19/jwcao/apex_eval/lammps_demo/GAP_demo/remote/3468e40db6124ec6e8be9f458b3aa37ac14d6bc6/./log <==
sed: can't read script: No such file or directory
sed: can't read script: No such file or directory
sed: can't read script: No such file or directory
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/tmp/tmp1p82vt43/gap-demo-joint-pz26l/relaxflow--relax--run-0-lammps/script", line 95, in
    submission.run_submission(clean=True)
……………………………………
---------------------------------------------------------Split line----------------------------------------------------------
Run command:
apex submit -d param_joint.json -c global_hpc_local.json
---------------------------------------------------------Split line----------------------------------------------------------
My global_hpc.json file is as follows:
{
  "apex_image_name": "zhuoyli/apex_arm64",
  "run_image_name": "zhuoyli/apex_arm64",
  "run_command": "mpirun -np 8 lmp_mpi-in in.lammps",
  "group_size": 1,
  "batch_type": "Slurm",
  "context_type": "LocalContext",
  "machine": {
    "batch_type": "Slurm",
    "context_type": "LocalContext",
    "local_root": "./",
    "remote_root": "/work/home/acwrhohb19/jwcao/apex_eval/lammps_demo/GAP_demo/remote",
    "clean_asynchronously": false
  },
  "resources": {
    "number_node": 1,
    "cpu_per_node": 8,
    "gpu_per_node": 0,
    "queue_name": "xhacnormalb",
    "group_size": 1,
    "module_purge": true,
    "module_list": [
      "compiler/devtoolset/7.3.1",
      "mpi/hpcx/gcc-7.3.1",
      "mathlib/fftw/3.3.8-gcc-7.3.1-double",
      "compiler/cmake/3.23.3",
      "mathlib/lapack/3.8.0-gcc-7.3.1"
    ],
    "source_list": [
      "/work/home/acwrhohb19/jwcao/apex_eval/lammps_demo/GAP_demo/lammps-quip.sh"
    ],
    "custom_flags": [
      "#SBATCH --job-name=GAP_demo",
      "#SBATCH --partition=xhacnormalb",
      "#SBATCH --ntasks=8",
      "#SBATCH --time=1-00:00:00"
    ]
  }
}
---------------------------------------------------------Split line----------------------------------------------------------
In the tmp dir:
the *.sub.run file:
cd $REMOTE_ROOT
cd .
test $? -ne 0 && exit 1
if [ ! -f ec8c25739a7dfb1a7ddc1ef4cf92a0fc33d64998_task_tag_finished ] ;then
  ( sed -i "s#\$(pwd)#$(pwd)#g" script && python3 script ) 1>>log 2>>log
  if test $? -eq 0; then touch ec8c25739a7dfb1a7ddc1ef4cf92a0fc33d64998_task_tag_finished; else echo 1 > $REMOTE_ROOT/218928143f7f92d20fcd0442ff2c75eab569b52d_flag_if_job_task_fail;tail -v -c 1000 $REMOTE_ROOT/./log > $REMOTE_ROOT/218928143f7f92d20fcd0442ff2c75eab569b52d_last_err_file;fi
fi &
wait
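For readers unfamiliar with the sed line in this file: as far as I can tell, it rewrites every literal `$(pwd)` placeholder inside `script` to the directory the job is running in, and only then runs `python3 script`. The Python sketch below mimics that substitution on a string. The function name `substitute_pwd` and the sample line are hypothetical and only for illustration; they are not part of dpdispatcher.

```python
import os


def substitute_pwd(text, cwd=None):
    """Replace each literal '$(pwd)' placeholder in text with a directory path.

    Mirrors the effect of the sed expression in the .sub.run file, which
    substitutes the placeholder with the shell's current working directory.
    """
    if cwd is None:
        cwd = os.getcwd()
    return text.replace("$(pwd)", cwd)


if __name__ == "__main__":
    # Hypothetical line such as one that might appear in the generated script
    sample = 'work_base = "$(pwd)/task"'
    print(substitute_pwd(sample, cwd="/remote/root/abc"))
```

Since sed has to open `script` to do this, the substitution fails before Python is ever started if the file cannot be read from the compute node.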
So why can't sed find the script? When I add the following commands to the source file:
ls $REMOTE_ROOT
ls -l script
echo "current workdir: $(pwd)"
cat $(readlink -f script)
the slurm.out is as follows:
3468e40db6124ec6e8be9f458b3aa37ac14d6bc6.json
fc71a240632e7109f0639ce201ec76327c076e03_flag_if_job_task_fail
fc71a240632e7109f0639ce201ec76327c076e03_job_id
fc71a240632e7109f0639ce201ec76327c076e03_last_err_file
fc71a240632e7109f0639ce201ec76327c076e03.sub
fc71a240632e7109f0639ce201ec76327c076e03.sub.run
log
script
slurm-9399253.out
slurm-9399441.out
slurm-9399740.out
tmp
lrwxrwxr-x 1 acwrhohb19 acwrhohb19 87 Aug 28 13:47 script -> /tmp/tmp1p82vt43/gap-demo-joint-pz26l/relaxflow--relax--run-0-lammps/workdir/././script
current workdir: /work/home/acwrhohb19/jwcao/apex_eval/lammps_demo/GAP_demo/remote/3468e40db6124ec6e8be9f458b3aa37ac14d6bc6
So `script` seems to be a symbolic link rather than a regular file.
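One hypothesis that would fit this output (a guess, not a confirmed diagnosis): the link points into `/tmp` on the node where `apex submit` was run, and on many clusters `/tmp` is local to each node, so the target may not exist on the compute node even though `ls` still lists the link itself. A minimal Python sketch to test that is below. The helper name `describe_link` is hypothetical, and the demo uses a temporary directory instead of the real job directory.

```python
import os
import tempfile


def describe_link(path):
    """Report whether path is a symlink and whether its target exists here."""
    is_link = os.path.islink(path)
    return {
        "path": path,
        "is_symlink": is_link,
        "target": os.readlink(path) if is_link else None,
        # os.path.exists follows symlinks, so it is False for a dangling link
        "target_exists": os.path.exists(path),
    }


if __name__ == "__main__":
    # Self-contained demo: create a link, then delete its target
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "real_script")
        link = os.path.join(tmp, "script")
        with open(target, "w") as fh:
            fh.write("print('hello')\n")
        os.symlink(target, link)
        print(describe_link(link))
        os.remove(target)  # simulate a target that is missing on this node
        print(describe_link(link))
```

Calling `describe_link("script")` from the job's working directory on the compute node (for example via a short Python call placed in the source file) would show whether `target_exists` is false there, which would explain why sed reports "No such file or directory" for a name that `ls` can see.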
How can I debug this?