Closed: yaomz16 closed this issue 2 years ago
If you're running srun via an interactive job, that's unsupported: https://arjunacluster.github.io/ArjunaUsers/about/user_software.html#interactive-mpi-jobs
Otherwise, closing as incomplete (no submission script)
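For reference, the supported route for MPI work is a batch job rather than an interactive srun. A minimal wrapper might look like the sketch below; the SBATCH values are illustrative (partition, account, and module borrowed from the DP-GEN scripts in this thread), so adjust resources to your case:

```bash
#!/bin/bash -l
# illustrative batch wrapper; partition/account/resources are placeholders
#SBATCH --partition cpu
#SBATCH -A venkvis
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 4
#SBATCH -t 01:00:00
module load cuda/11.4.0
srun --mpi=pmix pw.x -i input
```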
No, it is not an interactive job; it was submitted via the Slurm system. The submission script is at /home/mingzeya/Li_DFT_DP_Gen_project/ML/work/fp/d37ca028d198db8caa18378e3a0db2de3fa82cc9 and the corresponding output file is /home/mingzeya/Li_DFT_DP_Gen_project/ML/work/fp/d37ca028d198db8caa18378e3a0db2de3fa82cc9/task.006.000009/output
The submission script:

```bash
#!/bin/bash -l
REMOTE_ROOT=/home/mingzeya/Li_DFT_DP_Gen_project/ML/work/fp/d37ca028d198db8caa18378e3a0db2de3fa82cc9
echo 0 > $REMOTE_ROOT/c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_flag_if_job_task_fail
test $? -ne 0 && exit 1
module load cuda/11.4.0
{ source /home/mingzeya/Li_DFT_DP_Gen_project/ML/dpgen/qe.sh; }
cd $REMOTE_ROOT
cd task.006.000009
test $? -ne 0 && exit 1
if [ ! -f 1207f30b708c4f03d622f572393660b8d924e81f_task_tag_finished ] ;then
  ( srun -t 2-00:00:00 --mpi=pmix pw.x -i input ) 1>>output 2>>output
  if test $? -eq 0; then touch 1207f30b708c4f03d622f572393660b8d924e81f_task_tag_finished; else echo 1 > $REMOTE_ROOT/c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_flag_if_job_task_fail; fi
fi &
wait
cd $REMOTE_ROOT
test $? -ne 0 && exit 1
wait
FLAG_IF_JOB_TASK_FAIL=$(cat c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_flag_if_job_task_fail)
if test $FLAG_IF_JOB_TASK_FAIL -eq 0; then touch c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_job_tag_finished; else exit 1; fi
```
This file was generated by DP-GEN; it worked fine before and suddenly broke down today.
Can you add the following after the `{ source ...; }` line:

```
which srun
srun --mpi=list
env
```

And post the output?
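For example, the relevant part of the generated script would then look roughly like this (a sketch; only the three diagnostic lines are new):

```bash
module load cuda/11.4.0
{ source /home/mingzeya/Li_DFT_DP_Gen_project/ML/dpgen/qe.sh; }
# requested diagnostics
which srun
srun --mpi=list
env
```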
I get the following for `srun --mpi=list`:

```
❯ srun --mpi=list
srun: error: spank: x11.so: Plugin file not found
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v3
```
And the output of `which srun`, `srun --mpi=list`, and `env` from the job:

```
/usr/local/sbin/srun
srun: error: spank: x11.so: Plugin file not found
srun: MPI types are...
srun: pmix_v3
srun: pmix
srun: pmi2
srun: none
srun: cray_shasta
SLURM_NODELIST=d021
SLURM_JOB_NAME=c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5.sub
```
I have also run into this issue when running GPAW. Similarly, it was running well until a couple of hours ago.
```
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
```
Can you provide the IDs of the last job that worked and the first that did not?
Ideally for jobs that ran on the same node.
I believe this issue affects all CPU nodes, because my DP-GEN workflow occupies most of the available CPU nodes.
Yes, all nodes share the same image, so I would hope they'd behave the same.
I’m trying to bound the issue to find the root cause. So far, I haven't found a smoking gun in the head node’s logs, and sifting through the logs of all of the CPU nodes will take time. You can help narrow my search with some information:
If you can provide the ID of the last successful job and the ID of the first failed job, that would be extremely helpful. It would be even more helpful if you could provide that info per node (for example, on c003, the last successful job was 124 and the first failure was 155).
The first failed job is 238871, and the last successful job is 238357
I don't know if this will be the last successful and first failed, but here is some info
Job 237991 succeeded on d021 at Sun Aug 28 21:32:28 EDT 2022
Job 243142 failed on d021 starting at Tue Aug 30 17:56:45 EDT 2022
@yaomz16 Can you submit a job to f010? Use the `-w` flag in sbatch to do it: `sbatch -w f010 ...`
@emilannevelink Can you submit a job to d021? Use the `-w` flag in sbatch to do it: `sbatch -w d021 ...`
A bunch of nodes went offline yesterday during the storm. Here's hoping all they need is a reboot
- sbatch -w f010
I submitted a job (243212) to f010, but I cannot see which node my previous jobs are on... Sorry
The output of job 243212 is:

```
srun: error: spank: x11.so: Plugin file not found
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
```
and the submission script:

```bash
#!/bin/bash -l
module load cuda/11.4.0
{ source /home/mingzeya/Li_DFT_DP_Gen_project/ML/dpgen/qe.sh; }
( srun -t 2-00:00:00 --mpi=pmix pw.x -i input ) 1>>output 2>>output
```
which was submitted with `sbatch -w f010 test.sh`.
I also just found that c023 has the same problem, so I guess all GPU nodes are affected as well. The job is 243213 on GPU node c023; the submission script is the same except for the partition and account lines.
Just ran a test (job id 243250) on d021. I got the same error message:
```
srun: error: spank: x11.so: Plugin file not found
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
```
@yaomz16 For future reference:
```
❯ sacct -u mingzeya --starttime now-3days --format jobid,jobname,node,start,end,state | grep f010
238143 e729452e5+ f010 2022-08-29T16:10:07 2022-08-29T18:34:02 COMPLETED
238143.batch batch f010 2022-08-29T14:02:55 2022-08-29T18:34:02 COMPLETED
238143.exte+ extern f010 2022-08-29T14:02:55 2022-08-29T18:34:02 COMPLETED
238143.1 pw.x f010 2022-08-29T16:14:07 2022-08-29T18:34:02 COMPLETED
238357 3e016ac4e+ f010 2022-08-30T03:19:53 2022-08-30T05:50:41 COMPLETED
238357.batch batch f010 2022-08-30T03:19:53 2022-08-30T05:50:41 COMPLETED
238357.exte+ extern f010 2022-08-30T03:19:53 2022-08-30T05:50:41 COMPLETED
238357.0 pw.x f010 2022-08-30T03:21:37 2022-08-30T05:50:41 COMPLETED
243021 c23d1b1a4+ f010 2022-08-30T16:31:35 2022-08-30T16:34:16 FAILED
243021.batch batch f010 2022-08-30T16:31:35 2022-08-30T16:34:16 FAILED
243021.exte+ extern f010 2022-08-30T16:31:35 2022-08-30T16:34:16 COMPLETED
243212 test.sh f010 2022-08-30T22:19:30 2022-08-30T22:21:05 FAILED
243212.batch batch f010 2022-08-30T22:19:30 2022-08-30T22:21:05 FAILED
243212.exte+ extern f010 2022-08-30T22:19:30 2022-08-30T22:21:05 COMPLETED
```
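For anyone trying to bound a similar regression per node, roughly the same query can be scoped to a node and a time window; a sketch (the node name and window below are placeholders):

```bash
# list everyone's jobs that touched a given node in a suspect window,
# to spot the last COMPLETED and first FAILED job on that node
sacct --allusers --nodelist f010 \
      --starttime 2022-08-30T00:00 --endtime 2022-08-31T00:00 \
      --format jobid,jobname,nodelist,start,end,state,exitcode
```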
Can you provide the following (attach file contents here, don't just provide a path):

- Submission script
- Output log file
- Output error log

I'd like it for the following job ids:

- 238803 (Last success) - Please confirm this is the same job type
- 238871 (First fail)
- 238357
- 243021

The upside is that if 238803 and 238871 are the bounding jobs, then whatever happened occurred between 2022-08-30T13:17:22 and 2022-08-30T13:19:11.

@emilannevelink Bummer, I was hoping a reboot would fix it. Looking at your jobs the past few days, they're all showing an exit code of 0 (success). Can you post your submission script?

Here is the submission script as well as the output and error files. It does look like the job finishes without an error, but it doesn't run the line in the script.
Sorry, I cannot provide the output log and error files; these jobs were all generated by DP-GEN, and it seems those files were cleaned up by DP-GEN. The submission scripts are basically the same as the following:
```bash
#!/bin/bash -l
#SBATCH --parsable
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 54
#SBATCH --gres=gpu:0
#SBATCH --partition cpu
#SBATCH --mem=108G
#SBATCH -t 2-00:00:00
#SBATCH -A venkvis
REMOTE_ROOT=/home/mingzeya/Li_DFT_DP_Gen_project/ML/work/fp/d37ca028d198db8caa18378e3a0db2de3fa82cc9
echo 0 > $REMOTE_ROOT/c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_flag_if_job_task_fail
test $? -ne 0 && exit 1
module load cuda/11.4.0
{ source /home/mingzeya/Li_DFT_DP_Gen_project/ML/dpgen/qe.sh; }
cd $REMOTE_ROOT
cd task.006.000009
test $? -ne 0 && exit 1
if [ ! -f 1207f30b708c4f03d622f572393660b8d924e81f_task_tag_finished ] ;then
  ( srun -t 2-00:00:00 --mpi=pmix pw.x -i input ) 1>>output 2>>output
  if test $? -eq 0; then touch 1207f30b708c4f03d622f572393660b8d924e81f_task_tag_finished; else echo 1 > $REMOTE_ROOT/c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_flag_if_job_task_fail; fi
fi &
wait
cd $REMOTE_ROOT
test $? -ne 0 && exit 1
wait
FLAG_IF_JOB_TASK_FAIL=$(cat c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_flag_if_job_task_fail)
if test $FLAG_IF_JOB_TASK_FAIL -eq 0; then touch c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_job_tag_finished; else exit 1; fi
```
except that the path is a little bit different
Oh, luckily I found the submission script, the output log file, and the output for job 238871 (first fail). Submission script:

```bash
#!/bin/bash -l
REMOTE_ROOT=/home/mingzeya/Li_DFT_DP_Gen_project/ML/work/fp/5feaf54ba80a35dfeeec2e184b11412f3e34c6ad
echo 0 > $REMOTE_ROOT/3ca1eb001ac5abfef83fbc97503bd8b332456743_flag_if_job_task_fail
test $? -ne 0 && exit 1
module load cuda/11.4.0
{ source /home/mingzeya/Li_DFT_DP_Gen_project/ML/dpgen/qe.sh; }
cd $REMOTE_ROOT
cd task.006.000025
test $? -ne 0 && exit 1
if [ ! -f d9ad0ac7cbc804cd75339957d1e5c36197efdea0_task_tag_finished ] ;then
  ( srun -t 08:00:00 --mpi=pmix pw.x -i input ) 1>>output 2>>output
  if test $? -eq 0; then touch d9ad0ac7cbc804cd75339957d1e5c36197efdea0_task_tag_finished; else echo 1 > $REMOTE_ROOT/3ca1eb001ac5abfef83fbc97503bd8b332456743_flag_if_job_task_fail; fi
fi &
wait
cd $REMOTE_ROOT
test $? -ne 0 && exit 1
wait
FLAG_IF_JOB_TASK_FAIL=$(cat 3ca1eb001ac5abfef83fbc97503bd8b332456743_flag_if_job_task_fail)
if test $FLAG_IF_JOB_TASK_FAIL -eq 0; then touch 3ca1eb001ac5abfef83fbc97503bd8b332456743_job_tag_finished; else exit 1; fi
```
The output file slurm-238871.out is empty 👎
The output file ("output" in the srun line) is:

```
srun: error: spank: x11.so: Plugin file not found
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
srun: error: spank: x11.so: Plugin file not found
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
srun: error: spank: x11.so: Plugin file not found
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
```
I'm so sorry, 238803 is not the same job type... The actual last successful one is 238357, which is a Quantum ESPRESSO job generated by DP-GEN. However, since that job succeeded, DP-GEN cleaned up its output log files...
And 243021 seems to be a very small, short, but successful GPU job generated by DP-GEN that does not use `srun --mpi=pmix`. Unfortunately DP-GEN cleaned everything up again... sorry.
Have you tried running with a different PMI? For example, pmi2?
Not a fix, but maybe enough to get you limping along
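That is, the same launch line with only the PMI flavor swapped (a workaround sketch; it assumes the MPI build also supports PMI-2):

```bash
# same QE launch as in the generated script, using pmi2 instead of pmix
( srun -t 2-00:00:00 --mpi=pmi2 pw.x -i input ) 1>>output 2>>output
```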
I get this error log when using `--mpi=pmi2`:

```
srun: error: spank: x11.so: Plugin file not found
--------------------------------------------------------------------------
PMI2_Init failed to intialize. Return code: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
PMI2_Init failed to intialize. Return code: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
PMI2_Init failed to intialize. Return code: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
PMI2_Init failed to intialize. Return code: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[d021:04073] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[d021:04074] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[d021:04071] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[d021:04072] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: d021: tasks 0-3: Exited with exit code 1
```
I captured the full stack trace to see what the error is, following this bug report. It seems the issue is that srun cannot find the PMIx library (see step 9 in the attached URL). In the stack trace, it searches a long list of paths, most of which seem to be in your (Alex's) home directory. I don't know how to find out which path PMIx should actually be on, but this is a start for checking whether the library is in any of those. error.243260.zip
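For reference, one way such a library-search trace can be reproduced (assuming strace is available on the node; the output file name is arbitrary):

```bash
# record every file-open attempt made by srun and its children,
# then look at where it hunts for the PMIx library
strace -f -e trace=openat -o pmix_trace.txt srun -n 1 --mpi=pmix hostname
grep -i pmix pmix_trace.txt
```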
Digging slightly deeper, I tried looking in the path `/home/awadell/.spack/opt/spack/linux-centos7-broadwell/gcc-11.2.0/pmix-3.2.3-knc4vugvfrt3kuvrpcxkzg6g5otqdzqe`, but it does not seem to exist, which is weird. I then looked for a different PMIx path and found `/home/awadell/.spack/opt/spack/linux-centos7-broadwell/gcc-11.2.0/pmix-3.2.3-brq6e2b2bm6zazsnke5mulayxlv5ux47`, but it seems to be empty, which is also weird.
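A couple of quick checks along those lines (the Slurm plugin path below is an assumption; the actual plugin directory on Arjuna may differ):

```bash
# does the pmix plugin exist, and does its libpmix dependency resolve?
ls /usr/local/lib/slurm/ | grep -i pmix
ldd /usr/local/lib/slurm/mpi_pmix_v3.so | grep -i -E 'pmix|not found'

# do the spack prefixes mentioned above actually contain the library?
ls -l /home/awadell/.spack/opt/spack/linux-centos7-broadwell/gcc-11.2.0/pmix-3.2.3-*/lib 2>/dev/null
```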
Okay, should be fixed now. Apologies for the mess, looks like we weren't as hermetic as we needed to be when building slurm.
I've built a hopefully equivalent version of pmix and symlinked it to the old path. Please let me know if it's fixed
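Roughly, the fix described above amounts to something like this (a sketch only; the rebuilt prefix is a placeholder, and the hash is the missing path mentioned earlier in the thread):

```bash
# rebuild a compatible pmix 3.2.3, then point the path slurm was built against at it
ln -s /path/to/rebuilt/pmix-3.2.3 \
      /home/awadell/.spack/opt/spack/linux-centos7-broadwell/gcc-11.2.0/pmix-3.2.3-knc4vugvfrt3kuvrpcxkzg6g5otqdzqe
```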
It is working! Thanks
Your Name
Archie
Andrew ID
mingzeya
Where it Happened
All CPU nodes; one example is on f010, job id = 243021 (which failed)
What Happened?
I tried to run Quantum ESPRESSO jobs on CPU nodes and all jobs failed with the same error:

```
srun: error: spank: x11.so: Plugin file not found
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
```

I use the command `srun -t 2-00:00:00 --mpi=pmix pw.x -i input` for the QE jobs; it worked perfectly before.

Steps to reproduce
No response
Job Submission Script
No response
What I've tried
No response