Closed: yaomz16 closed this issue 2 years ago
If you're running srun via an interactive job, that's unsupported: https://arjunacluster.github.io/ArjunaUsers/about/user_software.html#interactive-mpi-jobs
Otherwise, closing as incomplete (no submission script)
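For reference, the supported route for MPI work is a batch job rather than an interactive srun. A minimal wrapper might look like the sketch below; the SBATCH values are illustrative (partition, account, and module borrowed from the DP-GEN scripts in this thread), so adjust resources to your case:

```bash
#!/bin/bash -l
# illustrative batch wrapper; partition/account/resources are placeholders
#SBATCH --partition cpu
#SBATCH -A venkvis
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 4
#SBATCH -t 01:00:00
module load cuda/11.4.0
srun --mpi=pmix pw.x -i input
```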
No, it is not an interactive job; it was submitted via the Slurm system. The submission script is at /home/mingzeya/Li_DFT_DP_Gen_project/ML/work/fp/d37ca028d198db8caa18378e3a0db2de3fa82cc9 and the corresponding output file is /home/mingzeya/Li_DFT_DP_Gen_project/ML/work/fp/d37ca028d198db8caa18378e3a0db2de3fa82cc9/task.006.000009/output
The submission script:

```bash
#!/bin/bash -l
REMOTE_ROOT=/home/mingzeya/Li_DFT_DP_Gen_project/ML/work/fp/d37ca028d198db8caa18378e3a0db2de3fa82cc9
echo 0 > $REMOTE_ROOT/c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_flag_if_job_task_fail
test $? -ne 0 && exit 1
module load cuda/11.4.0
{ source /home/mingzeya/Li_DFT_DP_Gen_project/ML/dpgen/qe.sh; }
cd $REMOTE_ROOT
cd task.006.000009
test $? -ne 0 && exit 1
if [ ! -f 1207f30b708c4f03d622f572393660b8d924e81f_task_tag_finished ] ;then
  ( srun -t 2-00:00:00 --mpi=pmix pw.x -i input ) 1>>output 2>>output
  if test $? -eq 0; then touch 1207f30b708c4f03d622f572393660b8d924e81f_task_tag_finished; else echo 1 > $REMOTE_ROOT/c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_flag_if_job_task_fail; fi
fi &
wait
cd $REMOTE_ROOT
test $? -ne 0 && exit 1
wait
FLAG_IF_JOB_TASK_FAIL=$(cat c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_flag_if_job_task_fail)
if test $FLAG_IF_JOB_TASK_FAIL -eq 0; then touch c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_job_tag_finished; else exit 1; fi
```
This file was generated by DP-GEN; it worked fine before and suddenly broke down today.
Can you add the following after the `{ source ...; }` line:

```
which srun
srun --mpi=list
env
```

And post the output?
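For example, the relevant part of the generated script would then look roughly like this (a sketch; only the three diagnostic lines are new):

```bash
module load cuda/11.4.0
{ source /home/mingzeya/Li_DFT_DP_Gen_project/ML/dpgen/qe.sh; }
# requested diagnostics
which srun
srun --mpi=list
env
```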
I get the following for `srun --mpi=list`:

```
❯ srun --mpi=list
srun: error: spank: x11.so: Plugin file not found
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v3
```
And the output of `which srun`, `srun --mpi=list`, and `env` from the job:

```
/usr/local/sbin/srun
srun: error: spank: x11.so: Plugin file not found
srun: MPI types are...
srun: pmix_v3
srun: pmix
srun: pmi2
srun: none
srun: cray_shasta
SLURM_NODELIST=d021
SLURM_JOB_NAME=c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5.sub
```
I have also run into this issue when running GPAW. Similarly, it was running well until a couple of hours ago.
```
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
```
Can you provide the IDs of the last job that worked and the first that did not?
Ideally for jobs that ran on the same node.
I believe this issue affects all CPU nodes, because my DP-GEN workflow occupies most of the available CPU nodes.
Yes, all nodes share the same image, so I would hope they'd behave the same.
I’m trying to bound the issue to find the root cause. So far, I haven't found a smoking gun in the head node’s logs, and sifting through the logs of all of the CPU nodes will take time. You can help narrow my search with some information:
If you can provide the ID of the last successful job and the ID of the first failed job, that would be extremely helpful. It would be even more helpful if you could provide that info per node (for example, on c003, the last successful job was 124 and the first failure was 155).
The first failed job is 238871, and the last successful job is 238357
I don't know if this will be the last successful and first failed, but here is some info
Job 237991 succeeded on d021 at Sun Aug 28 21:32:28 EDT 2022
Job 243142 failed on d021 starting at Tue Aug 30 17:56:45 EDT 2022
@yaomz16 Can you submit a job to f010? Use the `-w` flag in sbatch to do it: `sbatch -w f010 ...`
@emilannevelink Can you submit a job to d021? Use the `-w` flag in sbatch to do it: `sbatch -w d021 ...`
A bunch of nodes went offline yesterday during the storm. Here's hoping all they need is a reboot
- sbatch -w f010
I submitted a job (243212) to f010, but I cannot see which node my previous jobs are on... Sorry
The output of job 243212 is:

```
srun: error: spank: x11.so: Plugin file not found
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
```
and the submission script:

```bash
#!/bin/bash -l
module load cuda/11.4.0
{ source /home/mingzeya/Li_DFT_DP_Gen_project/ML/dpgen/qe.sh; }
( srun -t 2-00:00:00 --mpi=pmix pw.x -i input ) 1>>output 2>>output
```
which was submitted with `sbatch -w f010 test.sh`.
I also just found that c023 has the same problem, so I guess all GPU nodes are affected as well. The job is 243213 on GPU node c023; the submission script is the same except for the partition and account lines.
Just ran a test (job id 243250) on d021. I got the same error message:
```
srun: error: spank: x11.so: Plugin file not found
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
```
@yaomz16 For future reference:
```
❯ sacct -u mingzeya --starttime now-3days --format jobid,jobname,node,start,end,state | grep f010
238143 e729452e5+ f010 2022-08-29T16:10:07 2022-08-29T18:34:02 COMPLETED
238143.batch batch f010 2022-08-29T14:02:55 2022-08-29T18:34:02 COMPLETED
238143.exte+ extern f010 2022-08-29T14:02:55 2022-08-29T18:34:02 COMPLETED
238143.1 pw.x f010 2022-08-29T16:14:07 2022-08-29T18:34:02 COMPLETED
238357 3e016ac4e+ f010 2022-08-30T03:19:53 2022-08-30T05:50:41 COMPLETED
238357.batch batch f010 2022-08-30T03:19:53 2022-08-30T05:50:41 COMPLETED
238357.exte+ extern f010 2022-08-30T03:19:53 2022-08-30T05:50:41 COMPLETED
238357.0 pw.x f010 2022-08-30T03:21:37 2022-08-30T05:50:41 COMPLETED
243021 c23d1b1a4+ f010 2022-08-30T16:31:35 2022-08-30T16:34:16 FAILED
243021.batch batch f010 2022-08-30T16:31:35 2022-08-30T16:34:16 FAILED
243021.exte+ extern f010 2022-08-30T16:31:35 2022-08-30T16:34:16 COMPLETED
243212 test.sh f010 2022-08-30T22:19:30 2022-08-30T22:21:05 FAILED
243212.batch batch f010 2022-08-30T22:19:30 2022-08-30T22:21:05 FAILED
243212.exte+ extern f010 2022-08-30T22:19:30 2022-08-30T22:21:05 COMPLETED
```
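For anyone trying to bound a similar regression per node, roughly the same query can be scoped to a node and a time window; a sketch (the node name and window below are placeholders):

```bash
# list everyone's jobs that touched a given node in a suspect window,
# to spot the last COMPLETED and first FAILED job on that node
sacct --allusers --nodelist f010 \
      --starttime 2022-08-30T00:00 --endtime 2022-08-31T00:00 \
      --format jobid,jobname,nodelist,start,end,state,exitcode
```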
Can you provide the following (attach file contents here, don't just provide a path):

- Submission script
- Output log file
- Output error log

I'd like it for the following job ids:

- 238803 (Last success) - Please confirm this is the same job type
- 238871 (First fail)
- 238357
- 243021

The upside is that if 238803 and 238871 are the bounding jobs, then whatever happened occurred between 2022-08-30T13:17:22 and 2022-08-30T13:19:11.

@emilannevelink Bummer, I was hoping a reboot would fix it. Looking at your jobs the past few days, they're all showing an exit code of 0 (success). Can you post your submission script?

Here is the submission script as well as the output and error files. It does look like the job finishes without an error, but it doesn't run the line in the script.
Sorry, I cannot provide the output log and error files; these jobs were all generated by DP-GEN, and it seems those files were cleaned up by DP-GEN. The submission scripts are basically the same as the following:
```bash
#!/bin/bash -l
#SBATCH --parsable
#SBATCH --nodes 1
#SBATCH --ntasks-per-node 54
#SBATCH --gres=gpu:0
#SBATCH --partition cpu
#SBATCH --mem=108G
#SBATCH -t 2-00:00:00
#SBATCH -A venkvis
REMOTE_ROOT=/home/mingzeya/Li_DFT_DP_Gen_project/ML/work/fp/d37ca028d198db8caa18378e3a0db2de3fa82cc9
echo 0 > $REMOTE_ROOT/c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_flag_if_job_task_fail
test $? -ne 0 && exit 1
module load cuda/11.4.0
{ source /home/mingzeya/Li_DFT_DP_Gen_project/ML/dpgen/qe.sh; }
cd $REMOTE_ROOT
cd task.006.000009
test $? -ne 0 && exit 1
if [ ! -f 1207f30b708c4f03d622f572393660b8d924e81f_task_tag_finished ] ;then
  ( srun -t 2-00:00:00 --mpi=pmix pw.x -i input ) 1>>output 2>>output
  if test $? -eq 0; then touch 1207f30b708c4f03d622f572393660b8d924e81f_task_tag_finished; else echo 1 > $REMOTE_ROOT/c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_flag_if_job_task_fail; fi
fi &
wait
cd $REMOTE_ROOT
test $? -ne 0 && exit 1
wait
FLAG_IF_JOB_TASK_FAIL=$(cat c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_flag_if_job_task_fail)
if test $FLAG_IF_JOB_TASK_FAIL -eq 0; then touch c23d1b1a4219abb2401d6dfd20bf4552a5d0f8a5_job_tag_finished; else exit 1; fi
```
except that the path is a little bit different
Oh, luckily I found the submission script, the output log file, and the output for job 238871 (first fail). Submission script:

```bash
#!/bin/bash -l
REMOTE_ROOT=/home/mingzeya/Li_DFT_DP_Gen_project/ML/work/fp/5feaf54ba80a35dfeeec2e184b11412f3e34c6ad
echo 0 > $REMOTE_ROOT/3ca1eb001ac5abfef83fbc97503bd8b332456743_flag_if_job_task_fail
test $? -ne 0 && exit 1
module load cuda/11.4.0
{ source /home/mingzeya/Li_DFT_DP_Gen_project/ML/dpgen/qe.sh; }
cd $REMOTE_ROOT
cd task.006.000025
test $? -ne 0 && exit 1
if [ ! -f d9ad0ac7cbc804cd75339957d1e5c36197efdea0_task_tag_finished ] ;then
  ( srun -t 08:00:00 --mpi=pmix pw.x -i input ) 1>>output 2>>output
  if test $? -eq 0; then touch d9ad0ac7cbc804cd75339957d1e5c36197efdea0_task_tag_finished; else echo 1 > $REMOTE_ROOT/3ca1eb001ac5abfef83fbc97503bd8b332456743_flag_if_job_task_fail; fi
fi &
wait
cd $REMOTE_ROOT
test $? -ne 0 && exit 1
wait
FLAG_IF_JOB_TASK_FAIL=$(cat 3ca1eb001ac5abfef83fbc97503bd8b332456743_flag_if_job_task_fail)
if test $FLAG_IF_JOB_TASK_FAIL -eq 0; then touch 3ca1eb001ac5abfef83fbc97503bd8b332456743_job_tag_finished; else exit 1; fi
```
The output file slurm-238871.out is empty 👎
The output file ("output" in the srun line) is:

```
srun: error: spank: x11.so: Plugin file not found
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
srun: error: spank: x11.so: Plugin file not found
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
srun: error: spank: x11.so: Plugin file not found
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
```
I'm so sorry, 238803 is not the same job type... The actual last successful one is 238357, which is a Quantum ESPRESSO job generated by DP-GEN. However, since that job succeeded, DP-GEN cleaned up its output log files...
And 243021 seems to be a very small, short, but successful GPU job generated by DP-GEN that does not use `srun --mpi=pmix`. Unfortunately DP-GEN cleaned everything up again... sorry.
Have you tried running with a different PMI? For example, pmi2?
Not a fix, but maybe enough to get you limping along
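That is, the same launch line with only the PMI flavor swapped (a workaround sketch; it assumes the MPI build also supports PMI-2):

```bash
# same QE launch as in the generated script, using pmi2 instead of pmix
( srun -t 2-00:00:00 --mpi=pmi2 pw.x -i input ) 1>>output 2>>output
```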
I get this error log when using `--mpi=pmi2`:

```
srun: error: spank: x11.so: Plugin file not found
--------------------------------------------------------------------------
PMI2_Init failed to intialize. Return code: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
PMI2_Init failed to intialize. Return code: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
PMI2_Init failed to intialize. Return code: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
PMI2_Init failed to intialize. Return code: 1
--------------------------------------------------------------------------
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[d021:04073] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[d021:04074] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[d021:04071] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute. There are several options for building PMI support under
SLURM, depending upon the SLURM version you are using:
version 16.05 or later: you can use SLURM's PMIx support. This
requires that you configure and build SLURM --with-pmix.
Versions earlier than 16.05: you must use either SLURM's PMI-1 or
PMI-2 support. SLURM builds PMI-1 by default, or you can manually
install PMI-2. You must then build Open MPI using --with-pmi pointing
to the SLURM PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[d021:04072] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
srun: error: d021: tasks 0-3: Exited with exit code 1
```
I captured the full stack trace to see what the error is, following this bug report. It seems the issue is that srun cannot find the PMIx library (see step 9 in the attached URL). In the stack trace, it searches a long list of paths, most of which seem to be in your (Alex's) home directory. I don't know how to find out which path PMIx should actually be on, but this is a start for checking whether the library is in any of those. error.243260.zip
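For reference, one way such a library-search trace can be reproduced (assuming strace is available on the node; the output file name is arbitrary):

```bash
# record every file-open attempt made by srun and its children,
# then look at where it hunts for the PMIx library
strace -f -e trace=openat -o pmix_trace.txt srun -n 1 --mpi=pmix hostname
grep -i pmix pmix_trace.txt
```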
Digging slightly deeper, I tried looking in the path `/home/awadell/.spack/opt/spack/linux-centos7-broadwell/gcc-11.2.0/pmix-3.2.3-knc4vugvfrt3kuvrpcxkzg6g5otqdzqe`, but it does not seem to exist, which is weird. I then looked for a different PMIx path and found `/home/awadell/.spack/opt/spack/linux-centos7-broadwell/gcc-11.2.0/pmix-3.2.3-brq6e2b2bm6zazsnke5mulayxlv5ux47`, but it seems to be empty, which is also weird.
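A couple of quick checks along those lines (the Slurm plugin path below is an assumption; the actual plugin directory on Arjuna may differ):

```bash
# does the pmix plugin exist, and does its libpmix dependency resolve?
ls /usr/local/lib/slurm/ | grep -i pmix
ldd /usr/local/lib/slurm/mpi_pmix_v3.so | grep -i -E 'pmix|not found'

# do the spack prefixes mentioned above actually contain the library?
ls -l /home/awadell/.spack/opt/spack/linux-centos7-broadwell/gcc-11.2.0/pmix-3.2.3-*/lib 2>/dev/null
```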
Okay, should be fixed now. Apologies for the mess, looks like we weren't as hermetic as we needed to be when building slurm.
I've built a hopefully equivalent version of pmix and symlinked it to the old path. Please let me know if it's fixed
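Roughly, the fix described above amounts to something like this (a sketch only; the rebuilt prefix is a placeholder, and the hash is the missing path mentioned earlier in the thread):

```bash
# rebuild a compatible pmix 3.2.3, then point the path slurm was built against at it
ln -s /path/to/rebuilt/pmix-3.2.3 \
      /home/awadell/.spack/opt/spack/linux-centos7-broadwell/gcc-11.2.0/pmix-3.2.3-knc4vugvfrt3kuvrpcxkzg6g5otqdzqe
```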
It is working! Thanks
Your Name
Archie
Andrew ID
mingzeya
Where it Happened
All CPU nodes; one example is on f010, job id = 243021 (which failed)
What Happened?
I tried to run Quantum ESPRESSO jobs on CPU nodes and all jobs failed with the same error:

```
srun: error: spank: x11.so: Plugin file not found
srun: error: mpi/pmix_v3: init: (null) [0]: mpi_pmix.c:139: pmi/pmix: can not load PMIx library
srun: error: Couldn't load specified plugin name for mpi/pmix: Plugin init() callback failed
srun: error: cannot create mpi context for mpi/pmix
srun: error: invalid MPI type 'pmix', --mpi=list for acceptable types
```

I use the command `srun -t 2-00:00:00 --mpi=pmix pw.x -i input` for the QE jobs; it worked perfectly before.

Steps to reproduce
No response
Job Submission Script
No response
What I've tried
No response