darshan-hpc / darshan

Darshan I/O characterization tool

Failed with status=127 when running on Perlmutter #990

Closed. wkliao closed this issue 3 weeks ago

wkliao commented 1 month ago

I encountered an error when testing Darshan version 3.4.5 on Perlmutter.

% cat slurm-26001722.out
LD_PRELOAD=/global/homes/w/wkliao/Darshan/3.4.5/lib/libdarshan.so
CMD = srun -n 4 ./mpi_io_test /pscratch/sd/w/wkliao/dummy
slurmstepd: error: TaskProlog failed status=127
slurmstepd: error: TaskProlog failed status=127
slurmstepd: error: TaskProlog failed status=127
slurmstepd: error: TaskProlog failed status=127
srun: error: nid006785: tasks 0-3: Exited with exit code 127
srun: Terminating StepId=26001722.0

Here is my Darshan configure command:

DARSHAN_VERSION=3.4.5

../darshan-${DARSHAN_VERSION}/configure \
           --prefix=${HOME}/Darshan/${DARSHAN_VERSION} \
           --with-log-path=${HOME}/Darshan/LOG \
           --with-jobid-env=NONE \
           --silent \
           CC=cc

Job script file:

#!/bin/bash
#SBATCH -t 00:02:00
#SBATCH -N 1
#SBATCH -C cpu
#SBATCH --qos=debug

DARSHAN_DIR=$HOME/Darshan/3.4.5

export LD_LIBRARY_PATH=$DARSHAN_DIR/lib:$LD_LIBRARY_PATH
export LD_PRELOAD=$DARSHAN_DIR/lib/libdarshan.so

echo "LD_PRELOAD=$LD_PRELOAD"

CMD="srun -n 4 ./mpi_io_test $SCRATCH/dummy"
echo "CMD = $CMD"
$CMD

Test program:

#include "mpi.h"

int main(int argc, char *argv[])
{
    MPI_File fh;

    MPI_Init(&argc, &argv);

    MPI_File_open(MPI_COMM_WORLD, argv[1], MPI_MODE_RDWR | MPI_MODE_CREATE,
                  MPI_INFO_NULL, &fh);
    MPI_File_close(&fh);

    MPI_Finalize();
    return 0;
}
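
For reference, a minimal way to build and run this reproducer (a sketch only; the source and job script file names are illustrative, and the Cray cc wrapper plus the job script above are assumed):

# compile with the Cray compiler wrapper so cray-mpich is linked in
cc -o mpi_io_test mpi_io_test.c

# submit the job script shown above (file name is illustrative)
sbatch darshan_test.sh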
shanedsnyder commented 1 month ago

I don't currently have hours on Perlmutter to check (I'm working on getting some now), but one thing you could try is to use the traditional Darshan method for intercepting in Cray PEs (i.e., not LD_PRELOAD):

export MODULEPATH=/path/to/darshan/install/share/craype-2.x/modulefiles:$MODULEPATH
module load darshan
# build your application like normal and run without LD_PRELOAD
cc -o mpi-io-test mpi-io-test.c
mpirun -n 4 ./mpi-io-test

Maybe LD_PRELOAD is causing problems within the Slurm prolog script. The approach above ensures Darshan only instruments the application executable.
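
One low-impact way to test that hypothesis (a sketch only; it uses nothing beyond standard ldd and a trivial srun step, and assumes DARSHAN_DIR is set as in the job script above):

# status 127 usually indicates a failed exec ("command not found") in the shell;
# first confirm the preloaded library and its dependencies resolve on a compute node
srun -n 1 ldd $DARSHAN_DIR/lib/libdarshan.so

# then check whether the TaskProlog failure reproduces with no MPI or test program involved
export LD_PRELOAD=$DARSHAN_DIR/lib/libdarshan.so
srun -n 1 /bin/true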

We will look into the issue regardless, since LD_PRELOAD is still something that's necessary in some cases (e.g., for Python).
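
For reference, a minimal sketch of that Python use case (the script name and mpi4py usage are illustrative assumptions; LD_PRELOAD-based interception is the relevant path here because there is no link step to wrap):

# no compile/link step to intercept, so LD_PRELOAD is the only option
export LD_PRELOAD=$HOME/Darshan/3.4.5/lib/libdarshan.so
srun -n 4 python ./analysis.py   # analysis.py is a hypothetical mpi4py script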

wkliao commented 1 month ago

The approach without setting LD_PRELOAD works. Since we rely on the LD_PRELOAD option, please let me know what I can test to help debug the problem.

shanedsnyder commented 3 weeks ago

I finally got around to checking this, but can't seem to reproduce the error you're getting, @wkliao.

I built Darshan the same way (using main rather than the 3.4.5 tag, but those should be very similar) and ran the same application with your job script (after updating the Darshan install prefix), and I'm able to generate Darshan logs fine:

ssnyder@perlmutter:login31:~/software/darshan/darshan-dev/wk/build> cat slurm-26713969.out 
LD_PRELOAD=/global/homes/s/ssnyder/software/darshan/darshan-dev/wk/install//lib/libdarshan.so
CMD = srun -n 4 ./mpi-test /pscratch/sd/s/ssnyder/dummy
MPIIO WARNING: DVS stripe width of 24 was requested but DVS set it to 1
See MPICH_MPIIO_DVS_MAXNODES in the intro_mpi man page.
ssnyder@perlmutter:login31:~/software/darshan/darshan-dev/wk/build> ls ../logs/2024/6/12/
ssnyder_mpi-test_id2133151-2133151_6-12-37984-5229832166827853701_1.darshan
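
As a further sanity check (a sketch; it assumes darshan-util's darshan-parser is built and on PATH), the new log can be dumped to confirm the MPI-IO counters were recorded:

# print the log header and the first counters from the freshly generated log
darshan-parser ../logs/2024/6/12/*.darshan | head -n 40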

Any ideas on what the discrepancy could be? If it helps, here's what my loaded modules look like (I haven't changed anything from the default when I log on):

ssnyder@perlmutter:login31:~/software/darshan/darshan-dev/wk/build> module list

Currently Loaded Modules:
  1) craype-x86-milan                        5) PrgEnv-gnu/8.5.0      9) craype/2.7.30           13) cudatoolkit/12.2
  2) libfabric/1.15.2.0                      6) cray-dsmml/0.2.2     10) gcc-native/12.3         14) craype-accel-nvidia80
  3) craype-network-ofi                      7) cray-libsci/23.12.5  11) perftools-base/23.12.0  15) gpu/1.0
  4) xpmem/2.6.2-2.5_2.38__gd067c3f.shasta   8) cray-mpich/8.1.28    12) cpe/23.12
wkliao commented 3 weeks ago

I just tested on Perlmutter again and no longer get the error. I tried both the system-installed Darshan and my own ROMIO library using the LD_PRELOAD approach. It looks like the system folks must have fixed the problem. Thanks for looking into this.

shanedsnyder commented 3 weeks ago

Not a problem! Glad it wasn't anything more complicated. I hadn't had a chance to run anything on Perlmutter for some time, so this was a good excuse to finally test Darshan there.