LLNL / UnifyFS

UnifyFS: A file system for burst buffers

MPI_File_open fails on two nodes on Frontier #788

Closed: adammoody closed this issue 10 months ago

adammoody commented 1 year ago

When running a two-process, two-node test on Frontier, MPI_File_open appears to return an error. I ran into this while PnetCDF tests were failing and simplified it down to this reproducer.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char* argv[])
{
  int rc;

  char filename[] = "/unifyfs/foo";

  MPI_Init(&argc, &argv);

  int rank, ranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ranks);
  printf("%d of %d\n", rank, ranks);
  fflush(stdout);

  MPI_File fh;
  int amode = MPI_MODE_CREATE | MPI_MODE_RDWR;
  rc = MPI_File_open(MPI_COMM_WORLD, filename, amode, MPI_INFO_NULL, &fh);
  printf("%d\n", rc);
  fflush(stdout);

  rc = MPI_File_close(&fh);
  printf("%d\n", rc);
  fflush(stdout);

  MPI_Finalize();
  return 0;
}

Built with:

#!/bin/bash
module use /sw/frontier/unifyfs/modulefiles
module load unifyfs/1.1/gcc-12.2.0
module load gcc/12.2.0
module load PrgEnv-gnu
module unload darshan-runtime

mpicc -o mpiopen mpiopen.c

Here is the script used to configure and launch. These settings probably don't matter, but I'll capture them just in case.

#!/bin/bash
# salloc -N 2 -p batch

installdir=/sw/frontier/unifyfs/spack/env/unifyfs-1.1/gcc-12.2.0/view

# disable data sieving
#>>: cat romio_hints.txt 
#romio_ds_read disable
#romio_ds_write disable
export ROMIO_HINTS=`pwd`/romio_hints.txt

# https://www.nersc.gov/assets/Uploads/MPI-Tips-rgb.pdf
#export MPICH_MPIIO_HINTS_DISPLAY=1
export MPICH_MPIIO_HINTS="romio_ds_read=disable,romio_ds_write=disable"

# http://cucis.eecs.northwestern.edu/projects/PnetCDF/doc/pnetcdf-c/PNETCDF_005fHINTS.html
export PNETCDF_HINTS="romio_ds_read=disable;romio_ds_write=disable"

export UNIFYFS_MARGO_CLIENT_TIMEOUT=70000

export UNIFYFS_CONFIGFILE=/var/tmp/unifyfs.conf
touch $UNIFYFS_CONFIGFILE

export UNIFYFS_CLIENT_LOCAL_EXTENTS=0
export UNIFYFS_CLIENT_WRITE_SYNC=0
export UNIFYFS_CLIENT_SUPER_MAGIC=0

# sleep for some time after unlink
# see https://github.com/LLNL/UnifyFS/issues/744
export UNIFYFS_CLIENT_UNLINK_USECS=1000000

srun --overlap -n 2 -N 2 mkdir -p /dev/shm/unifyfs
export UNIFYFS_LOGIO_SPILL_DIR=/dev/shm/unifyfs

# test_ncmpi_put_var1_schar executes many small writes,
# it was necessary to reduce the chunk size to avoid exhausting space
export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 \* 4096)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 1024 \* 1048576)
export UNIFYFS_LOGIO_SPILL_SIZE=$(expr 0 \* 1048576)

export UNIFYFS_LOG_DIR=`pwd`/logs
export UNIFYFS_LOG_VERBOSITY=1

export LD_LIBRARY_PATH="${installdir}/lib:${installdir}/lib64:$LD_LIBRARY_PATH"

# turn off darshan profiling
export DARSHAN_DISABLE=1

export LD_PRELOAD="${installdir}/lib/libunifyfs_mpi_gotcha.so"
srun --label --overlap -n 2 -N 2 ./mpiopen

Running that, I get the following output:

+ srun --label --overlap -n 2 -N 2 ./mpiopen
1: 1 of 2
0: 0 of 2
1: 1006679845
1: 201911579
0: 469947936
0: 201911579

I should have run those return codes through MPI_Error_string. In any case, you can see that the first integer printed by rank 0 differs from the one printed by rank 1, so at least one of them (possibly both) is something other than MPI_SUCCESS. In my PnetCDF test, rank 1 usually reports ENOENT, while rank 0 detects that rank 1 failed and reports a more generic error.
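
For the record, here is a minimal sketch of the decoding I should have done, assuming it is dropped in right after the MPI_File_open and MPI_File_close calls in the reproducer above (rc and rank are the variables already defined there):

  /* sketch: translate the MPI return code into a readable message */
  if (rc != MPI_SUCCESS) {
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(rc, msg, &len);
    printf("%d: MPI call failed with %d: %s\n", rank, rc, msg);
    fflush(stdout);
  }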

@MichaelBrim , are you able to reproduce this?

MichaelBrim commented 1 year ago

@adammoody A couple of questions:

  1. I don't see a link against libunifyfs_mpi_gotcha in your build command. Is that just an oversight when submitting the issue?
  2. Did you allocate NVM resources to your job (i.e., using '-C nvme') and then use srun? Without the NVM option, the module-provided setting for UNIFYFS_LOGIO_SPILL_DIR (/mnt/bb/$USER) won't exist.
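
For reference, a hedged sketch of what that allocation and spill setup might look like (the project account is a placeholder; only -C nvme and /mnt/bb/$USER come from the notes above):

# sketch only: request node-local NVMe and point the spill dir at it
salloc -N 2 -p batch -C nvme -A <project>
export UNIFYFS_LOGIO_SPILL_DIR=/mnt/bb/$USER
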
MichaelBrim commented 1 year ago

I think you answered my questions at the same time I posted them. Next question: I don't see anything in that script that launches the servers.

adammoody commented 1 year ago

Thanks, @MichaelBrim .

I'm launching the servers manually using the following script:

#!/bin/bash

module use /sw/frontier/unifyfs/modulefiles
#module load unifyfs/1.1/gcc-12.2.0
#module show unifyfs/1.1/gcc-12.2.0

module load gcc/12.2.0
module load PrgEnv-gnu

module unload darshan-runtime

set -x

installdir=/sw/frontier/unifyfs/spack/env/unifyfs-1.1/gcc-12.2.0/view

export LD_LIBRARY_PATH=${installdir}/lib:${installdir}/lib64:$LD_LIBRARY_PATH

procs=$SLURM_NNODES

srun -n $procs -N $procs touch /var/tmp/unifyfs.conf
export UNIFYFS_CONFIGFILE=/var/tmp/unifyfs.conf

export UNIFYFS_MARGO_CLIENT_TIMEOUT=700000
export UNIFYFS_MARGO_SERVER_TIMEOUT=800000

export UNIFYFS_SERVER_LOCAL_EXTENTS=0

export UNIFYFS_SHAREDFS_DIR=/lustre/orion/csc300/scratch/$USER

export UNIFYFS_DAEMONIZE=off

export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 \* 65536)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 64 \* 1048576)
export UNIFYFS_LOGIO_SPILL_SIZE=$(expr 0 \* 1048576)

srun -n $procs -N $procs mkdir -p /dev/shm/unifyfs
export UNIFYFS_LOGIO_SPILL_DIR=/dev/shm/unifyfs

export UNIFYFS_LOG_DIR=`pwd`/logs
export UNIFYFS_LOG_VERBOSITY=5

export ABT_THREAD_STACKSIZE=256000

srun -n $procs -N $procs ${installdir}/bin/unifyfsd &

I execute that script to launch the servers, let things settle for about 10 seconds, and then run the earlier script to launch the application. Note that I'm not loading the unifyfs module here; instead, I point LD_LIBRARY_PATH directly at the install directory.
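
For clarity, the overall sequence is roughly the following (the script names are illustrative, not the actual file names):

# launch the servers with the script above
./start_servers.sh
# let the unifyfsd instances settle
sleep 10
# run the client application with the earlier configure-and-launch script
./run_mpiopen.sh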

All of these tests also try to use shared memory only: I've pointed the spill directory at /dev/shm but set its size to 0. I don't know whether this matters, but I'm pointing it out just in case.

MichaelBrim commented 1 year ago

Any reason you're not using unifyfs start to launch the servers? Its srun includes the options '--exact --overlap --ntasks-per-node=1', which may be necessary for a successful run. Also, what's your working directory when running? I ask because you're putting server logs in $PWD/logs, which would be bad if you're still in the /ccs/proj/csc300 area, since it's read-only on compute nodes.
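
For reference, a hedged sketch of the CLI-based launch (flag names should be double-checked against unifyfs start --help; the share directory simply reuses the UNIFYFS_SHAREDFS_DIR path from the script above):

# sketch only: launch and later stop the servers with the unifyfs CLI
unifyfs start --share-dir=/lustre/orion/csc300/scratch/$USER
# ... run the application ...
unifyfs terminate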

adammoody commented 1 year ago

I was launching manually because I've been doing a lot of debugging with TotalView. Sometimes I need to debug the servers and sometimes the client application. I haven't figured out the best way to do that with unifyfs start, so I keep reusing these manual launch methods.

I'm running out of my home directory, /ccs/home/$USER. The log files do show up there, and I've been using them to debug as well.

MichaelBrim commented 1 year ago

I am unable to reproduce this issue in my environment using my normal UnifyFS job setup, which uses NVM rather than shmem. Here's the successful app output I get.

> more mpiio-issue788-gotcha.out.*
::::::::::::::
mpiio-issue788-gotcha.out.frontier03321.2.0
::::::::::::::
0 of 2
0
0
::::::::::::::
mpiio-issue788-gotcha.out.frontier03322.2.1
::::::::::::::
1 of 2
0
0
adammoody commented 1 year ago

Ok, good to know. Must be something in my environment. Thanks for testing, @MichaelBrim

CamStan commented 10 months ago

@adammoody, would you consider this resolved?

adammoody commented 10 months ago

Yes, let's close this one as resolved.