Closed adammoody closed 1 year ago
@adammoody Couple questions:
I think you answered my questions at the same time I posted them. Next question, I don't see anything in that script that's launching the servers.
Thanks, @MichaelBrim .
I'm launching the servers manually using the following script:
```bash
#!/bin/bash
module use /sw/frontier/unifyfs/modulefiles
#module load unifyfs/1.1/gcc-12.2.0
#module show unifyfs/1.1/gcc-12.2.0
module load gcc/12.2.0
module load PrgEnv-gnu
module unload darshan-runtime
set -x
installdir=/sw/frontier/unifyfs/spack/env/unifyfs-1.1/gcc-12.2.0/view
export LD_LIBRARY_PATH=${installdir}/lib:${installdir}/lib64:$LD_LIBRARY_PATH
procs=$SLURM_NNODES
srun -n $procs -N $procs touch /var/tmp/unifyfs.conf
export UNIFYFS_CONFIGFILE=/var/tmp/unifyfs.conf
export UNIFYFS_MARGO_CLIENT_TIMEOUT=700000
export UNIFYFS_MARGO_SERVER_TIMEOUT=800000
export UNIFYFS_SERVER_LOCAL_EXTENTS=0
export UNIFYFS_SHAREDFS_DIR=/lustre/orion/csc300/scratch/$USER
export UNIFYFS_DAEMONIZE=off
export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 \* 65536)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 64 \* 1048576)
export UNIFYFS_LOGIO_SPILL_SIZE=$(expr 0 \* 1048576)
srun -n $procs -N $procs mkdir -p /dev/shm/unifyfs
export UNIFYFS_LOGIO_SPILL_DIR=/dev/shm/unifyfs
export UNIFYFS_LOG_DIR=`pwd`/logs
export UNIFYFS_LOG_VERBOSITY=5
export ABT_THREAD_STACKSIZE=256000
srun -n $procs -N $procs ${installdir}/bin/unifyfsd &
```
I execute that script to launch the servers, let things settle for about 10 seconds, and then run the earlier script to launch the application. I note here that I'm not loading the unifyfs module, but I'm directly pointing to the directory in LD_LIBRARY_PATH.
All of these tests are also trying to use shared memory only. I've pointed the spill directory to /dev/shm, but also set its size to 0. Don't know that this matters, but just pointing it out.
Any reason you're not using `unifyfs start` to launch the servers? Its srun includes the options `--exact --overlap --ntasks-per-node=1`, which may be necessary to run successfully. Also, what's your working directory when running? I ask since you're putting server logs in $PWD/logs, which would be bad if you're still in the /ccs/proj/csc300 area, since it's read-only on compute nodes.
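For reference, the launcher mentioned above can be invoked roughly like this. This is only a sketch: the `--share-dir` value is an assumption (it must point at a directory visible to all nodes), so check `unifyfs start --help` on your install for the exact options.

```shell
# Hedged sketch: launch UnifyFS servers via the provided CLI instead of manual srun.
# The share-dir path below is an assumption for illustration.
unifyfs start --share-dir=/lustre/orion/csc300/scratch/$USER

# ... run the MPI application here ...

# Shut the servers down when finished.
unifyfs terminate
```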
I was launching manually because I've been doing a lot of debugging with TotalView. Sometimes I need to debug the servers and sometimes the client application. I haven't figured out the best way to do this with `unifyfs start`, so I tend to keep reusing these manual launch methods.
I'm running out of my home directory, /ccs/home/$USER. The log files do seem to show up. I've been using those to debug as well.
I am unable to reproduce this issue in my environment using my normal unifyfs job setup that uses NVM, not shmem. Here's the successful app output I get.
```
> more mpiio-issue788-gotcha.out.*
::::::::::::::
mpiio-issue788-gotcha.out.frontier03321.2.0
::::::::::::::
0 of 2
0
0
::::::::::::::
mpiio-issue788-gotcha.out.frontier03322.2.1
::::::::::::::
1 of 2
0
0
```
Ok, good to know. Must be something in my environment. Thanks for testing, @MichaelBrim
@adammoody, would you consider this resolved?
Yes, let's close this one as resolved.
When running a two-process, two-node test on Frontier, it seems that MPI_File_open returns an error. I ran into this when PnetCDF tests were failing and I simplified down to this reproducer.
Built with:
Here is the script used to configure and launch. These settings probably don't matter, but I'll capture them just in case.
Running that, I get the following output:
I should have run those return codes through `MPI_Error_string`. Anyway, you can see the first integer printed by rank 0 is different from rank 1, so at least one of those got something other than `MPI_SUCCESS`, maybe both. In my PnetCDF test, rank 1 usually reports ENOENT while rank 0 detects that rank 1 failed and reports a more generic error.

@MichaelBrim, are you able to reproduce this?
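For what it's worth, the error-code decoding mentioned above can be sketched like this. The file path and open flags here are assumptions for illustration, not the actual reproducer; note that the MPI default error handler for file operations is MPI_ERRORS_RETURN, so `MPI_File_open` hands back an error code rather than aborting.

```c
/* Hedged sketch: decode the return code of MPI_File_open with MPI_Error_string.
 * The path "/unifyfs/testfile" is a placeholder assumption. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("%d of %d\n", rank, size);

    MPI_File fh;
    int rc = MPI_File_open(MPI_COMM_WORLD, "/unifyfs/testfile",
                           MPI_MODE_CREATE | MPI_MODE_RDWR,
                           MPI_INFO_NULL, &fh);
    if (rc != MPI_SUCCESS) {
        /* Translate the numeric code into a human-readable message. */
        char msg[MPI_MAX_ERROR_STRING];
        int len;
        MPI_Error_string(rc, msg, &len);
        fprintf(stderr, "rank %d: MPI_File_open failed: %s\n", rank, msg);
    } else {
        MPI_File_close(&fh);
    }

    MPI_Finalize();
    return 0;
}
```

Compiled with `mpicc` (or `cc` on Frontier's Cray wrappers) and run with two ranks on two nodes, this would print the decoded message instead of a bare integer.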