LLNL / UnifyFS

UnifyFS: A file system for burst buffers

PnetCDF large_coalesce test fails due to incorrect data on read (ROMIO problem) #752

Open adammoody opened 1 year ago

adammoody commented 1 year ago

The PnetCDF test/largefile/large_coalesce test reads back 0 for bytes that should be non-zero.

*** TESTING C   large_coalesce for skip filetype buftype coalesce  ------ 0 (at line 285): expect buf[1073741814]=97 but got 0
0 (at line 285): expect buf[1073741815]=98 but got 0
0 (at line 285): expect buf[1073741816]=99 but got 0
0 (at line 285): expect buf[1073741817]=100 but got 0
0 (at line 285): expect buf[1073741818]=101 but got 0
0 (at line 285): expect buf[1073741819]=102 but got 0
0 (at line 285): expect buf[1073741820]=103 but got 0
0 (at line 285): expect buf[1073741821]=104 but got 0
0 (at line 285): expect buf[1073741822]=105 but got 0
0 (at line 285): expect buf[1073741823]=106 but got 0
0 (at line 285): expect buf[1073741824]=107 but got 0
0 (at line 285): expect buf[1073741825]=108 but got 0
0 (at line 285): expect buf[1073741826]=109 but got 0
0 (at line 285): expect buf[1073741827]=110 but got 0
0 (at line 285): expect buf[1073741828]=111 but got 0
0 (at line 285): expect buf[1073741829]=112 but got 0
0 (at line 285): expect buf[1073741830]=113 but got 0
0 (at line 285): expect buf[1073741831]=114 but got 0
0 (at line 285): expect buf[1073741832]=115 but got 0
0 (at line 285): expect buf[1073741833]=116 but got 0
0 (at line 293): expect buf[2147483638]=65 but got 0
0 (at line 293): expect buf[2147483639]=66 but got 0
0 (at line 293): expect buf[2147483640]=67 but got 0
0 (at line 293): expect buf[2147483641]=68 but got 0
0 (at line 293): expect buf[2147483642]=69 but got 0
0 (at line 293): expect buf[2147483643]=70 but got 0
0 (at line 293): expect buf[2147483644]=71 but got 0
0 (at line 293): expect buf[2147483645]=72 but got 0
0 (at line 293): expect buf[2147483646]=73 but got 0
0 (at line 293): expect buf[2147483647]=74 but got 0
0 (at line 293): expect buf[2147483648]=75 but got 0
0 (at line 293): expect buf[2147483649]=76 but got 0
0 (at line 293): expect buf[2147483650]=77 but got 0
0 (at line 293): expect buf[2147483651]=78 but got 0
0 (at line 293): expect buf[2147483652]=79 but got 0
0 (at line 293): expect buf[2147483653]=80 but got 0
0 (at line 293): expect buf[2147483654]=81 but got 0
0 (at line 293): expect buf[2147483655]=82 but got 0
0 (at line 293): expect buf[2147483656]=83 but got 0
0 (at line 293): expect buf[2147483657]=84 but got 0

These mismatches are reported around this line of the test:

https://github.com/Parallel-NetCDF/PnetCDF/blob/c7e22c81ac4c2922f84281a4a19f7000079e6c3f/test/largefile/large_coalesce.c#L284
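
The failing indices in the log cluster around two power-of-two boundaries. The following bash sketch only restates that arithmetic, using the first and last index of each group copied from the log; it is an observation about the output, not a diagnosis. The variable names are illustrative.

```shell
#!/bin/bash
# Power-of-two boundaries (bash arithmetic is 64-bit, so 1 << 31 does not overflow).
one_gib=$((1 << 30))   # 1073741824
two_gib=$((1 << 31))   # 2147483648

# First and last failing indices of each group, copied from the test log.
first_285=1073741814; last_285=1073741833
first_293=2147483638; last_293=2147483657

# Print each range as an offset relative to the nearest boundary.
echo "line 285: $((first_285 - one_gib)) .. $((last_285 - one_gib)) relative to 2^30"
echo "line 293: $((first_293 - two_gib)) .. $((last_293 - two_gib)) relative to 2^31"
```

Both groups cover offsets -10 through 9, that is, 20 bytes straddling the 1 GiB and 2 GiB offsets.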

The same test segfaults when run on Lustre, so the failure is not unique to UnifyFS.

Tracing under a debug build of MVAPICH2, the test hits an ADIOI assertion at this line:

https://github.com/pmodels/mpich/blob/5b88f46620607707201768f4b3df39907082f344/src/mpi/romio/adio/common/ad_read_str_naive.c#L311

The value req_len = 2147483126 fails the assertion check req_len == (int) req_len.

The stack trace at this point (outermost frame first) is:

main
ncmpi_wait_all
ncmpio_wait
req_commit
wait_getput
req_aggregation
mgetput
ncmpi_read_write
PMPI_File_read_at_all
MPIOI_File_read_all
ADIOI_GEN_ReadStridedColl
ADIOI_GEN_ReadStrided
ADIOI_GEN_ReadStrided_naive

Apparently, this "naive" read code path within ROMIO does not support requests larger than 2GB.
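
An assertion of the form req_len == (int) req_len is a round-trip check: it holds only when the value survives conversion to int. Assuming a 32-bit signed int and the usual wrap-around on conversion (both are assumptions here, not something the issue states), that is equivalent to a range test. A minimal bash sketch of the range test follows; fits_in_int and the sample values are hypothetical and are not taken from ROMIO or from the failing run.

```shell
#!/bin/bash
# Assumption: C int is 32-bit signed on the platform in question.
INT_MAX=2147483647
INT_MIN=-2147483648

# Succeeds (exit status 0) when the value is representable as a 32-bit signed int.
fits_in_int() {
    local v=$1
    (( v >= INT_MIN && v <= INT_MAX ))
}

# Hypothetical request lengths on either side of the limit.
for len in 2147483647 2147483648 4294967296; do
    if fits_in_int "$len"; then
        echo "$len fits in int"
    else
        echo "$len does not fit in int"
    fi
done
```

Any length for which the range test fails would need to be split into smaller pieces or handled with a wider type before reaching a code path with this kind of check.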

The test runs as a single process, launched with this script:

#!/bin/bash
set -x

nodes=$SLURM_NNODES
procs=$(($nodes * 1))

export UNIFYFS_MARGO_CLIENT_TIMEOUT=70000

export UNIFYFS_CONFIGFILE=/var/tmp/unifyfs.conf
touch $UNIFYFS_CONFIGFILE

srun --overlap -n $nodes -N $nodes mkdir /dev/shm/unifyfs
export UNIFYFS_LOGIO_SPILL_DIR=/dev/shm/unifyfs

export UNIFYFS_CLIENT_LOCAL_EXTENTS=1
export UNIFYFS_CLIENT_WRITE_SYNC=0

export UNIFYFS_LOG_VERBOSITY=1

# test_ncmpi_put_var1_schar executes many small writes;
# it was necessary to reduce the chunk size to avoid exhausting space
export UNIFYFS_LOG_DIR=`pwd`/logs
export UNIFYFS_LOGIO_CHUNK_SIZE=$(expr 1 \* 4096)
export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 1024 \* 1048576)
export UNIFYFS_LOGIO_SPILL_SIZE=$(expr 0 \* 1048576)

export UNIFYFS_CLIENT_SUPER_MAGIC=0

installdir="/path/to/unifyfs.git/install"
export LD_LIBRARY_PATH="${installdir}/lib:${installdir}/lib64:$LD_LIBRARY_PATH"

# turn off darshan profiling
export DARSHAN_DISABLE=1

# sleep for some time after unlink
# see https://github.com/LLNL/UnifyFS/issues/744
export UNIFYFS_CLIENT_UNLINK_USECS=1000000

export LD_PRELOAD="${installdir}/lib/libunifyfs_mpi_gotcha.so"

filename="/unifyfs/testfile.nc"

export UNIFYFS_LOGIO_SHMEM_SIZE=$(expr 8192 \* 1048576)
cd test/largefile
./large_coalesce $filename
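
The script exports UNIFYFS_LOGIO_SHMEM_SIZE twice. In a shell, the later export replaces the earlier one, so the test runs with the 8192 MiB value. The sketch below recomputes the sizes from the script with bash arithmetic; the variable names are illustrative and are not UnifyFS settings.

```shell
#!/bin/bash
mib=1048576                      # bytes per MiB, as used in the script

chunk_size=$((1 * 4096))         # UNIFYFS_LOGIO_CHUNK_SIZE
shmem_first=$((1024 * mib))      # first UNIFYFS_LOGIO_SHMEM_SIZE export
shmem_final=$((8192 * mib))      # second export, the one in effect at launch
spill_size=$((0 * mib))          # UNIFYFS_LOGIO_SPILL_SIZE

echo "chunk size:          $chunk_size bytes"
echo "shmem size (first):  $shmem_first bytes"
echo "shmem size (final):  $shmem_final bytes ($((shmem_final / (1 << 30))) GiB)"
echo "spill size:          $spill_size bytes"
```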