darshan-hpc / darshan

Darshan I/O characterization tool

Darshan is reporting negative values for MPIIO_BYTES_WRITTEN (seen on Frontier) #957

Open lukebroskop opened 10 months ago

lukebroskop commented 10 months ago

I am writing a 4.3 TB file to the Orion filesystem attached to Frontier. For most ranks, Darshan reports negative values for MPIIO_BYTES_WRITTEN, e.g.:

MPI-IO  0   7703102304952938401 MPIIO_BYTES_WRITTEN -32574  /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d   /lustre/orion   lustre
MPI-IO  1   7703102304952938401 MPIIO_BYTES_WRITTEN -32726  /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d   /lustre/orion   lustre
MPI-IO  2   7703102304952938401 MPIIO_BYTES_WRITTEN -64988  /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d   /lustre/orion   lustre
MPI-IO  3   7703102304952938401 MPIIO_BYTES_WRITTEN -32646  /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d   /lustre/orion   lustre
MPI-IO  4   7703102304952938401 MPIIO_BYTES_WRITTEN -32494  /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d   /lustre/orion   lustre
MPI-IO  5   7703102304952938401 MPIIO_BYTES_WRITTEN -32438  /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d   /lustre/orion   lustre
MPI-IO  6   7703102304952938401 MPIIO_BYTES_WRITTEN -65532  /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d   /lustre/orion   lustre
MPI-IO  8   7703102304952938401 MPIIO_BYTES_WRITTEN -32766  /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d   /lustre/orion   lustre
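
To list the affected ranks in a larger log, the counter lines can be filtered with a small script. This is a minimal sketch and not part of the original report. It assumes the lines above are text output from darshan-parser with whitespace-separated columns in the order module, rank, record id, counter, value, file name. The names `parse_counter_line` and `negative_byte_counters` are hypothetical helpers, and the rank 99 line in the sample is made up to show a healthy value being skipped.

```python
from typing import List, NamedTuple, Optional


class CounterRecord(NamedTuple):
    module: str
    rank: int
    record_id: int
    counter: str
    value: int
    file_name: str


def parse_counter_line(line: str) -> Optional[CounterRecord]:
    """Parse one counter line; return None for comments, short lines,
    non-byte counters, or lines whose numeric fields do not parse.

    Assumed column order: module, rank, record id, counter, value, file name.
    """
    if line.lstrip().startswith("#"):
        return None
    fields = line.split()
    if len(fields) < 6:
        return None
    module, rank, record_id, counter, value, file_name = fields[:6]
    if "_BYTES_" not in counter:
        return None
    try:
        return CounterRecord(module, int(rank), int(record_id),
                             counter, int(value), file_name)
    except ValueError:
        return None


def negative_byte_counters(lines: List[str]) -> List[CounterRecord]:
    """Return every byte counter record whose value is below zero."""
    records = (parse_counter_line(line) for line in lines)
    return [rec for rec in records if rec is not None and rec.value < 0]


# The first two data lines are copied from the report above.
# The rank 99 line is hypothetical and only shows that positive values are skipped.
SAMPLE = """\
# hypothetical comment line
MPI-IO  0   7703102304952938401 MPIIO_BYTES_WRITTEN -32574  /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d   /lustre/orion   lustre
MPI-IO  1   7703102304952938401 MPIIO_BYTES_WRITTEN -32726  /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d   /lustre/orion   lustre
MPI-IO  99  7703102304952938401 MPIIO_BYTES_WRITTEN 4096    /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d   /lustre/orion   lustre
"""

if __name__ == "__main__":
    for rec in negative_byte_counters(SAMPLE.splitlines()):
        print(rec.rank, rec.counter, rec.value)
```

To run it on a real log, replace `SAMPLE.splitlines()` with the lines of a saved text dump of the log.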

Reproducer:

The following build and test steps work on Frontier:

  1. Get code

    git clone https://github.com/NWChemEx-Project/TAMM.git

  2. Set up environment

    ml rocm/5.5.1
    ml cce/16.0.0
    ml cray-mpich/8.1.26
    ml cray-libsci/23.05.1.4
    ml craype/2.7.21
    ml cray-hdf5-parallel/1.12.2.3
    ml cmake

  3. Build

    cd TAMM
    mkdir build && cd build
    export TAMM_INSTALL_PATH=<your install path>
    CC=cc CXX=CC FC=ftn cmake -DCMAKE_INSTALL_PREFIX=$TAMM_INSTALL_PATH -DUSE_HIP=ON -DROCM_ROOT=$ROCM_PATH -DGPU_ARCH=gfx90a -DGCCROOT=/opt/gcc/12.2.0/snos -DBLAS_INT4=ON -DHDF5_ROOT=$HDF5_ROOT ..
    make -j20

  The make step should take about 3-4 minutes.

  4. Test, from your build directory run:

    srun -A <your project account> -n800 -N50 -qdebug -B 1:7:2 --hint=multithread --ntasks-per-node=16 -t10:00 build/TAMM_Tests_External-prefix/src/TAMM_Tests_External-build/Test_IO 3643

The Darshan output is written to: /lustre/orion/darshan/frontier/YR/M/D/${USER}_*_id${JOBID}*