Open lukebroskop opened 1 year ago
I am writing a 4.3 TB file to the Orion filesystem attached to Frontier. For most of the ranks, Darshan reports negative values for MPIIO_BYTES_WRITTEN, e.g.:
MPI-IO 0 7703102304952938401 MPIIO_BYTES_WRITTEN -32574 /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d /lustre/orion lustre
MPI-IO 1 7703102304952938401 MPIIO_BYTES_WRITTEN -32726 /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d /lustre/orion lustre
MPI-IO 2 7703102304952938401 MPIIO_BYTES_WRITTEN -64988 /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d /lustre/orion lustre
MPI-IO 3 7703102304952938401 MPIIO_BYTES_WRITTEN -32646 /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d /lustre/orion lustre
MPI-IO 4 7703102304952938401 MPIIO_BYTES_WRITTEN -32494 /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d /lustre/orion lustre
MPI-IO 5 7703102304952938401 MPIIO_BYTES_WRITTEN -32438 /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d /lustre/orion lustre
MPI-IO 6 7703102304952938401 MPIIO_BYTES_WRITTEN -65532 /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d /lustre/orion lustre
MPI-IO 8 7703102304952938401 MPIIO_BYTES_WRITTEN -32766 /lustre/orion/ven114/scratch/lukebr/TAMM/tensor3d /lustre/orion lustre
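To list every affected rank without reading the full dump by eye, a small filter over the parser's text output may help. This is a minimal sketch, not part of the original report. The function name `negative_counters` is hypothetical, and it assumes the whitespace-separated column order seen in the records above (module, rank, record id, counter, value, file, mount point, filesystem type).

```shell
# Hypothetical helper: print the rank and value of every record of the given
# counter whose value is negative. Reads darshan-parser-style text on stdin.
# Assumed columns: module rank record_id counter value file mount_point fs_type
negative_counters() {
    counter="${1:-MPIIO_BYTES_WRITTEN}"
    awk -v counter="$counter" '
        /^#/ { next }                      # skip header/comment lines
        $4 == counter && ($5 + 0) < 0 {    # "+ 0" forces a numeric comparison
            printf "rank %s: %s\n", $2, $5
        }
    '
}
```

Assuming the `darshan-parser` utility from darshan-util is on the PATH, the function could be fed with something like `darshan-parser <logfile> | negative_counters MPIIO_BYTES_WRITTEN`.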
Reproducer: the following build/test works on Frontier.
git clone https://github.com/NWChemEx-Project/TAMM.git
ml rocm/5.5.1
ml cce/16.0.0
ml cray-mpich/8.1.26
ml cray-libsci/23.05.1.4
ml craype/2.7.21
ml cray-hdf5-parallel/1.12.2.3
ml cmake
Build
cd TAMM
mkdir build && cd build
export TAMM_INSTALL_PATH=<your install path>
CC=cc CXX=CC FC=ftn cmake -DCMAKE_INSTALL_PREFIX=$TAMM_INSTALL_PATH -DUSE_HIP=ON -DROCM_ROOT=$ROCM_PATH -DGPU_ARCH=gfx90a -DGCCROOT=/opt/gcc/12.2.0/snos -DBLAS_INT4=ON -DHDF5_ROOT=$HDF5_ROOT ..
make -j20
The make step should take about 3-4 minutes
Test, from your build directory run:
srun -A <your project account> -n800 -N50 -qdebug -B 1:7:2 --hint=multithread --ntasks-per-node=16 -t10:00 build/TAMM_Tests_External-prefix/src/TAMM_Tests_External-build/Test_IO 3643
The Darshan output is sent to:
/lustre/orion/darshan/frontier/YR/M/D/${USER}_*_id${JOBID}*