ldms-darshan-analysis

Darshan-LDMS absolute timestamp

Explore running I/O benchmarks with system stressors to show that we can use the system timestamps collected by Darshan-LDMS to diagnose application bottlenecks automatically.

Using stress-ng for stressing the system: https://wiki.ubuntu.com/Kernel/Reference/stress-ng

sudo apt install stress-ng
stress-ng --matrix 1 -t 1m -v

Using mixed stressor classes:

stress-ng --cpu 0 --cpu-method fft &
stress-ng --cpu 0 --matrix-method frobenius &
stress-ng --tz --lsearch 0  --all 1 &

For a specific class:

stress-ng --class cpu --tz -v --all 4 & 
stress-ng --class cpu-cache --tz -v --all 4 &
stress-ng --class io --tz -v --all 4 &
stress-ng --class filesystem --tz -v --all 4 &
stress-ng --class memory --tz -v --all 4 &

We will collect Darshan-LDMS logs and system logs (meminfo, procstat, Lustre, dstat).
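For the dstat piece, a minimal sketch of how the system counters could be captured alongside a run (the flag selection and output file name are just one reasonable choice):

# Sample CPU, memory, disk, and network once per second into a timestamped CSV
dstat -t -c -m -d -n --output dstat-${SLURM_JOB_ID}.csv 1 > /dev/null &
DSTAT_PID=$!

# ... run the benchmark ...

kill $DSTAT_PID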

First test:

  1. HACC-IO with 50,000,000 particles and specific classes running in the background during the entire execution:
    stress-ng --class cpu --tz -v --all 4 & 
    stress-ng --class cpu-cache --tz -v --all 4 &
    stress-ng --class io --tz -v --all 4 &
    stress-ng --class filesystem --tz -v --all 4 &
    stress-ng --class memory --tz -v --all 4 &

To stop the stressors running in the background:

killall -2 stress-ng
  2. HACC-IO with 100,000,000 particles (or another setup that exceeds 10 seconds of execution) and specific classes running in the background for 10 seconds:
    stress-ng --class cpu --tz -v --all 4 -t 10s & 
    stress-ng --class cpu-cache --tz -v --all 4 -t 10s &
    stress-ng --class io --tz -v --all 4 -t 10s &
    stress-ng --class filesystem --tz -v --all 4 -t 10s &
    stress-ng --class memory --tz -v --all 4 -t 10s &

Run one stressor class per run, for a total of 10 experiments plus 2 baseline executions (no stressors), as sketched below.
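A minimal driver sketch for this matrix (hypothetical script; the particle counts and classes are taken from the setups above, and the mpirun line would need adjusting to the real launch command):

#!/bin/bash
# One stressor class per run, for each of the two HACC-IO problem sizes,
# plus one baseline run per size: 10 stressed experiments + 2 baselines.
classes=("cpu" "cpu-cache" "io" "filesystem" "memory")

for particles in 50000000 100000000; do
    for class_type in "${classes[@]}"; do
        stress-ng --class "$class_type" --tz -v --all 4 &
        mpirun -np 2 ./hacc_io "$particles" outputfile
        killall -2 stress-ng   # stop the background stressors (SIGINT)
        sleep 10               # let the node settle between runs
    done
    # baseline execution, no stressors
    mpirun -np 2 ./hacc_io "$particles" outputfile
done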

Experimental Design

We are organizing an experimental plan to report the benefits of the absolute timestamps collected by Darshan-LDMS for identifying I/O bottlenecks, both from the Darshan-LDMS information itself and by correlating it with other system telemetry data. We will run I/O benchmarks and real applications with and without system stressors and evaluate the performance of the experiments using Darshan-LDMS logs.

Applications

We selected 3 microbenchmarks (HACC-IO, IOR, IOzone) and 2 real applications (CANDLE and VPIC-IO):

# HACC-IO
cd ./benchmarks
git clone https://github.com/glennklockwood/hacc-io.git
mv hacc-io/ haccio/
cd haccio/
make clean
make CXX=mpicxx

# Run for
mpirun -np 2 ./hacc_io 10000000 outputfile
mpirun -np 2 ./hacc_io 5000000 outputfile

# IOR
git clone https://github.com/hpc/ior.git
cd ior/
./bootstrap
./configure
make
make check

# Run for
./src/ior -a POSIX

# IOzone
wget https://www.iozone.org/src/current/iozone3_493.tgz
tar -xzvf iozone3_493.tgz
rm *tgz
mv iozone3_493/ iozone
cd iozone/src/current/
make linux-AMD64

# Run for
./iozone -a

# VPIC
git clone https://github.com/lanl/vpic.git
cd vpic
mkdir build
cd build
../arch/reference-Release
make

# Compile the sample input deck, then run it
./bin/vpic ../sample/harris 
mpirun -n 2 ./harris.Linux --tpp 8

# CANDLE
pip install -r requirements-candle.txt
wget https://github.com/ECP-CANDLE/Benchmarks/archive/refs/tags/v0.4.tar.gz
tar -xzvf v0.4.tar.gz
mv Benchmarks-0.4/ candle/
rm v0.4.tar.gz

# Run for
python3 ./candle/Pilot3/P3B3/p3b3_baseline_keras2.py

Using the configuration file (./Pilot3/P3B3/p3b3_default_model.txt):

[Global_Params]
data_url = 'ftp://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot3/'
train_data = 'P3B3_data.tar.gz'
model_name = 'p3b3'
learning_rate = 0.0001
batch_size = 10
epochs = 10
dropout = 0.5
optimizer = 'adam'
wv_len = 300
filter_sizes = 3
filter_sets = 3
num_filters = 100
emb_l2 = 0.001
w_l2 = 0.01
task_list = [0, 2, 3]
task_names = ['site', 'laterality', 'histology', 'grade']

Stressors

Using stress-ng for stressing the system: https://wiki.ubuntu.com/Kernel/Reference/stress-ng

Script to generate the stress-ng command lines for different parameters:

#!/bin/bash

declare -a arr=("cpu" "cpu-cache" "device" "io" "interrupt" "filesystem" "memory" "network" "os")

## now loop through the above array
for class_type in "${arr[@]}"
do
   echo "stress-ng --class $class_type --tz -v --seq 0 &"
   echo "stress-ng --class $class_type --tz -v --all 1 &"
   echo "stress-ng --class $class_type --tz -v --all 2 &"
   echo "stress-ng --class $class_type --tz -v --all 4 &"
done

Different setups for stressors:

stress-ng --class cpu --tz -v --seq 0 &
stress-ng --class cpu --tz -v --all 1 &
stress-ng --class cpu --tz -v --all 2 &
stress-ng --class cpu --tz -v --all 4 &
stress-ng --class cpu-cache --tz -v --seq 0 &
stress-ng --class cpu-cache --tz -v --all 1 &
stress-ng --class cpu-cache --tz -v --all 2 &
stress-ng --class cpu-cache --tz -v --all 4 &
stress-ng --class device --tz -v --seq 0 &
stress-ng --class device --tz -v --all 1 &
stress-ng --class device --tz -v --all 2 &
stress-ng --class device --tz -v --all 4 &
stress-ng --class io --tz -v --seq 0 &
stress-ng --class io --tz -v --all 1 &
stress-ng --class io --tz -v --all 2 &
stress-ng --class io --tz -v --all 4 &
stress-ng --class interrupt --tz -v --seq 0 &
stress-ng --class interrupt --tz -v --all 1 &
stress-ng --class interrupt --tz -v --all 2 &
stress-ng --class interrupt --tz -v --all 4 &
stress-ng --class filesystem --tz -v --seq 0 &
stress-ng --class filesystem --tz -v --all 1 &
stress-ng --class filesystem --tz -v --all 2 &
stress-ng --class filesystem --tz -v --all 4 &
stress-ng --class memory --tz -v --seq 0 &
stress-ng --class memory --tz -v --all 1 &
stress-ng --class memory --tz -v --all 2 &
stress-ng --class memory --tz -v --all 4 &
stress-ng --class network --tz -v --seq 0 &
stress-ng --class network --tz -v --all 1 &
stress-ng --class network --tz -v --all 2 &
stress-ng --class network --tz -v --all 4 &
stress-ng --class os --tz -v --seq 0 &
stress-ng --class os --tz -v --all 1 &
stress-ng --class os --tz -v --all 2 &
stress-ng --class os --tz -v --all 4 &

Job Execution Setup

Scontrol

The output of scontrol show job <job-id> for the HACC-IO execution with 10 million particles, 16 nodes, and 32 tasks per node:

 Command=/projects/ovis/darshanConnector/common/darshan/darshan-test/regression/cray-module-common/slurm-submit.sl --ntasks-per-node=32 /projects/ovis/darshanConnector/apps/rhel9.7/hacc-io/hacc_io 10000000 /pscratch/<user>/haccTest/darshan
   WorkDir=/ceeprojects/ovis/darshanConnector/common/darshan/darshan-test/regression/test-cases
   StdErr=/pscratch/<user>/darshan-ldms-output/17301435-HACC_pscratch_10.err
   StdIn=/dev/null
   StdOut=/pscratch/<user>/darshan-ldms-output/17301435-HACC_pscratch_10.out

slurm-submit.sl

The slurm-submit.sl script, shown below, is where stress-ng and the job are started:

#!/bin/bash -l
export PBS_JOBID=$SLURM_JOB_ID
export DARSHAN_LOGFILE=$DARSHAN_TMP/${PROG}.${PBS_JOBID}.darshan

mkdir /tmp/stress-tmp/
declare -a arr=("cpu")
#("cpu-cache" "device" "io" "interrupt" "filesystem" "memory" "network" "os")

for class_type in "${arr[@]}"
do
        {
        #taskset -c 0 stress-ng --temp-path /tmp/stress-tmp/ --class $class_type --tz -v --seq 0 | echo "$(date -d @$(date +%s.%N)): $class_type stressor started" &
        #taskset -c 0 stress-ng --temp-path /tmp/stress-tmp/ --class $class_type --tz -v --all 1 | echo "$(date -d @$(date +%s.%N)): $class_type stressor started" &
        #taskset -c 0 stress-ng --temp-path /tmp/stress-tmp/ --class $class_type --tz -v --all 2 | echo "$(date -d @$(date +%s.%N)): $class_type stressor started" &
        #taskset -c 0 stress-ng --temp-path /tmp/stress-tmp/ --class $class_type --tz -v --all 4 | echo "$(date -d @$(date +%s.%N)): $class_type stressor started" &

        #taskset -c 0 stress-ng --temp-path /tmp/stress-tmp/ --class $class_type --tz -v --seq 0 -t 10s | echo "$(date -d @$(date +%s.%N)): $class_type stressor started" &
        #taskset -c 0 stress-ng --temp-path /tmp/stress-tmp/ --class $class_type --tz -v --all 1 -t 10s | echo "$(date -d @$(date +%s.%N)): $class_type stressor started" &
        #taskset -c 0 stress-ng --temp-path /tmp/stress-tmp/ --class $class_type --tz -v --all 2 -t 10s | echo "$(date -d @$(date +%s.%N)): $class_type stressor started" &
        # print the start marker, then background the stressor
        # (piping stress-ng into echo, as above, would discard its output)
        echo "$(date -d @$(date +%s.%N)): $class_type stressor started"
        taskset -c 0 stress-ng --temp-path /tmp/stress-tmp/ --class $class_type --tz -v --all 4 -t 30s &

        } 2> $DARSHAN_TMP/stress-ng.${PBS_JOBID}.err
done

START=$(date +%s.%N)
echo "$(date -d @$(date +%s.%N)): Application Started"
srun --mpi=pmi2 "$@"
echo "$(date -d @$(date +%s.%N)): Application Ended"
END=$(date +%s.%N)

killall -2 stress-ng
#| echo "$(date -d @$(date +%s.%N)): Stressor Killed"

DIFF=$(echo "$END - $START" | bc)
echo "The DiffOfTime = $DIFF"

lfs getstripe /nscratch/spwalto/iorTest/darshan* > $DARSHAN_TMP/${PBS_JOBID}-${PROG}-OST.txt

# parse log with the dxt parser
$DARSHAN_PATH/bin/darshan-dxt-parser --show-incomplete $DARSHAN_LOGFILE > $DARSHAN_TMP/${PROG}.${PBS_JOBID}-dxt.darshan.txt
if [ $? -ne 0 ]; then
    echo "Error: failed to parse ${DARSHAN_LOGFILE} for dxt tracing" 1>&2
    exit 1
fi

# parse log with normal parser
$DARSHAN_PATH/bin/darshan-parser --all $DARSHAN_LOGFILE > $DARSHAN_TMP/${PROG}.${PBS_JOBID}.darshan.txt
if [ $? -ne 0 ]; then
    echo "Error: failed to parse ${DARSHAN_LOGFILE}" 1>&2
    exit 1
fi


Slurm Output

Below is a slurm output example for HACC-IO with 1 node, 10 million particles and 32 tasks:

cat /pscratch/<user>/darshan-ldms-output/17301901-HACC_pscratch_10.out
Tue Sep 26 14:14:57 MDT 2023: Stressor Started
-------- Aggregate Performance --------
 WRITE Checkpoint Perf: 1954.54 BW[MB/s] 12965306368 Bytes 6.32612 MaxTime[sec]
-------- Aggregate Performance --------
 READ Restart Perf: 8973.73 BW[MB/s] 12965306368 Bytes 1.37788 MaxTime[sec]
 CONTENTS VERIFIED... Success
Tue Sep 26 14:19:30 MDT 2023: Stressor Killed
The DiffOfTime = 272.390139881
306M    /pscratch/<user>/haccTest/darshan-Part00000000-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000001-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000002-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000003-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000004-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000005-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000006-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000007-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000008-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000009-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000010-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000011-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000012-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000013-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000014-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000015-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000016-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000017-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000018-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000019-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000020-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000021-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000022-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000023-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000024-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000025-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000026-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000027-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000028-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000029-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000030-of-00000032.data
306M    /pscratch/<user>/haccTest/darshan-Part00000031-of-00000032.data

Data collected

Each experiment will collect Darshan-LDMS and LDMS system utilization data.

[IN PROGRESS] Experiments with IOR

We want to show how Darshan-LDMS logs can be used to detect I/O bottlenecks at runtime and to diagnose them in real time. We will demonstrate this by running applications with and without stressors to show, for example, that a shared system can impact I/O performance.

Eclipse specs

Post-processing logs:

Add a header to the CSVs:

sed -i '1s/^/uid,exe,job_id,rank,ProducerName,file,record_id,module,type,max_byte,switches,flushes,cnt,op,pt_sel,irreg_hslab,reg_hslab,ndims,npoints,off,len,start,dur,total,timestamp\n/' 17324718-IOR_pscratch_22.csv
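The same header can be prepended to every per-job CSV in one pass (a sketch; the glob pattern is an assumption about the file naming):

for f in *-IOR_pscratch_*.csv; do
    sed -i '1s/^/uid,exe,job_id,rank,ProducerName,file,record_id,module,type,max_byte,switches,flushes,cnt,op,pt_sel,irreg_hslab,reg_hslab,ndims,npoints,off,len,start,dur,total,timestamp\n/' "$f"
done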

Create CSV from DXT:

# Remove headers
sed -i '/^#/d' $filename
# Remove multiple spaces and replace with one
cat $filename | tr -s ' ' > tmp.csv
# Replace space with comma
sed -e 's/\s\+/,/g' tmp.csv > $filename
sed -i '/^$/d' $filename

# Add this header line to the resulting CSV:
# n,Module,Rank,Wt/Rd,Segment,Offset,Length,Start,End

# Concatenate per-run CSVs if needed:
cat file1 >> file2
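The steps above can be consolidated into a small script (a sketch, assuming the input is the text produced by darshan-dxt-parser):

#!/bin/bash
# dxt2csv.sh <parsed-dxt-file> -- convert darshan-dxt-parser output to CSV in place
filename="$1"

sed -i '/^#/d' "$filename"                  # drop header/comment lines
tr -s ' ' < "$filename" > tmp.csv           # squeeze repeated spaces
sed -e 's/\s\+/,/g' tmp.csv > "$filename"   # turn whitespace into commas
sed -i '/^$/d' "$filename"                  # drop empty lines
rm -f tmp.csv

# prepend the column header
sed -i '1s/^/n,Module,Rank,Wt\/Rd,Segment,Offset,Length,Start,End\n/' "$filename"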

Install Darshan locally:

wget https://ftp.mcs.anl.gov/pub/darshan/releases/darshan-3.4.4.tar.gz
tar -xzvf darshan-3.4.4.tar.gz
mkdir build/ && mkdir build/logs
./prepare.sh && cd build/
../configure --with-log-path-by-env=LOGFILE_PATH_DARSHAN --prefix=/<path-to-darshan>/build/install --with-jobid-env=PBS_JOBID CC=mpicc && make && make install

To run IOR or other applications while collecting traces:

export LD_PRELOAD=/path-to-darshan-install/lib/libdarshan.so
export LOGFILE_PATH_DARSHAN=<path-to-darshan>/build/logs
./ior -i 6 -b 144k -t 24k -s 1024 -F -C -e -k -o /pscratch/user/iorTest/darshan

Parse results to textual format:

darshan-parser $LOGFILE_PATH_DARSHAN/filename.darshan > $LOGFILE_PATH_DARSHAN/filename.txt
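To parse every collected log in one pass (sketch):

for log in "$LOGFILE_PATH_DARSHAN"/*.darshan; do
    darshan-parser "$log" > "${log%.darshan}.txt"
done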

Runs with different filesystems and rank counts

  1. 3 runs (17534968, 17534969, 17534976) with 32 ranks (results in the short_runs/ folders). Command:
/projects/ovis/darshanConnector/apps/rhel9.7/ior/src/ior -i 6 -b 144k -t 24k -s 1024 -F -C -e -k -o /pscratch/spwalto/iorTest/darshan
  2. 3 runs (17532847, 17532848, 17532849) with the output file written to memory (i.e., /tmp/) and 36 ranks, each pinned to a core (i.e., --cpu-bind=v,RANK). Command:
/projects/ovis/darshanConnector/apps/rhel9.7/ior/src/ior -i 6 -b 2m -t 1m -s 6 -F -C -e -k -o /tmp/tmp.dJNPwjm8QH
  3. 3 runs (17533880, 17533919, 17533963) with the output file written to the pscratch filesystem and 36 ranks, each pinned to a core (i.e., --cpu-bind=v,RANK). Command:
/projects/ovis/darshanConnector/apps/rhel9.7/ior/src/ior -i 6 -b 4m -t 2m -s 1024 -F -C -e -k -o /pscratch/user/iorTest/darshan

Analysis and Visualization

Darshan-LDMS + System usage

Focusing on the first iteration of the first execution:

[figures: Darshan-LDMS I/O traces for the first iteration of the first execution]

There are a few interesting bottlenecks whose sources we can try to identify in this case:

  1. Some longer operations
  2. A long pause before closing after writing
  3. Some ranks finishing much faster than others
  4. No operations between the end of the last read close and the opening of files for writing; the application is doing something else in that interval

An ideal automatic tool would have to:

  1. Identify longer operations
  2. Identify long intervals between the last read/write and a metadata operation (open/close)
  3. Identify the distance between the first rank to finish and the others
  4. Identify long intervals between operations in the same rank

I will start with a manual correlation of system logs against the specific time ranges where we found application bottlenecks using Darshan-LDMS, so we learn which parameters are interesting and can then implement Python code to automate the analysis.
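As a first cut at check (1), before the Python tooling exists, long operations can be flagged directly from the DXT-derived CSV described above with awk (a sketch; the 0.5 s threshold and the file name are assumptions):

# Columns per the DXT CSV header: n,Module,Rank,Wt/Rd,Segment,Offset,Length,Start,End
# Print every operation whose End - Start exceeds 0.5 seconds.
awk -F, 'NR > 1 && ($9 - $8) > 0.5 {
    printf "rank %s: %s segment %s took %.3f s\n", $3, $4, $5, $9 - $8
}' 17324718-IOR_pscratch_22-dxt.csv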

Lustre logs:

[figure: Lustre metrics (opens, closes, inodes, bytes written/read, syncs, seeks) during the run]

The opens are very fast, practically instantaneous, but the closes are not: as we saw, some ranks wait for an interval before closing the file. The inode results are as expected, increasing before the writing and reading of the file start. The number of bytes written and read also follows the application pattern collected by LDMS.

The interesting results are the synchronization operation counts, which increase while the application performs the synchronization before the reading phase, indicating that this interval without I/O is due to the synchronization ensuring all ranks have closed the file. The seek counts increase gradually, showing that IOR exercises disk access throughout the execution while reads and writes are running. Last is the close behavior, which shows that some closes took a long time to finish.