IO500 / io500

IO500 Storage Benchmark source code

Failure to hit stonewall while running IO-500 benchmark in extended mode. #28

Closed: mrashid2 closed this issue 2 years ago

mrashid2 commented 3 years ago

In my experiment, when I ran the IO-500 benchmark in extended mode, it did not hit the stonewall time of 300 s for the ior-easy-write phase. The write operations kept going until the cluster storage space was nearly exhausted. The same thing occurred for the ior-hard-write phase. I have not seen this problem when running in standard mode.

Below is an example instance from my local experiment in which ior-easy-write took a very long time (the benchmark configuration for the ior-easy-write phase was the default):

result_summary.txt:

[ior-easy-write]
t_start         = 2021-05-10 01:26:16
exe             = ./ior -C -Q 1 -g -G 761181371 -k -e -o ./datafiles/2021.05.10-01.26.15/ior-easy/ior_file_easy -O stoneWallingStatusFile=./results/2021.05.10-01.26.15/ior-easy.stonewall -t 2m -b 9920000m -F -w -D 300 -O stoneWallingWearOut=1 -a POSIX -O saveRankPerformanceDetailsCSV=./results/2021.05.10-01.26.15/ior-easy-write.csv
throughput-stonewall = 1.27
score           = 0.396580
t_delta         = 3233.3158
t_end           = 2021-05-10 02:20:09

ior-easy-write.txt:

IOR-3.4.0+dev: MPI Coordinated Test of Parallel I/O
Began               : Mon May 10 01:26:16 2021
Command line        : ./ior -C -Q 1 -g -G 761181371 -k -e -o ./datafiles/2021.05.10-01.26.15/ior-easy/ior_file_easy -O stoneWallingStatusFile=./results/2021.05.10-01.26.15/ior-easy.stonewall -t 2m -b 9920000m -F -w -D 300 -O stoneWallingWearOut=1 -a POSIX -O saveRankPerformanceDetailsCSV=./results/2021.05.10-01.26.15/ior-easy-write.csv
Machine             : Linux node1.mrashid2-qv98443.dirr-pg0.wisc.cloudlab.us
TestID              : 0
StartTime           : Mon May 10 01:26:16 2021
Path                : ./datafiles/2021.05.10-01.26.15/ior-easy/ior_file_easy.00000000
FS                  : 1.5 TiB   Used FS: 0.0%   Inodes: 23.6 Mi   Used Inodes: 0.0%

Options: 
api                 : POSIX
apiVersion          : 
test filename       : ./datafiles/2021.05.10-01.26.15/ior-easy/ior_file_easy
access              : file-per-process
type                : independent
segments            : 1
ordering in a file  : sequential
ordering inter file : constant task offset
task offset         : 1
nodes               : 4
tasks               : 4
clients per node    : 1
repetitions         : 1
xfersize            : 2 MiB
blocksize           : 9.46 TiB
aggregate filesize  : 37.84 TiB
stonewallingTime    : 300
stoneWallingWearOut : 1

Results: 

access    bw(MiB/s)  IOPS       Latency(s)  block(KiB) xfer(KiB)  open(s)    wr/rd(s)   close(s)   total(s)   iter
------    ---------  ----       ----------  ---------- ---------  --------   --------   --------   --------   ----
stonewalling pairs accessed min: 9763 max: 164128 -- min data: 19.1 GiB mean data: 95.5 GiB time: 301.4s
WARNING: Expected aggregate file size       = 41607495680000
WARNING: Stat() of aggregate file size      = 1376805453824
WARNING: Using actual aggregate bytes moved = 1376805453824
WARNING: Maybe caused by deadlineForStonewalling
write     406.10     203.05     0.000061    10158080000 2048.00    0.001297   3233.26    0.003247   3233.27    0   

ior-easy.stonewall:

164128

JulianKunkel commented 3 years ago

Actually, this is the first phase, and it uses the same settings in extended and standard mode, so I believe the problem lies elsewhere. As you can see in this line:

stonewalling pairs accessed min: 9763 max: 164128 -- min data: 19.1 GiB mean data: 95.5 GiB time: 301.4s

the balance between the fastest and the slowest client is about 1:5, and the slowest client needs to catch up roughly 80 GiB. Given that it took 300 s to write about 20 GiB, we would expect the catch-up to take roughly 1200 s; your run took about an hour. I have seen such unbalanced behavior before, and there are various reasons for it. I believe that if you rerun the standard run several times you will likely trigger the same behavior, and that this is not an issue of the extended run.
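For clarity, here is the arithmetic behind that estimate as a back-of-envelope sketch; the 19.1 GiB and 95.5 GiB figures come straight from the stonewall line above, and the awk snippet is illustrative only:

awk 'BEGIN {
  min_gib  = 19.1;            # data written by the slowest rank within the 300 s stonewall
  mean_gib = 95.5;            # mean data written per rank within the 300 s stonewall
  rate     = min_gib / 300;   # throughput of the slowest rank in GiB/s
  printf "catch-up estimate: ~%.0f s beyond the 300 s stonewall\n", (mean_gib - min_gib) / rate;
}'

The measured t_delta of 3233 s is even longer than this linear estimate, consistent with the imbalance persisting through the catch-up phase.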

wszoecs commented 3 years ago

One point is that you are using buffered I/O, which, as you would expect, buffers a lot (try --posix.odirect; and because direct I/O at 1m transfers is slow, increase the transfer size to 64m then). The second point is indeed a real stream unfairness/imbalance on Lustre: real-world observations show that some streams get penalized by up to 10x (though I have rarely seen more than 10x). As far as I know, neither the clients nor the OSSs do any per-client or per-stream tracking or scheduling of requests. As a counter-example, GPFS is extremely 'fair' here; I have never seen any significant imbalance there.

So yes, this behavior is real, and 'the art' is to choose a node/ppn combination that makes it less penalizing. The right node/ppn count depends on the number of OSSs. It is always a big trial-and-error effort (sweeping various node/ppn counts) to find a decent balance here.
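To make the first suggestion concrete, here is a sketch of the ior-easy-write command from the log above with those two changes applied. It assumes your ior build exposes the POSIX backend option --posix.odirect; 64m is a starting point rather than a tuned value, the CSV option is omitted for brevity, and within io500 itself these settings would normally come from the config file rather than from hand-editing the command:

./ior -C -Q 1 -g -G 761181371 -k -e \
  -o ./datafiles/2021.05.10-01.26.15/ior-easy/ior_file_easy \
  -O stoneWallingStatusFile=./results/2021.05.10-01.26.15/ior-easy.stonewall \
  -t 64m -b 9920000m -F -w -D 300 -O stoneWallingWearOut=1 \
  -a POSIX --posix.odirect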

JulianKunkel commented 2 years ago

To sum up: this is not a bug in the benchmark but in the storage. The recommendation for now is to reduce the number of work elements for the respective benchmark, e.g., the segment count or block size, to a value that will still run for more than 300 s but not cause this trouble. I had to do something similar on an HDD-based system.
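As an illustration of that recommendation, here is a minimal sketch of a reduced configuration. The blockSize and segmentCount keys follow the config-full.ini shipped with io500, but the file name my-io500.ini and the example values are placeholders; size them so that each phase still runs past the 300 s stonewall on your system, and keep the rest of your existing configuration unchanged:

cat > my-io500.ini <<'EOF'
[ior-easy]
# default is 9920000m; this caps how much data each rank can write in ior-easy
blockSize = 400000m

[ior-hard]
# default is 10000000; this caps how many segments each rank can write in ior-hard
segmentCount = 1000000
EOF

mpiexec -np 4 ./io500 my-io500.ini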