Closed: mrashid2 closed this issue 2 years ago.
Actually, this is the first phase, and the settings are the same for the extended and the standard run. Therefore, I believe the problem lies elsewhere. Look at this line:

stonewalling pairs accessed min: 9763 max: 164128 -- min data: 19.1 GiB mean data: 95.5 GiB time: 301.4s

The slowest client has written only about a fifth of the mean (19.1 GiB vs. 95.5 GiB), so it still needs to catch up roughly 80 GiB. Given that it took ~300 s to write those ~20 GiB, the catch-up alone should take roughly another 1200 s (see the worked numbers below). Your run takes 1 hour. I have seen such unbalanced behavior before, and there are various reasons for it. I believe that if you rerun the standard run several times, you will likely trigger the same behavior, i.e., this is not an issue specific to the extended run.
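Spelling that estimate out with the numbers from the quoted line:

```
remaining ≈ 95.5 GiB - 19.1 GiB   ≈ 76 GiB         (mean minus slowest client)
rate      ≈ 19.1 GiB / 301.4 s    ≈ 0.063 GiB/s    (slowest client's observed rate)
catch-up  ≈ 76 GiB / 0.063 GiB/s  ≈ 1200 s
```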
One point is that you are using buffered I/O, which, as you would expect, buffers a lot (try --posix.odirect; and because direct I/O at 1m transfers is slow, increase the transfer size to 64m as well; see the sketch below). The second point is indeed a real stream unfairness/imbalance on Lustre: real-world observations show that some streams get penalized by up to 10x (though I have rarely seen more than 10x). As far as I know, neither the clients nor the OSSs do any per-client or per-stream tracking/scheduling of requests. As a counter-example, GPFS is extremely 'fair' here; I have never seen any significant imbalance there.
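As a rough sketch of how these two knobs could look in the io500 .ini for the ior-easy phase; I am assuming here that your io500 version accepts these per-phase keys, so please check them against the example ini files shipped with io500 before relying on them:

```
[ior-easy]
# assumed key names, for illustration only
posix.odirect = TRUE   # use O_DIRECT instead of buffered I/O
transferSize  = 64m    # larger transfers, since direct I/O at 1m is slow
```

On a plain ior command line the same change corresponds to --posix.odirect together with -t 64m.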
So yes, this behavior is real, and 'the art' is to choose a node/ppn combination that makes it less penalizing. The right node/ppn count depends on the number of OSSs. It is always a huge trial-and-error effort (sweeping various node/ppn counts) to find a decent balance here; a possible sweep is sketched below.
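Purely as an illustration of such a sweep (the launcher flags, node count, and config path below are assumptions, not taken from this issue), one could rerun the benchmark for a few ranks-per-node values and compare the min/max spread of the "stonewalling pairs accessed" line in each resulting ior-easy-write.txt:

```
#!/bin/bash
# Hypothetical node/ppn sweep; adjust NODES, the MPI launcher flags and the
# config path to your site. Open MPI syntax is assumed for --map-by.
NODES=10
for ppn in 2 4 8 16; do
  mpirun -np $((NODES * ppn)) --map-by ppr:${ppn}:node ./io500 config.ini
done
```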
To sum up: this is not a bug in the benchmark but a problem of the storage. The recommendation for now is to reduce the number of work elements for the affected phase, e.g., the segment count or block size, to a value that still runs for more than 300 s but does not cause this trouble (see the example below). We had to do something similar on an HDD-based system.
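For instance, a hedged sketch of what such a cap could look like in the io500 .ini; the key names follow the example configs I am aware of, and the values are placeholders that you would tune so each phase still runs for more than 300 s:

```
[ior-easy]
blockSize = 100000m     # per-rank write volume for ior-easy (placeholder value)

[ior-hard]
segmentCount = 1000000  # number of 47008-byte records per rank (placeholder value)
```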
In my experiment, when I tried to run the IO-500 benchmark in extended mode, the ior-easy-write phase did not stop at the stonewall time of 300 s. The write operations kept going until the cluster's storage space was about to be exhausted. The same thing occurred for the ior-hard-write phase. I have not faced this problem when running in standard mode.
I am providing below one example instance from my local experiment where ior-easy-write took a very long time (the benchmark configuration was the default one for the ior-easy-write phase):
result_summary.txt:
ior-easy-write.txt:
ior-easy.stonewall:
164128