darshan-hpc / darshan

Darshan I/O characterization tool

job-summary heuristics use case: general imbalanced I/O #443

Open shanedsnyder opened 3 years ago

shanedsnyder commented 3 years ago

We have recently discussed adding some heuristics to our job-summary tool to automatically detect certain behaviors in users' Darshan logs and to "grade" users on how they perform on a given heuristic. The idea is to automatically detect good/bad practices in user I/O workloads and to notify users of how well or poorly they are doing.

Here, we start to flesh out a heuristic related to imbalanced I/O workloads. A Darshan log for this example and a brief README can be found here: https://github.com/darshan-hpc/darshan-logs/tree/main/imbalanced-io

General imbalanced I/O case

For this heuristic, it would be nice to capture generally how well-balanced application I/O workloads are. This metric could be calculated for the MPI-IO and/or POSIX modules, using the same counters. It would likely be easiest to calculate on a per-file basis, rather than trying to aggregate data across files. Using the log file referenced above, here's an example of how to observe the imbalance in the Darshan counters:

POSIX   -1      15708535418621378501    POSIX_FASTEST_RANK      465     /lus/theta-fs0/3981085427       /lus/theta-fs0  lustre
POSIX   -1      15708535418621378501    POSIX_FASTEST_RANK_BYTES        2072    /lus/theta-fs0/3981085427       /lus/theta-fs0  lustre
POSIX   -1      15708535418621378501    POSIX_SLOWEST_RANK      0       /lus/theta-fs0/3981085427       /lus/theta-fs0  lustre
POSIX   -1      15708535418621378501    POSIX_SLOWEST_RANK_BYTES        105876790000    /lus/theta-fs0/3981085427       /lus/theta-fs0  lustre
POSIX   -1      15708535418621378501    POSIX_F_FASTEST_RANK_TIME       0.106691        /lus/theta-fs0/3981085427       /lus/theta-fs0  lustre
POSIX   -1      15708535418621378501    POSIX_F_SLOWEST_RANK_TIME       583.149111      /lus/theta-fs0/3981085427       /lus/theta-fs0  lustre

Note that the fastest rank does considerably less I/O and takes much less time as compared to the slowest rank.
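To make the idea concrete, here is a minimal Python sketch (not part of the original discussion) that computes a per-file slowest/fastest byte ratio from darshan-parser text output, assuming the whitespace-delimited column order shown above (module, rank, record id, counter name, value, file path, mount point, fs type); the 2x threshold is just a placeholder:

import sys
from collections import defaultdict

def per_file_counters(lines, module="POSIX"):
    """Collect {record_id: {counter_name: value}} from darshan-parser text output."""
    records = defaultdict(dict)
    for line in lines:
        fields = line.split()
        # skip comments, blank lines, and records from other modules
        if len(fields) < 5 or fields[0] != module:
            continue
        record_id, counter, value = fields[2], fields[3], float(fields[4])
        records[record_id][counter] = value
    return records

def bytes_imbalance(counters):
    """Ratio of the larger to the smaller of the fastest/slowest rank byte counts."""
    fast = counters.get("POSIX_FASTEST_RANK_BYTES", 0.0)
    slow = counters.get("POSIX_SLOWEST_RANK_BYTES", 0.0)
    if min(fast, slow) <= 0:
        return None  # counters missing or zero; no meaningful ratio
    return max(fast, slow) / min(fast, slow)

if __name__ == "__main__":
    for rec_id, counters in per_file_counters(sys.stdin).items():
        ratio = bytes_imbalance(counters)
        if ratio is not None and ratio > 2.0:  # placeholder threshold
            print(f"record {rec_id}: slowest/fastest byte ratio {ratio:.1f}")

One could pipe the parser output through it, e.g. darshan-parser app.darshan | python3 imbalance_sketch.py (imbalance_sketch.py being a hypothetical file name for the snippet above).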

The variance counters could likely also be used, and probably more reliably: there is no guarantee that the slowest or fastest rank did the least/most I/O, whereas the variance information definitively captures the disparity between processes for both I/O volume and I/O time:

POSIX   -1      15708535418621378501    POSIX_F_VARIANCE_RANK_TIME      683.862226      /lus/theta-fs0/3981085427       /lus/theta-fs0  lustre
POSIX   -1      15708535418621378501    POSIX_F_VARIANCE_RANK_BYTES     22555027473519628288.000000     /lus/theta-fs0/3981085427 /lus/theta-fs0  lustre
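As a rough sketch (not from the original issue), the variance could be normalized into a unitless coefficient of variation; the mean used for normalization is not in the counters above and would have to come from elsewhere in the log, e.g. (POSIX_BYTES_READ + POSIX_BYTES_WRITTEN) divided by the number of ranks sharing the file:

import math

def coefficient_of_variation(variance, mean):
    """Unitless spread of a per-rank quantity: stddev / mean."""
    if mean is None or mean <= 0 or variance < 0:
        return None
    return math.sqrt(variance) / mean

# Worked numbers from the counters above: sqrt(683.862226) ~= 26.2 s of
# per-rank time spread and sqrt(2.2555e19) ~= 4.7e9 bytes of per-rank volume
# spread.  Dividing by a mean per-rank value (taken from other parts of the
# log, as noted above) gives a unitless score that could be compared against
# a threshold.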

For MPI-IO applications, calculating POSIX-level imbalance could also be useful for insight into the MPI-IO collective buffering algorithm. Collective buffering is used to optimize some collective I/O workloads by designating a subset of application processes as "aggregators" that actually perform I/O on behalf of non-aggregator processes. The idea is that aggregator processes can coalesce many per-process I/O requests into larger requests that perform better, and that designating aggregators limits client load on parallel file systems (only aggregators issue read/write calls to the file system; non-aggregators do not). See Section 6 here for more details (you can also find lots of slides online describing the collective buffering algorithm, sometimes called "two-phase I/O"): https://www.mcs.anl.gov/~thakur/papers/mpi-io-noncontig.pdf
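For context only (not from the original issue), a small mpi4py sketch of the kind of collective write where ROMIO-style collective buffering applies; the "cb_nodes" and "romio_cb_write" hint names are ROMIO-specific and other MPI implementations may ignore them:

from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# ROMIO hints controlling collective buffering (implementation-specific names).
info = MPI.Info.Create()
info.Set("romio_cb_write", "enable")  # force two-phase collective writes
info.Set("cb_nodes", "4")             # number of aggregator processes

fh = MPI.File.Open(comm, "output.dat",
                   MPI.MODE_CREATE | MPI.MODE_WRONLY, info)

# Each rank contributes a contiguous block, but only the aggregators issue the
# actual POSIX writes -- which is why POSIX-level counters can look imbalanced
# even when the MPI-IO workload is perfectly balanced.
buf = np.full(1024, rank, dtype="i4")
fh.Write_at_all(rank * buf.nbytes, buf)

fh.Close()
info.Free()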

I'm not sure whether the MPI-IO + POSIX analysis is always that useful, but it is important for the log file example we have for this use case. To avoid getting into too much detail here (I will open a second issue for another heuristic related to this), the MPI-IO collective buffering algorithm for this example picks a single aggregator process to do all I/O, funneling all reads/writes through that one process. At the MPI-IO layer, this workload looks perfectly balanced, even though the POSIX I/O is ultimately funneled through one process. I only mention it because we'll want to think more about which layers we want to perform the analysis on, though POSIX does seem like the easiest starting point.

Metric values:

Gotchas:

Let me know what important details I missed and we can flesh out further.

roblatham00 commented 3 years ago

(if this is exclusively the "load imbalance" thread, I should probably move this somewhere else...)

You mentioned two-phase I/O: one imbalance mode we'd like to flag is when most (but not all) processes enter a collective, and the collective has to stall until that final process arrives. We've used the heuristic of comparing MPIIO_F_SLOWEST_RANK_TIME and POSIX_F_SLOWEST_RANK_TIME -- if those values are "close", we can assume the clients all entered the collective at the same time. If the MPI-IO time is larger than the POSIX time, there are two possibilities:

  1. the two-phase communication was unusually expensive (we expect this situation to be exceedingly rare and also of great interest to MPI-IO developers!)
  2. one or more straggler processes arrived late to the collective routine.
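
A hedged sketch of that comparison (the function name and the 25% tolerance are placeholders, not an agreed-upon implementation):

def collective_stall_hint(mpiio_slowest_time, posix_slowest_time, rel_tol=0.25):
    """Crude classification of the MPIIO_F_SLOWEST_RANK_TIME vs
    POSIX_F_SLOWEST_RANK_TIME comparison described above."""
    if posix_slowest_time <= 0:
        return "no POSIX time recorded"
    gap = mpiio_slowest_time - posix_slowest_time
    if gap <= rel_tol * posix_slowest_time:
        return "ranks likely entered the collective together"
    # MPI-IO time much larger than POSIX time: either unusually expensive
    # two-phase communication (rare) or straggler ranks arriving late.
    return "possible straggler(s) or costly two-phase exchange"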

carns commented 3 years ago

For the gradient (yellow) score I would suggest just picking some thresholds. We can calculate the average volume per rank, compare that to SLOWEST_RANK_BYTES and FASTEST_RANK_BYTES, and flag it yellow if they are off by a factor of 2. Use the MPI-IO counters if present, and the POSIX counters otherwise.

We will probably need to tinker with this; I'm just proposing a straw man to start with until we see some examples.
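
A minimal sketch of that straw man (the function and its inputs are placeholders; total_bytes would come from, e.g., MPIIO_BYTES_READ + MPIIO_BYTES_WRITTEN or the POSIX equivalents, and nprocs from the job header):

def balance_grade(total_bytes, nprocs, fastest_rank_bytes, slowest_rank_bytes,
                  factor=2.0):
    """Flag yellow when the fastest- or slowest-rank byte count differs from
    the per-rank average by `factor` or more."""
    if nprocs <= 0 or total_bytes <= 0:
        return "green"  # no recorded I/O volume for this file
    avg = total_bytes / nprocs
    for rank_bytes in (fastest_rank_bytes, slowest_rank_bytes):
        lo, hi = sorted((max(rank_bytes, 0.0), avg))
        if lo == 0 or hi / lo >= factor:
            return "yellow"
    return "green"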

tylerjereddy commented 3 years ago

@nawtrey are you planning to tackle this one or both of them? maybe just drop a comment if you are tackling one of them and I'll do the same if time permits

nawtrey commented 3 years ago

> @nawtrey are you planning to tackle this one or both of them? maybe just drop a comment if you are tackling one of them and I'll do the same if time permits

I'm taking a look at #444, I'll drop a comment there.