darshan-hpc / darshan

Darshan I/O characterization tool

add a "worst case" I/O cost per process graph to job-summary #306

Open shanedsnyder opened 3 years ago

shanedsnyder commented 3 years ago

In GitLab by @carns on Mar 8, 2021, 13:36

The current darshan-job-summary PDF includes a graph like the attached Screenshot_from_2021-03-08_14-28-32. It shows the average I/O cost per process, computed by summing read, write, and metadata time across all ranks and dividing by the number of ranks.

This is helpful in many cases, but it fails to highlight cases where the application is held back by one rank (or a few ranks) spending more time in I/O than the others. The worst case would be a job that does "rank 0 I/O" (i.e., all I/O issued from a single process of the job). Such a job would look fine on average in this graph as long as there are enough ranks doing no I/O to amortize the cost.

We should add an additional graph that shows, at minimum, the worst-case rank's I/O time percentage alongside the average percentage.
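To make the distinction concrete, here is a minimal sketch (not actual darshan-util or PyDarshan code; the function name, input layout, and runtime value are hypothetical) contrasting the existing average metric with the proposed worst-case metric, assuming per-rank read/write/meta times in seconds are available:

```python
# Hypothetical sketch: average vs. worst-case I/O cost per process.
# Inputs and names are illustrative, not Darshan's actual API.

def io_cost_percentages(per_rank_times, runtime):
    """per_rank_times: one (read, write, meta) tuple of seconds per rank.
    runtime: total job runtime in seconds.
    Returns (average I/O %, worst-case rank I/O %)."""
    totals = [sum(t) for t in per_rank_times]  # combined I/O time per rank
    nprocs = len(per_rank_times)
    avg_pct = 100.0 * sum(totals) / (nprocs * runtime)  # current graph
    worst_pct = 100.0 * max(totals) / runtime           # proposed addition
    return avg_pct, worst_pct

# "Rank 0 I/O" pattern: one rank spends 90s in I/O, 99 ranks spend none.
times = [(60.0, 25.0, 5.0)] + [(0.0, 0.0, 0.0)] * 99
avg, worst = io_cost_percentages(times, runtime=100.0)
# avg is 0.9% (looks harmless); worst is 90% (the actual bottleneck)
```

The example shows exactly the failure mode described above: the average graph reports under 1% I/O cost for a job that is in fact I/O-bound on one rank.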

carns commented 2 years ago

This has a dependency on https://github.com/darshan-hpc/darshan/issues/642 to generalize some of the aggregate calculations in the darshan-util library.

shanedsnyder commented 1 year ago

This is blocked on work partially described in #155 -- to support this fully, we need to maintain slowest-rank read, write, and meta times separately for shared files. For now, we can only provide a "worst case" cost with all I/O components (read, write, meta) combined.
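A rough sketch of the limitation (function name, input layout, and the combination rule are hypothetical; shared-file counters here stand in for Darshan's slowest-rank timing records, which track only a single combined time per shared file, not separate read/write/meta components):

```python
# Hypothetical sketch: why the worst-case figure is combined-only today.
# Non-shared files have full per-rank (read, write, meta) times, but each
# shared file contributes only one slowest-rank *combined* time, so the
# worst case cannot be broken down into components.

def worst_case_combined(unique_recs, shared_slowest_times):
    """unique_recs: {rank: (read, write, meta)} seconds for non-shared files.
    shared_slowest_times: one slowest-rank combined time per shared file.
    Returns an upper-bound estimate of worst-case combined I/O time,
    assuming (pessimistically) the slowest unique-file rank is also the
    slowest rank for every shared file."""
    worst_unique = max((sum(t) for t in unique_recs.values()), default=0.0)
    return worst_unique + sum(shared_slowest_times)

# Two ranks with unique-file I/O plus one shared file:
est = worst_case_combined({0: (1.0, 2.0, 3.0), 1: (0.5, 0.5, 0.5)}, [2.0])
# est is 8.0 seconds, with no way to say how much was read vs. write vs. meta
```

Once slowest-rank times are tracked per component for shared files (the #155 work), the same accumulation could be done three times, once per component, yielding the separate worst-case read/write/meta breakdown the proposed graph wants.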