darshan-hpc / darshan

Darshan I/O characterization tool

ENH: Lustre OST mismatch "traffic light" warning #601

Open tylerjereddy opened 2 years ago

tylerjereddy commented 2 years ago

Mentioned in today's meeting.

May be useful to have a "traffic light" style warning in pydarshan summary report for Lustre OST usage % / over-subscription. Finding the correct/desired stripe size/ratio might be tricky though?

cc @carns for HPC/Lustre specific clarifications

carns commented 2 years ago

For reference, the Lustre module data (if available) is in the format shown in table 7 at https://www.mcs.anl.gov/research/projects/darshan/docs/darshan-util.html#_guide_to_darshan_parser_output.

This heuristic would only be shown for logs that have Lustre module data present; otherwise it would be omitted.

But I think roughly you would get the total number of available OSTs (file servers) from the LUSTRE_OSTS field on any given file record. Then iterate through all file records and construct a hash/map of which OST numbers appeared in the LUSTRE_OST_ID_* list. The number of elements in the resulting hash then tells you how many OSTs were actually used by the application. The reason for iterating through all files is to account for jobs that open many files that individually are not widely striped but in aggregate hit all servers.
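A minimal sketch of that counting step, in plain Python. The record layout here is an assumption for illustration only: each file record is modeled as a dict with a `LUSTRE_OSTS` count and a hypothetical `ost_ids` list holding that file's `LUSTRE_OST_ID_*` values. This is not the pydarshan API, and real records would need to be adapted to this shape first.

```python
def count_used_osts(records):
    """Return (used, available) OST counts across all Lustre file records.

    Assumed (hypothetical) record layout, one dict per file:
      "LUSTRE_OSTS": total number of OSTs on the file system
      "ost_ids":     OST indices this file is striped over
                     (the LUSTRE_OST_ID_* values)
    """
    if not records:
        # No Lustre module data: the caller should omit the heuristic.
        return 0, 0

    # The total OST count can be read from any given file record.
    available = records[0]["LUSTRE_OSTS"]

    # Union of OST ids over all files, so that many narrowly striped
    # files that together hit many servers are counted correctly.
    used = set()
    for rec in records:
        used.update(rec["ost_ids"])

    return len(used), available


# Made-up example data: three files on a 40-OST file system.
records = [
    {"LUSTRE_OSTS": 40, "ost_ids": [0, 1]},
    {"LUSTRE_OSTS": 40, "ost_ids": [1, 2]},
    {"LUSTRE_OSTS": 40, "ost_ids": [7]},
]
used, available = count_used_osts(records)
print(f"This job used {used} of {available} OSTs available on the file system.")
```

A set is used rather than a per-file maximum because the point is aggregate coverage: OST 1 appears in two files above but is only counted once.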

You could then say "This job used 10 of 40 OSTs available on the file system." The color coding would be green for 50% or more, yellow for 10% to 50%, and red for < 10%. Maybe color it gray if the number of ranks in the job (according to the top-level Darshan metadata) is less than, say, 4*LUSTRE_OSTS, as a rough heuristic that the job wasn't big enough to saturate that many servers, so the OST usage percentage doesn't really matter. In that case, append a sentence to the description that says something like "The job has 16 MPI ranks and thus would not be likely to saturate available OSTs."
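The color rule could be sketched roughly as follows. The function name `ost_traffic_light` and its arguments are hypothetical, and the thresholds (50%, 10%, and the 4x rank factor) are just the guessed values from this comment, exposed as parameters so they can be tuned later.

```python
def ost_traffic_light(used, available, nprocs,
                      green_min=0.5, yellow_min=0.1, rank_factor=4):
    """Classify OST usage as "green", "yellow", "red", or "gray".

    used:      number of distinct OSTs the job touched
    available: total OSTs on the file system (LUSTRE_OSTS)
    nprocs:    number of ranks from the top-level Darshan metadata
    The default thresholds are rough guesses, not validated values.
    """
    if available <= 0:
        raise ValueError("available must be a positive OST count")

    # Small jobs are unlikely to saturate the servers, so the
    # usage percentage is not meaningful for them.
    if nprocs < rank_factor * available:
        return "gray"

    fraction = used / available
    if fraction >= green_min:
        return "green"
    if fraction >= yellow_min:
        return "yellow"
    return "red"


# The example from the text: 10 of 40 OSTs, 16 MPI ranks.
print(ost_traffic_light(10, 40, 16))
# Same OST usage, but a job large enough for the heuristic to apply.
print(ost_traffic_light(10, 40, 1024))
```

Checking the rank count first means a small job never gets a red or yellow flag it could not reasonably act on.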

That's all kind of guessing at some thresholds on my part. We'll want a disclaimer on the heuristics section (when added) that warns people that these are just statistical heuristics, and users should consult with system support before making significant changes based on them.

This is a simple starting point, and I/O could still be very imbalanced despite putting at least 1 byte on each server, but it would still be helpful for at-a-glance checks.