ENH: Lustre OST mismatch "traffic light" warning

For reference, the Lustre module data (if available) is in the format shown in table 7 at https://www.mcs.anl.gov/research/projects/darshan/docs/darshan-util.html#_guide_to_darshan_parser_output.

This heuristic would only be shown for logs that have Lustre module data present; otherwise, omit it.

Roughly, I think you would get the total number of available OSTs (object storage targets, i.e. the file servers) from the LUSTRE_OSTS field on any given file record. Then iterate through all file records and construct a hash/map (or set) of the OST numbers that appear in the LUSTRE_OST_ID_* lists. The number of elements in the resulting hash then tells you how many OSTs the application actually used. The reason for iterating through all files is to account for jobs that open many files that individually are not widely striped but in aggregate hit all servers.
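As a rough illustration of the counting step, here is a minimal sketch. It assumes each Lustre file record has already been loaded into a plain dict mapping counter names to integer values, and that a record only carries LUSTRE_OST_ID_* entries for the OSTs the file is actually striped over. The function name `count_osts_used` and the dict layout are hypothetical, not an existing Darshan or PyDarshan API.

```python
def count_osts_used(records):
    """Return (osts_used, osts_available) across all Lustre file records.

    `records` is assumed to be an iterable of dicts, one per file record,
    mapping counter names (e.g. "LUSTRE_OSTS", "LUSTRE_OST_ID_0") to ints.
    This layout is an assumption made for illustration only.
    """
    used = set()
    available = 0
    for rec in records:
        # LUSTRE_OSTS should be the same on every record; taking the max
        # just avoids depending on which record happens to come first.
        available = max(available, rec.get("LUSTRE_OSTS", 0))
        for name, value in rec.items():
            if name.startswith("LUSTRE_OST_ID_"):
                used.add(value)
    return len(used), available


# Two narrowly striped files that together touch three distinct OSTs.
example_records = [
    {"LUSTRE_OSTS": 40, "LUSTRE_OST_ID_0": 3, "LUSTRE_OST_ID_1": 7},
    {"LUSTRE_OSTS": 40, "LUSTRE_OST_ID_0": 7, "LUSTRE_OST_ID_1": 12},
]
print(count_osts_used(example_records))  # (3, 40)
```

Using a set means an OST that shows up in many files is only counted once, which is what makes the aggregate view across files work.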

You could then say "This job used 10 of 40 OSTs available on the file system." The color coding would be green for 50% or more, yellow for 10% to 50%, and red for < 10%. Maybe color it gray if the number of ranks in the job (according to the top-level Darshan metadata) is less than, say, 4*LUSTRE_OSTS, as a rough heuristic that the job wasn't big enough to saturate that many servers, so the OST count doesn't really matter. In that case, append a sentence to the description that says something like "The job has 16 MPI ranks and thus would not be likely to saturate available OSTs."
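The color selection could look something like the sketch below. The thresholds are the guessed ones from this issue (50%, 10%, and 4 ranks per OST), and the function name `ost_traffic_light` and its arguments are hypothetical; nothing here is an existing Darshan API.

```python
def ost_traffic_light(osts_used, osts_available, nprocs):
    """Return (color, message) for the OST usage heuristic.

    Thresholds are the rough guesses proposed in this issue:
    green >= 50%, yellow 10% to 50%, red < 10%, and gray when the
    job has fewer than 4 ranks per available OST.
    """
    if osts_available <= 0:
        raise ValueError("no Lustre OST information available")

    message = (
        f"This job used {osts_used} of {osts_available} "
        "OSTs available on the file system."
    )

    if nprocs < 4 * osts_available:
        # Job is probably too small to saturate the servers anyway.
        color = "gray"
        message += (
            f" The job has {nprocs} MPI ranks and thus would not be "
            "likely to saturate available OSTs."
        )
    else:
        fraction = osts_used / osts_available
        if fraction >= 0.5:
            color = "green"
        elif fraction >= 0.1:
            color = "yellow"
        else:
            color = "red"
    return color, message


color, message = ost_traffic_light(10, 40, 16)
print(color)
print(message)
```

The gray check comes first on purpose, so a small job never gets a red warning that it could not reasonably act on.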

That's all kind of guessing at some thresholds on my part. We'll want a disclaimer on the heuristics section (when added) that warns people that these are just statistical heuristics, and users should consult with system support before making significant changes based on them.

This is a simple starting point, and I/O could still be very imbalanced despite putting at least 1 byte on each server, but it would still be helpful for at-a-glance checks.

ENH: Lustre OST mismatch "traffic light" warning #601