darshan-hpc / darshan

Darshan I/O characterization tool

WIP: Pydarshan log_based sorting #949

Closed Yanlilyu closed 11 months ago

Yanlilyu commented 11 months ago

Description: This branch adds a log_based routine to help users find the most I/O-intensive jobs. The code is job_stats.py in the PyDarshan CLI tools. It combines the aggregated data from multiple log files into a DataFrame, sorts the data by the user-specified column name in descending order, and then keeps only the first n (number_of_rows from user input) records, returning a DataFrame with the n most I/O-intensive jobs.

User input includes log_path, module, order_by_colname, and number_of_rows. log_path should be a path glob pattern, and order_by_colname should be one of "agg_perf_by_slowest", "agg_time_by_slowest", or "total_bytes".

Example usage: $ python -m darshan job_stats darshan_logs/nonmpi_workflow/"worker*.darshan" STDIO total_bytes 5
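For reference, the combine/sort/filter step described above boils down to something like the following sketch (this is an illustration with an assumed input structure and a hypothetical helper name, not the exact job_stats.py code):

```python
import pandas as pd

def top_n_jobs(per_log_stats, order_by_colname, number_of_rows):
    """Combine per-log aggregate stats, sort descending, keep the first n rows."""
    df = pd.DataFrame(per_log_stats)  # one row of derived stats per log file
    df = df.sort_values(by=order_by_colname, ascending=False)
    return df.head(number_of_rows)
```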

shanedsnyder commented 11 months ago

I agree with Tyler's suggestions above, so if we could incorporate those I think that'd be great.

> also tried running on all of the logs in the logs repo, just to cause trouble:
>
> python -m darshan job_stats ~/github_projects/darshan-logs/darshan_logs/**/*.darshan STDIO total_bytes 5
>
> And ran into an error:

Same here. I think this is being caused by shell wildcard expansion -- your shell will automatically expand the pattern above into all of the possible Darshan log files. From the script's perspective, there is then a ton of log file paths followed by the remainder of the command line arguments, which causes the rest of the command line parsing to blow up. If you enclose your pattern in quotes (and use your actual home directory name rather than ~), it works fine, since no expansion occurs.

Thinking about it more, maybe we should just lean entirely on shell wildcard expansion and not worry about anything glob related in the Python script. I think users will be tripped up by this too when trying to figure out the right way to express a pattern that makes it through the shell all the way to Python. The script would then just accept a list of log files on the command line, with the user responsible for creating that list (using a shell glob or by specifying them manually). Any objections to that? (Sorry @Yanlilyu, I know this is what you originally started with, but this may have been a bad suggestion on my part...)

If we go that direction, I'd suggest switching the remainder of the command line arguments from positional arguments to named arguments (e.g., -n rows, -m module, etc.). Further, we should probably just give them default values so users don't have to always specify them. We could default to total_bytes, POSIX module, and all rows, for instance?
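To make that concrete, a rough argparse sketch of the suggested interface might look like this (option names and defaults are just my guess, not a final spec):

```python
import argparse

# Hypothetical CLI sketch: a positional list of log files (produced by the
# user's shell glob or typed manually) plus named options with defaults.
parser = argparse.ArgumentParser(prog="python -m darshan job_stats")
parser.add_argument("log_paths", nargs="+", help="Darshan log files to scan")
parser.add_argument("-m", "--module", default="POSIX")
parser.add_argument("-o", "--order_by", default="total_bytes",
                    choices=["total_bytes", "agg_perf_by_slowest", "agg_time_by_slowest"])
parser.add_argument("-n", "--number_of_rows", type=int, default=None,
                    help="number of rows to show (default: all)")
args = parser.parse_args()
```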

> Also, if printing to the terminal is the "final medium" for output, we could look at libraries like https://github.com/Textualize/rich or pandas-related things built on top of that for some nice formatting I suppose.

I agree this would be nice, but we can always explore it later. I think the ultimate output format for this tool remains plain shell output (it's really just a quick way to scan a bunch of log files for interesting jobs, so it doesn't need a very formal output format), so maybe that would be something like Rich rather than pandas styling?
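Not something we need in this PR, but just to illustrate the Rich option, rendering the final DataFrame as a terminal table could be as simple as the following (the function name is hypothetical):

```python
from rich.console import Console
from rich.table import Table

def print_job_stats(df):
    # Render the sorted DataFrame as a simple Rich table in the terminal.
    table = Table(title="Most I/O-intensive jobs")
    for col in df.columns:
        table.add_column(str(col))
    for _, row in df.iterrows():
        table.add_row(*(str(val) for val in row))
    Console().print(table)
```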

shanedsnyder commented 11 months ago

Oh, and BTW, after figuring out how to properly format the glob to account for all logs in the Darshan logs repo, I do run into an error that should be properly guarded against:

Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/tmp/darshan/darshan-util/pydarshan/darshan/__main__.py", line 3, in <module>
    main()
  File "/tmp/darshan/darshan-util/pydarshan/darshan/cli/__init__.py", line 164, in main
    mod.main(args)
  File "/tmp/darshan/darshan-util/pydarshan/darshan/cli/job_stats.py", line 162, in main
    df_i = df_IO_data(log_paths[i], mod)
  File "/tmp/darshan/darshan-util/pydarshan/darshan/cli/job_stats.py", line 24, in df_IO_data
    posix_recs = report.records[mod].to_df()
KeyError: 'STDIO'

Not every log is guaranteed to have a given module included in the log file. If a module isn't found in a log file, we should assign it "N/A" values or something like that and put it at the end of the list. Does that seem reasonable?
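One possible way to guard against that, as a rough sketch (the function and column names here are illustrative, not the actual job_stats.py code):

```python
import darshan

def stats_or_na(log_path, mod):
    # If the requested module is absent from this log, return placeholder
    # values; NaN sorts last with pandas sort_values(), so these jobs
    # naturally end up at the bottom of the list.
    report = darshan.DarshanReport(log_path, read_all=False)
    if mod not in report.modules:
        return {"log_file": log_path, "total_bytes": float("nan"),
                "agg_perf_by_slowest": float("nan"),
                "agg_time_by_slowest": float("nan")}
    # ...otherwise compute the real derived statistics for this module...
```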

shanedsnyder commented 11 months ago

One final comment after looking this over: can we drop all columns from the derived statistics DataFrame other than the ones we are concerned with (total_bytes, agg_perf_by_slowest, agg_time_by_slowest)? I don't think the others are useful to include, and dropping them will help make the output more digestible.
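For instance, right before printing, the DataFrame could be restricted to just those columns with something like:

```python
keep_cols = ["total_bytes", "agg_perf_by_slowest", "agg_time_by_slowest"]
df = df[keep_cols]
```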

shanedsnyder commented 11 months ago

We moved this all into one PR: #954