darshan-hpc / darshan

Darshan I/O characterization tool

WIP, ENH: add POSIX access pattern plot to darshan job summary #903

Closed · shanedsnyder closed this 1 year ago

shanedsnyder commented 1 year ago

See commits for more details.

If things look good, I can add some tests that at least check some expected summary_record counter values returned by accumulate_records().
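A rough sketch of how such a test might look -- the log name and expected values are placeholders, and the exact import path, signature, and return structure of accumulate_records() are assumptions rather than the settled API:

```python
import pytest

import darshan
from darshan.lib.accum import accumulate_records  # assumed import path


@pytest.mark.parametrize(
    "log_name, expected_counters",
    [
        # placeholder log and values -- real expectations would be pinned
        # against a trusted reference (e.g., darshan-parser output)
        ("example.darshan", {"POSIX_READS": 100, "POSIX_SEQ_READS": 80}),
    ],
)
def test_accumulated_summary_record(log_name, expected_counters):
    # assumed flow: open the log, pull the POSIX records, accumulate them,
    # then compare counters in the resulting summary_record
    report = darshan.DarshanReport(log_name, read_all=True)
    nprocs = report.metadata["job"]["nprocs"]
    recs = report.records["POSIX"].to_records()
    acc = accumulate_records(recs, "POSIX", nprocs)
    # the exact structure of summary_record is assumed here
    for counter, expected in expected_counters.items():
        assert acc.summary_record["counters"][counter] == expected
```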

A few thoughts:

shanedsnyder commented 1 year ago

Here's what a new example report looks like for now (e3sm_io_heatmap_only.darshan): [screenshot of the generated job summary showing the new POSIX access pattern bar chart]

So, the obvious things to clean up are making sure the legend isn't on top of the plot data, and probably adding a page break between plots (or otherwise reorganizing things so the big table isn't jammed next to this plot).
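For the legend overlap specifically, one common option (a minimal matplotlib sketch with placeholder data, not the actual summary plotting code) is to anchor the legend outside the axes:

```python
import matplotlib.pyplot as plt
import numpy as np

# placeholder access pattern counts, just to demonstrate legend placement
categories = ["total", "sequential", "consecutive"]
read_counts = [100, 80, 60]
write_counts = [50, 40, 30]

x = np.arange(len(categories))
fig, ax = plt.subplots()
ax.bar(x - 0.2, read_counts, width=0.4, label="read")
ax.bar(x + 0.2, write_counts, width=0.4, label="write")
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.set_ylabel("operation count")
# bbox_to_anchor pins the legend just outside the right edge of the axes,
# so it can never sit on top of the bars
ax.legend(loc="upper left", bbox_to_anchor=(1.02, 1.0), borderaxespad=0)
fig.tight_layout()
```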

shanedsnyder commented 1 year ago

I'll respond more to other comments later, but wanted to start here:

I thought about this a bit, and I think it would be good to generalize the plotting routines we provide away from report objects, to allow us to plot data coming from multiple log files. I didn't modify any other plotting routines, but I could take a pass at this in another PR if that seems reasonable?

I'm perhaps less convinced of the value/priority for now--I think an issue with a clear description of concrete use cases and how the API would work for the current usages is likely in order. I believe some of our plotting routines behave quite differently, so I'm also not entirely convinced this strategy is appropriate in a super broad sense, and I'd also like to keep the plotting functions as simple as possible--avoiding data analysis inside the plotting function proper where possible.

If it were clear that this reduced the amount of code substantially, that might be convincing, though for an infrastructure that is initially "report"/single-file focused, it seems likely this would initially grow more complexity to allow generalization, and it may bloat plotting functions with shims/analyses that have little to do with plotting proper.

Sure, we could document what each existing plot expects in terms of data in an issue or something. At a high level, all they do now is take a report as input and extract all the data they need from it. The change I'm suggesting would be to instead make the caller of the plotting routines responsible for extracting the data in the format the plots expect and passing it in directly, rather than having the plot routines dig around in report objects. That would actually simplify the plotting code, at the expense of requiring callers to be more explicit about the data they want plotted.
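As a concrete sketch of that split (the function and parameter names here are hypothetical, not the existing pydarshan API, and the numbers are placeholders):

```python
import pandas as pd
import matplotlib.pyplot as plt


def plot_access_pattern(counts_df: pd.DataFrame) -> plt.Figure:
    """Hypothetical 'dumb' plotting routine: it only draws whatever
    pre-aggregated counts the caller hands it and never touches a report."""
    fig, ax = plt.subplots()
    counts_df.plot.bar(ax=ax, rot=0)
    ax.set_ylabel("operation count")
    return fig


# the caller (single log, multiple logs, a whole workflow...) is responsible
# for condensing its records down to this small DataFrame before plotting
counts_df = pd.DataFrame(
    {"read": [100, 80, 60], "write": [50, 40, 30]},
    index=["total", "sequential", "consecutive"],
)
fig = plot_access_pattern(counts_df)
```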

Of course, the main driver of this is not really code cleanup; it's to generalize our plotting routines so that they could conceivably plot data that comes from multiple log files. As it stands, that is impossible, which will be a definite problem going forward. We really need this capability to extend Darshan to be useful in the context of analyzing workflows (which generate many log files) or analyzing tons of logs generated at a facility. I think continuing to tie our plotting infrastructure to single logs would be a mistake in that regard.

For folks that need functionality this advanced, I think an argument might need to be made that they can't just write their own plotting code? Thinking about the heatmap plotting, my brain melts a bit re: fusing files and DataFrames from multiple sources, and supporting bug reports from users related to this kind of thing. There are probably other such examples, and cases that are a bit more on the fence. So, my initial thought is--case-by-case basis with strong real-world justification, and try to keep plotting code devoid of analysis complexity, having it instead accept the final condensed/consensus data structures?

The use cases I mention above are definitely of interest to our users. I'm not sure the changes are really so challenging (we'll see--I think the plot in this PR provides some supporting evidence that this could be relatively simple) that we have to punt on this for users. The heatmaps will almost certainly be the most involved, but I don't think we have to figure that out all at once. The other plots seem like they would be incredibly straightforward to switch to a DataFrame- or record-based input, which would make them trivial to apply to single- or multi-log use cases. I don't really see the downside, other than that it requires a small refactor?

tylerjereddy commented 1 year ago

Maybe just shift that stuff to an issue, yeah--for code that is basically "produce a bar chart" like here, I don't really see much value in having it public. We could basically just make the aggregator public, with a small example in its docstring of how you might condense and plot, freeing us from any plotting code maintenance entirely.

tylerjereddy commented 1 year ago

And if DataFrames are embraced more globally in the code base, they already have plotting methods attached to them, so calling DataFrame.plot.bar() and so on may very well get users close enough in many cases; we'd just need some small docs with examples that are tested/maintained.
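For example, a quick sketch of that path (placeholder numbers, and the two-log scenario is just illustrative):

```python
import pandas as pd

# placeholder read-operation counts condensed from two hypothetical logs;
# once the data is in a DataFrame, pandas' built-in plotting gets users a
# serviceable grouped bar chart with one call
counts = pd.DataFrame(
    {"log_A": [120, 90, 70], "log_B": [60, 55, 40]},
    index=["total", "sequential", "consecutive"],
)
ax = counts.plot.bar(rot=0, ylabel="read operation count")
ax.figure.savefig("access_pattern.png", bbox_inches="tight")
```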

shanedsnyder commented 1 year ago

> why isn't the total category equal to the sum of consecutive and sequential (i.e., the read values here)?--if the categories aren't mutually exclusive I wonder if that is worth noting

I can see how this could be confusing, as I was convinced there was a problem too before thinking about it more. The issue is that the "sequential" count is inclusive of the "consecutive" count (see the definitions in the figure caption). I.e., you have to subtract the consecutive count from the sequential count to get the count of operations that were only sequential.

There is a third implicit category, not represented in the plot, which covers what is essentially the "random" access case -- in that case, the read/write operation actually seeks backward in the file. Total operations is then the sum of the random, consecutive, and sequential-only operations, if that makes sense.
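In code form, the relationship is roughly this (using the POSIX read counters as an example, with made-up values, and assuming they've already been summed into a simple dict):

```python
# made-up summed POSIX read counters for one job
counters = {"POSIX_READS": 100, "POSIX_SEQ_READS": 80, "POSIX_CONSEC_READS": 60}

total = counters["POSIX_READS"]
sequential = counters["POSIX_SEQ_READS"]    # inclusive of consecutive ops
consecutive = counters["POSIX_CONSEC_READS"]

sequential_only = sequential - consecutive  # sequential but not consecutive
random = total - sequential                 # the implicit "seek backward" case

# total is the sum of the three mutually exclusive categories
assert total == random + sequential_only + consecutive
```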

I'll have to think a bit more and see if any tweaks to the plot make sense or if we should at least add some clarifying text about this.

shanedsnyder commented 1 year ago

> I'll have to think a bit more and see if any tweaks to the plot make sense or if we should at least add some clarifying text about this.

I ultimately just added a new figure caption to clarify that sequential is inclusive of consecutive. I'm not sure other tweaks will be much more helpful, so I think I'm content with that. We could present "sequential" as being strictly sequential (i.e., not including consecutive operations), but that would also require clarifying text, so I'm not sure it's any better.

shanedsnyder commented 1 year ago

I think I'm all done with edits here and ready for any final review comments before merging.

tylerjereddy commented 1 year ago

I'll try to take a final look on April 17th; I got swamped with grants and stuff this weekend.

tylerjereddy commented 1 year ago

Ok, I looked at the diff again and checked a sample log->HTML conversion locally; it seems "ok" and still consistent with the old Perl report for access pattern plotting.

Since we really do need to wrap up the feature-completeness of the new Python report I'll go ahead and merge.

I think some of the regression testing didn't really happen here re: the values we see in the new plot, but I'll leave that up to the team I guess. As far as I can tell, we've just incremented the number of expected plots, which is at least better than nothing.