darshan-hpc / darshan

Darshan I/O characterization tool
Other
56 stars 27 forks source link

BUG: PNETCDF can have negative opcount frequencies #904

Open tylerjereddy opened 1 year ago

tylerjereddy commented 1 year ago

On latest main, you can get negative operation counts (see below) with this log repo file summay report: python -m darshan summary e3sm_io_heatmap_only.darshan. This probably needs investigation/regression test + fix.

image

shanedsnyder commented 1 year ago

Good catch. In cases where we have to up-convert old module records, we use -1 to denote that a particular counter wasn't instrumented properly (in this case, this log was generated with Darshan's old PnetCDF module before we adopted much more detailed instrumentation, so most of the counters in the new PnetCDF module aren't supported). We need to make sure we zero those back out before plotting.

I'll try to remind myself how we fix this for other modules and apply a similar fix. Probably the most useful regression test to add would be to ensure that no op count plots have negative values, that's true across modules universally.

shanedsnyder commented 1 year ago

In #907, I've updated op count and access histogram plots to always have a ylim of 0, so we won't have to worry about the problem, at least visually, going forward.

As a final refinement, I think we could try to apply the following strategy (pulled from old issue #569):

Add checks such that counters are only included in figure if they were enabled at runtime. It looks like MMap is the only one that is disabled by default, but I'll have to double-check

That approach is nice in that it allows us to distinguish between operations that truly were never called (op count of 0) vs those that Darshan could not instrument at runtime (e.g., POSIX_MMAPS if mmap instrumentation is disabled).

tylerjereddy commented 1 year ago

In https://github.com/darshan-hpc/darshan/pull/907, I've updated op count and access histogram plots to always have a ylim of 0, so we won't have to worry about the problem, at least visually, going forward

I don't see any such change in gh-907.

For the rest of it, I think I like the idea of enforcing a bottom-out of 0 in the data structure, or using a genuine missing value sentinel instead of some kind of reduction on -1s or whatever is going on there. I'd need to think about how much the data structure in question gets used though.

shanedsnyder commented 1 year ago

Oh sorry, I accidentally linked the wrong issue. I meant #910.

tylerjereddy commented 1 year ago

The sentinel should probably be pd.NA eventually, or whatever they are using these days, assuming that nan is not being used because this is an integer type.