So perhaps this means that for a small number of ranks, there is a non-zero probability that the final heatmap bin doesn't get produced/counted?
To fill in a little more info, ranks 72 through 75 are the exact 4 ranks that are one bin short. They stop at bin 102 while the rest go to 103.
This is weird because the ranks should all do an identical, histogram-content-agnostic calculation to collapse and prune the heatmap to an identical size.
The only thing I can think of is that maybe the collective to reach a consensus on the global end timestamp either didn't run or happened in the wrong order or something? https://github.com/darshan-hpc/darshan/blob/main/darshan-runtime/lib/darshan-heatmap.c#L518
If some ranks were somehow basing their heatmap normalization on slightly different end times, then it might manifest like this.
That's just a wild guess; I'm not sure what's going on here.
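To make that concrete, here is a rough sketch (not the actual darshan-heatmap.c logic; the bin width, end times, and ceil() normalization are assumptions for illustration only) of how two slightly different end timestamps could produce 102 vs. 103 final bins:

```python
# Rough illustration, not the real darshan-heatmap.c collapse/prune code:
# if the final bin index is derived from something like ceil(end_time /
# bin_width), ranks that normalize against slightly different end
# timestamps can land on different final bin counts.
import math

def nbins(end_time_s, bin_width_s):
    """Hypothetical bin count for a heatmap spanning [0, end_time_s]."""
    return math.ceil(end_time_s / bin_width_s)

bin_width = 10.0     # assumed bin width in seconds
global_end = 1024.3  # end time agreed on via the collective
local_end = 1019.7   # a straggler rank's local end time

print(nbins(global_end, bin_width))  # 103
print(nbins(local_end, bin_width))   # 102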
The job was executed with the Darshan environment variable set that disables shared file reduction.
Theory: this prevented the _redux() function from being executed in the heatmap module, which in turn caused some ranks to work from different (local) end timestamp values for normalizing the bin count.
If so, this is a runtime bug. For most modules the reduction is optional, but for the heatmap it actually needs to be mandatory to keep the histogram dimensions consistent across ranks.
We should be able to validate this by running a job across N nodes and printing the timestamp each rank uses for the heatmap normalization, then repeating the run with the disable-reduction option set and confirming that the ranks use different values.
The runtime fix would be to exempt this module from honoring the disable reduction flag, if that's not too messy.
We might be able to "fix" existing logs on the util side too. When the parser starts iterating through heatmap records, it can hold on to the nbins value from the first record. If it sees subsequent records with a different nbins value, it can either skip the bonus bins or add zeroed-out bins to match?
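A minimal sketch of that idea (the record layout and function name here are made up for illustration and do not reflect the real PyDarshan record structures):

```python
# Hypothetical util-side workaround: take the nbins of the first heatmap
# record as the reference and pad/truncate every other record's bins to match.
def normalize_heatmap_bins(records):
    """records: list of dicts like {"rank": int, "bins": list_of_counts}."""
    if not records:
        return records
    ref_nbins = len(records[0]["bins"])
    for rec in records:
        bins = rec["bins"]
        if len(bins) < ref_nbins:
            # rank stopped one (or more) bins short: pad with zeroed bins
            bins.extend([0] * (ref_nbins - len(bins)))
        elif len(bins) > ref_nbins:
            # rank has "bonus" bins: drop them
            del bins[ref_nbins:]
    return records

# Example: one rank stops at bin 102 while another goes to 103
recs = [{"rank": 0, "bins": [1] * 104}, {"rank": 72, "bins": [1] * 103}]
normalize_heatmap_bins(recs)
print([len(r["bins"]) for r in recs])  # [104, 104]
```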
Reported by Nafise Moti on Slack, though just a few of us have the reproducing log file for now.
I was able to reproduce the traceback below the fold for a number of versions of darshan, rebuilding from scratch (including C libs) each time.
Observed at these branches/tags/hashes:
main
darshan-3.4.2
darshan-3.4.1
darshan-3.4.0 (main, I think)

If I go back to darshan-3.3.1, I end up with LD_LIBRARY_PATH issues; the code base has probably diverged too far from the PyDarshan modernization at that point anyway.

Observation, in case it helps the C-level folks: it seems that the number of (STDIO HEATMAP) bins is correct for a large number of HEATMAP records before the mismatch happens during retrieval from the binary log. Using darshan-parser with, e.g., `grep -E HEATMAP_READ_BIN | wc -l` style analysis on the problem log:

read bins: 22460 / 104 bins per block (based on inspection/debug prints) -> 215.96 (not divisible)
write bins: 22460 (same situation, obviously)

If I similarly dissect the "working" e3sm_io_heatmap_only.darshan log:

175104 write bins / 114 bins per block = 1536 "blocks" == 512 ranks * 3 types of heatmaps
175104 read bins (same thing)

Ok, so then I got really curious, because manual inspection of the darshan-parser output for the bad log seemed reasonable enough when spot checking. I needed to programmatically analyze the bin number increments:

./darshan-parser /Users/treddy/rough_work/darshan/bad_log/darshan.log_202212161428 | grep -E HEATMAP_WRITE_BIN > log.txt

Use the Python code below the fold to parse the bin numbers out and plot them for the bad and good logs, respectively:

./darshan-parser /Users/treddy/github_projects/darshan-logs/darshan_logs/e3sm_io_heatmaps_and_dxt/e3sm_io_heatmap_only.darshan | grep -E HEATMAP_WRITE_BIN > log.txt
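For reference, a minimal sketch of the kind of max-bin check described above (this is not the actual "below the fold" script; the rank column position and the HEATMAP_WRITE_BIN_<n> counter-name pattern in the grepped darshan-parser output are assumptions):

```python
# Sketch: find ranks whose highest HEATMAP_WRITE_BIN index falls short of
# the overall maximum, using the grepped darshan-parser output in log.txt.
# The column layout assumed here (rank in the second field) may need
# adjusting for the real parser output.
import re
from collections import defaultdict

max_bin_per_rank = defaultdict(int)

with open("log.txt") as fh:
    for line in fh:
        fields = line.split()
        if len(fields) < 4:
            continue
        rank = int(fields[1])  # assumed rank column
        m = re.search(r"HEATMAP_WRITE_BIN_(\d+)", line)
        if m:
            bin_no = int(m.group(1))
            max_bin_per_rank[rank] = max(max_bin_per_rank[rank], bin_no)

overall_max = max(max_bin_per_rank.values())
short_ranks = sorted(r for r, b in max_bin_per_rank.items() if b < overall_max)
print("max bin index:", overall_max)
print("ranks that stop short:", short_ranks)  # e.g. [72, 73, 74, 75] for the bad log
```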
So, indeed, it seems that PyDarshan and darshan-parser agree that the bad log is missing a few of the max bin number entries for a few ranks. At this point, I suspect we're looking at either a bindings issue or a log-specific issue that I should hand over to the Argonne folks?