darshan-hpc / darshan

Darshan I/O characterization tool

BUG: PyDarshan can't parse some logs (report from Slack) #941

Closed: tylerjereddy closed this issue 11 months ago

tylerjereddy commented 1 year ago

Reported by Nafise Moti on Slack, though just a few of us have the reproducing log file for now.

I was able to reproduce the traceback below the fold for a number of versions of darshan, rebuilding from scratch (including C libs) each time.

```
Traceback (most recent call last):
  File "", line 198, in _run_module_as_main
  File "", line 88, in _run_code
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/__main__.py", line 3, in
    main()
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/cli/__init__.py", line 164, in main
    mod.main(args)
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/cli/summary.py", line 569, in main
    report_data = ReportData(log_path=log_path)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/cli/summary.py", line 127, in __init__
    self.report = darshan.DarshanReport(log_path, read_all=True)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/report.py", line 374, in __init__
    self.open(filename, read_all=read_all)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/report.py", line 431, in open
    self.read_all()
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/report.py", line 547, in read_all
    self.read_all_heatmap_records()
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/report.py", line 627, in read_all_heatmap_records
    heatmaps[mod].add_record(rec)
  File "/Users/treddy/github_projects/darshan/darshan-util/pydarshan/darshan/datatypes/heatmap.py", line 78, in add_record
    raise ValueError("Record nbins is not consistent with current heatmap.")
ValueError: Record nbins is not consistent with current heatmap.
```

Observed at these branches/tags/hashes:

If I go back to darshan-3.3.1, I end up with LD_LIBRARY_PATH issues; the code base has probably diverged too far from the PyDarshan modernization at that point anyway.

Observation, in case it helps the C-level folks: it seems that the number of (STDIO HEATMAP) bins is correct for a large number of HEATMAP records before the mismatch happens during retrieval from the binary log.

Using darshan-parser with e.g. `grep -E HEATMAP_READ_BIN | wc -l`-style analysis on the problem log:

read bins: 22460 / 104 bins per block (based on inspection/debug prints) -> 215.96 (not divisible)
write bins: 22460 (same situation, obviously)

If I try to similarly dissect the "working" e3sm_io_heatmap_only.darshan log:

175104 write bins / 114 bins per block = 1536 "blocks" == 512 ranks * 3 types of heatmaps
175104 read bins (same thing)
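The divisibility arithmetic above can be double-checked with a few lines of Python (the bin counts are the ones quoted from the `darshan-parser | grep | wc -l` runs in this issue):

```python
# Verify the bins-per-block arithmetic: a consistent log should have a total
# bin count that divides evenly by the per-record bin count.
def blocks(total_bins, bins_per_block):
    """Return (number of whole blocks, leftover bins)."""
    return divmod(total_bins, bins_per_block)

# bad log: 22460 bins at 104 bins per heatmap record -> 100 bins left over
print(blocks(22460, 104))   # (215, 100): not evenly divisible
# good log: 175104 bins at 114 bins per record -> divides cleanly
print(blocks(175104, 114))  # (1536, 0): 1536 == 512 ranks * 3 heatmap types
```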

Ok, so then I got really curious, because manual inspection of the darshan-parser output for the bad log seemed reasonable enough when spot checking. I need to programmatically analyze the bin number increments.

```
./darshan-parser /Users/treddy/rough_work/darshan/bad_log/darshan.log_202212161428 | grep -E HEATMAP_WRITE_BIN > log.txt
```

Use the Python code below the fold to parse the bin numbers out and plot them for the bad and good logs, respectively:

```python
import re

import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt

# extract the bin number from each HEATMAP_WRITE_BIN_<n> line
prog = re.compile(r".*HEATMAP_WRITE_BIN_(\d+).*")
bin_numbers = []
with open("log.txt") as infile:
    for line in infile:
        m = prog.match(line).group(1)
        bin_numbers.append(int(m))

arr = np.array(bin_numbers)
bincounts = np.bincount(arr)

# plot how often each bin number occurs across all ranks/records
fig, ax = plt.subplots()
ax.plot(np.arange(arr.max() + 1), bincounts, marker="o")
ax.set_xlabel("HEATMAP BIN Number")
ax.set_ylabel("frequency")
fig.savefig("fig.png", dpi=300)
```

[figure: frequency of each HEATMAP_WRITE_BIN number for the bad log]

```
./darshan-parser /Users/treddy/github_projects/darshan-logs/darshan_logs/e3sm_io_heatmaps_and_dxt/e3sm_io_heatmap_only.darshan | grep -E HEATMAP_WRITE_BIN > log.txt
```

[figure: frequency of each HEATMAP_WRITE_BIN number for the good log]

So, indeed, it seems that PyDarshan and darshan-parser agree that the bad log is missing the maximum bin number entry for a few ranks. At this point, I suspect we're looking at either a bindings issue or a log-specific issue that I should hand over to the Argonne folks?

tylerjereddy commented 1 year ago

So perhaps this means that for a small number of ranks, there is a non-zero probability that the final heatmap bin doesn't get produced/counted?

carns commented 1 year ago

To fill in a little more info, ranks 72 through 75 are the exact 4 ranks that are one bin short. They stop at bin 102 while the rest go to 103.
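Pinpointing which ranks are short can also be scripted from the darshan-parser text output. This is a hypothetical sketch: it assumes the rank appears in the second whitespace-separated field of each output line, which may not match every darshan-parser version, so adjust the regex as needed:

```python
import re
from collections import defaultdict

# Assumed column layout: "<module> <rank> <record id> HEATMAP_WRITE_BIN_<n> ..."
LINE_RE = re.compile(r"^\S+\s+(-?\d+)\s+\S+\s+HEATMAP_WRITE_BIN_(\d+)")

def short_ranks(lines):
    """Return (ranks whose max bin falls short of the global max, global max bin)."""
    max_bin = defaultdict(int)
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            rank, nbin = int(m.group(1)), int(m.group(2))
            max_bin[rank] = max(max_bin[rank], nbin)
    overall = max(max_bin.values())
    return sorted(r for r, b in max_bin.items() if b < overall), overall
```

For the bad log described above, this kind of scan is what surfaces ranks 72-75 stopping at bin 102 while the rest reach 103.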

This is weird because the ranks should all do an identical, histogram-content-agnostic calculation to collapse and prune the heatmap to an identical size.

The only thing I can think of is that maybe the collective to reach a consensus on the global end timestamp either didn't run or happened in the wrong order or something? https://github.com/darshan-hpc/darshan/blob/main/darshan-runtime/lib/darshan-heatmap.c#L518

If some ranks were basing their heatmap normalization on slightly different end times somehow then it might manifest like this.
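As a toy illustration of that failure mode (this is not Darshan's actual normalization code, just a sketch under the assumption that the final bin count is derived from the end timestamp divided by the bin width):

```python
import math

def nbins(end_time, bin_width):
    # Toy model: number of heatmap bins needed to cover [0, end_time]
    # at a fixed bin width.
    return math.ceil(end_time / bin_width)

# With a consensus end time all ranks agree on the bin count:
print(nbins(103.5, 1.0))  # 104 bins
# A slightly earlier local end timestamp yields one fewer bin:
print(nbins(102.9, 1.0))  # 103 bins
```

Even a sub-second skew in the local end timestamps would be enough to put a few ranks one bin short of the rest.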

That's just a wild guess; I'm not sure what's going on here.

carns commented 1 year ago

The job was executed with the Darshan environment variable set that disables shared file reduction.

Theory: this prevented the _redux() function from being executed in the heatmap module, which in turn caused some ranks to work from different (local) end timestamp values for normalizing the bin count.

If so, this is a runtime bug. For most modules, reduction is optional, but for the heatmap module it actually needs to be mandatory to ensure that the histogram dimensions are consistent.

carns commented 1 year ago

We should be able to validate by running a job across N nodes and printing the timestamp the ranks are using for the heatmap normalization, then repeating it with the disable-reduction option set, and confirming that they are using different values.

The runtime fix would be to exempt this module from honoring the disable reduction flag, if that's not too messy.

We might be able to "fix" existing logs on the util side too. When the parser starts iterating through heatmap records, it can hold on to the nbins value from the first record. If it sees subsequent records with different nbins values, it can either skip the bonus bins or add zero'd out bins to match?
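A minimal sketch of that util-side workaround, with hypothetical names (this is not PyDarshan's actual API): instead of raising ValueError on a mismatched record, coerce each record's bin array to the nbins of the first record seen, zero-padding short records and truncating long ones:

```python
import numpy as np

def coerce_bins(bins, target_nbins):
    """Force a heatmap record's bin array to target_nbins entries.

    Short records are zero-padded on the right; long records are truncated.
    """
    bins = np.asarray(bins)
    if bins.size < target_nbins:
        return np.pad(bins, (0, target_nbins - bins.size))
    return bins[:target_nbins]
```

`add_record` (or its caller) could apply this before accumulating, so logs written with the reduction disabled would still aggregate, at the cost of a few fabricated zero bins on the short ranks.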