darshan-hpc / darshan

Darshan I/O characterization tool

BUG: DXT heatmap bar graph inconsistencies #630

Open nawtrey opened 2 years ago

nawtrey commented 2 years ago

Background

As noted in gh-622, the DXT heatmap figures generated by plot_heatmap() still do not reflect the input dataframe. Specifically, if we plot snyder_acme.exe_id1253318_9-27-24239-1515303144625770178_2.darshan, there are regions where the vertical bar graph is non-zero but the heatmap shows no data. I've boxed in the section with the inconsistency:

Note: the following graphs are all generated using branch nawtrey_issue_575_update_ymax

DXT_POSIX:

snyder_dxt_posix

DXT_MPIIO:

snyder_dxt_mpiio

I checked and the values are certainly in the dataframe, and for 7 ranks there are values of ~5e6, so these should show up orange-red like the values near them. This was checked by saving the dataframe as html and browsing through the values for the DXT_MPIIO figure. Here's an archive with the html file: hmap_data.tar.gz

If the input data is fine (and it seems to be), then it appears the heatmap is having issues representing the input data. I think this is because we are trying to shove 8000+ y-axis bins into a 4.5" tall figure. Even if the entire height was taken by the heatmap, at 300 DPI we are only going to see 1350 distinct y-axis bins. So ultimately I think this is really a resolution problem.
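To make the arithmetic concrete, here is a minimal sketch. The function name is made up for illustration; the 4.5" height and 300 DPI are the values quoted above, and 8192 is the nprocs of the snyder_acme log. It assumes the best case where the heatmap takes the full figure height and each rank needs at least one pixel row.

```python
import math


def max_resolvable_rows(height_in: float, dpi: float) -> int:
    """Number of pixel rows available along the y-axis (best case)."""
    return math.floor(height_in * dpi)


rows = max_resolvable_rows(height_in=4.5, dpi=300)
nprocs = 8192
# how many ranks end up sharing a single pixel row
ranks_per_row = nprocs / rows
print(rows, round(ranks_per_row, 2))
```

So roughly six ranks compete for each pixel row, which would explain why isolated non-zero bins can vanish from the rendered heatmap.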

Solutions

I think there are 3 ways we can handle this:

  1. Set a maximum number of y-axis bins, and if nprocs exceeds it, group the ranks together.
  2. Set the resolution and/or figure size to scale with nprocs.
  3. Set the resolution and/or figure size to scale with nprocs, and add the shading="gouraud" argument to the sns.heatmap() call. This gets passed to the matplotlib.pyplot.pcolormesh() function (which is what makes the heatmap), and it will sort of "blend" the bins:
      'gouraud': Each quad will be Gouraud shaded: The color of the corners (i', j') are given 
      by C[i', j']. The color values of the area in between is interpolated from the corner values.

I think the team decided against 1. early on, so I haven't spent any time working on that solution.
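For reference only, option 1 could look roughly like the sketch below. This is not something that was implemented; `bin_ranks` and `max_bins` are hypothetical names, the data is a plain list of per-rank rows rather than the real dataframe, and summing is just one possible way to combine ranks.

```python
import math


def bin_ranks(rows, max_bins):
    """Sum adjacent rank rows so that at most max_bins rows remain."""
    nprocs = len(rows)
    if nprocs <= max_bins:
        # nothing to do, every rank already fits in its own bin
        return [list(r) for r in rows]
    # number of consecutive ranks merged into one y-axis bin
    group = math.ceil(nprocs / max_bins)
    binned = []
    for start in range(0, nprocs, group):
        chunk = rows[start:start + group]
        # column-wise sum keeps the time bins (x-axis) untouched
        binned.append([sum(col) for col in zip(*chunk)])
    return binned


# 10 ranks, 3 time bins each; squeeze into at most 4 y-axis bins
data = [[rank, 0, 1] for rank in range(10)]
binned = bin_ranks(data, max_bins=4)
print(len(binned), binned[0], binned[-1])
```

The last bin can hold fewer ranks than the others when nprocs is not a multiple of the group size, so a real version would probably want to label the y-axis with rank ranges.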

For 2., here is the DXT_MPIIO figure saved with dpi=2000: snyder_acme_DXT_MPIIO_high_res

We get our data to appear, although it is pretty difficult to see without zooming in when the bins are 5e-4" tall.

For 3., here is the DXT_MPIIO figure saved at dpi=600 with shading="gouraud" set: snyder_acme_DXT_MPIIO_gouraud

Here we can see a bit easier, but the horizontal bins are still pretty difficult to see.

carns commented 2 years ago

Wow, good find. Out of those options, IMHO the gouraud shading looks the most sensible. I would be worried about upping the DPI for general purpose use because it could presumably become a file size problem with enough ranks (and the points may be very tiny visually).

Binning ranks to some maximum y dimension doesn't seem like it would be too bad to me, but TBH I've forgotten the discussion that eliminated it previously. Regardless, gouraud still seems like a reasonable choice to me since it doesn't require any manual bin manipulation on our part.

At some point we are going to lose fidelity no matter what, and if someone wants the details they will need to look at the data hands on. We just need this to be a sensible first cut view in the summary report.

nawtrey commented 2 years ago

Here are the file sizes of the different figures:

1132275 --- 2000_dpi_numpy_hack.png
935723 --- snyder_acme_DXT_POSIX_2000_dpi_original.png
932812 --- snyder_acme_DXT_MPIIO_2000_dpi_original.png
331491 --- snyder_acme_DXT_POSIX_600_dpi_gouraud.png
304666 --- snyder_acme_DXT_MPIIO_600_dpi_gouraud.png

There is some variance in the file size based on how the bins are populated, so I added 2000_dpi_numpy_hack.png as a reference (has a value of 1 in every bin). But it is still 1.1 MB, which may not be too controversial.

As far as binning on the y-axis, I don't remember exactly what was said, I just remember not worrying about it after some discussion with the team.

Personally I think as long as we can resolve the bins reasonably we should avoid binning, but if we need to implement a solution for larger nprocs values we can. Regardless of the path we take, I think we will need to set a threshold for the max DPI we are willing to use (probably based on the resultant file size), because that would determine what our nprocs limit is before having to implement binning, gouraud, or something else.

tylerjereddy commented 2 years ago

Could scale figure size and leave DPI constant; I think HTML supports frames with scroll bars along both axes if you want to contain the heatmap footprint in a div element of some size. Changing the resolution dynamically seems a bit more confusing than having a constant DPI but rank/time-adjusted dimensions. Hard to say if it's worth the effort. If someone wants to run the biggest simulation ever and then complain about file size, I'm not sure that's the driving design case we need to worry about.
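A rough sketch of the constant-DPI idea, just to get a feel for the sizes involved. The function and parameter names are hypothetical, and it ignores the fact that the heatmap axes only occupy part of the figure height:

```python
def scaled_fig_height(nprocs, dpi=300, min_height_in=4.5, px_per_rank=1):
    """Figure height (inches) needed to give every rank px_per_rank pixel rows."""
    needed = nprocs * px_per_rank / dpi
    # never shrink below the normal figure height
    return max(min_height_in, needed)


small = scaled_fig_height(512)
large = scaled_fig_height(8192)
print(small, round(large, 1))
```

For the 8192-rank log that works out to a figure a bit over 27" tall at 300 DPI, which is where the scrollable container would come in.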

Could also just add a prominent warning message at the threshold of the observable limit along a given axis and move on (i.e., "hey your simulation is huge, you might want to inspect the data manually if finer details are missing in the map due to size/resolution limits").

nawtrey commented 2 years ago

I like @tylerjereddy's idea of adding a warning message. I think it would be simple enough to write a function that checks for this and adds a flag to the report.

I tried this really quick:

diff --git a/darshan-util/pydarshan/darshan/experimental/plots/plot_dxt_heatmap.py b/darshan-util/pydarshan/darshan/experimental/plots/plot_dxt_heatmap.py
index 9407f8f..8eb3bec 100644
--- a/darshan-util/pydarshan/darshan/experimental/plots/plot_dxt_heatmap.py
+++ b/darshan-util/pydarshan/darshan/experimental/plots/plot_dxt_heatmap.py
@@ -289,6 +289,28 @@ def adjust_for_colorbar(jointgrid: Any, fig_right: float, cbar_x0: float):
     )

+def get_ax_canvas_height(fig, ax):
+    # get the height of the plot canvas
+    ax_canvas_height = ax.get_window_extent().transformed(fig.dpi_scale_trans.inverted()).height
+    return ax_canvas_height
+
+
+def check_fig_dpi(fig, ax, nprocs):
+    ax_canvas_height = get_ax_canvas_height(fig=fig, ax=ax)
+    # calculate maximum number of ybins that can be resolved
+    max_ybins = int(np.floor(ax_canvas_height * fig.dpi))
+    # calculate number of ybins required to resolve nprocs ybins
+    required_ybins = int(np.ceil(nprocs/ax_canvas_height))
+
+    if nprocs > max_ybins:
+        warn_msg = (
+            "Too many MPI processes to resolve in DXT heatmap figure. \n"
+            f"Figure DPI is {fig.dpi} which supports nprocs <= {max_ybins} \n"
+            f"With {nprocs} processes, this figure requires dpi >= {required_ybins}"
+        )
+        print(warn_msg)
+
+
 def plot_heatmap(
     report: darshan.DarshanReport,
     mod: str = "DXT_POSIX",
@@ -345,6 +367,7 @@ def plot_heatmap(

     # build the joint plot with marginal histograms
     jgrid = sns.jointplot(kind="hist", bins=[xbins, nprocs], space=0.05)
+    jgrid.fig.set_dpi(300)
     # clear the x and y axis marginal graphs
     jgrid.ax_marg_x.cla()
     jgrid.ax_marg_y.cla()
@@ -427,6 +450,8 @@ def plot_heatmap(
     jgrid.ax_joint.set_xlabel("Time (s)")
     jgrid.ax_joint.set_ylabel("Rank")

+    check_fig_dpi(fig=jgrid.fig, ax=jgrid.ax_joint, nprocs=nprocs)
+
     plt.close()

     return jgrid

Here is the output for snyder_acme.exe_id1253318_9-27-24239-1515303144625770178_2.darshan:

Too many MPI processes to resolve in DXT heatmap figure. 
Figure DPI is 300 which supports nprocs <= 842 
With 8192 processes, this figure requires dpi >= 2916

Of course this just prints a message at the moment, but it could be leveraged in a flag, raise a proper warning at run time, etc. Also, the plot canvas should be ~3" tall, which means we should get a recommended dpi of 2731, so this isn't quite right, but it's a starting point.
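Here is a standalone sketch of the same check that raises a proper warning instead of printing. It is not the code from the diff: it takes the canvas height directly instead of a matplotlib figure, and the function name is hypothetical. Note that the quantity the diff calls required_ybins is really a DPI, so it is named required_dpi here. The 2.8095" height is an assumption, back-derived so that it reproduces the 842 and 2916 printed above.

```python
import math
import warnings


def check_heatmap_resolution(canvas_height_in, dpi, nprocs):
    """Warn if nprocs rows cannot be resolved; return (max_ybins, required_dpi)."""
    # pixel rows available on the plotting canvas
    max_ybins = math.floor(canvas_height_in * dpi)
    # DPI needed to give each rank at least one pixel row
    required_dpi = math.ceil(nprocs / canvas_height_in)
    if nprocs > max_ybins:
        warnings.warn(
            "Too many MPI processes to resolve in DXT heatmap figure. "
            f"Figure DPI is {dpi} which supports nprocs <= {max_ybins}. "
            f"With {nprocs} processes, this figure requires dpi >= {required_dpi}.",
            UserWarning,
            stacklevel=2,
        )
    return max_ybins, required_dpi


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # assumed measured canvas height (inches), see note above
    measured = check_heatmap_resolution(2.8095, 300, 8192)
    # nominal 3" canvas
    nominal = check_heatmap_resolution(3.0, 300, 8192)

print(measured, nominal, len(caught))
```

With a 3" canvas the recommended DPI comes out as 2731, so the 2916 in the output above is consistent with get_window_extent() reporting a canvas closer to 2.81" than 3".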

carns commented 2 years ago

I like the warning idea too. I think you could make it simpler for the purposes of the summary report tool. Maybe just a footnote or something added to the caption that says "Warning: sparse I/O access from individual ranks in jobs with more than 512 processes may not be visible at this resolution."

Maybe a command line option could be provided at some point that lets people bump the resolution in whatever way seems to make sense. If so, then that option could be suggested after the warning.

I still don't like automatically bumping resolution unless it is asked for. The problem isn't an individual user running this tool, but chances are good that somehow somewhere it will get included in an automated pipeline and inadvertently produce more data than expected. We have had automated systems that produced legacy darshan-job-summary.pl reports before that hit this problem.

nawtrey commented 2 years ago

> I still don't like automatically bumping resolution unless it is asked for. The problem isn't an individual user running this tool, but chances are good that somehow somewhere it will get included in an automated pipeline and inadvertently produce more data than expected. We have had automated systems that produced legacy darshan-job-summary.pl reports before that hit this problem.

That's a good point. Yeah that sounds like good enough reason to just add a message somewhere.

nawtrey commented 2 years ago

I think we can set a constant in the ReportData constructor to be used when the figures are registered, and later add a command line option that allows users to change it. It could be a scaling factor (default 1) to scale the figure dimensions (with some sort of scroll bar situation for larger figures), or store the DPI (default 300) to be used on select figures with this issue.
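As a very rough sketch of what the defaults plus a later command line override might look like: the `--dpi` and `--fig-scale` flags and the constant names below are hypothetical and do not exist in the summary report tool today.

```python
import argparse

# hypothetical defaults; in practice these would live wherever ReportData is set up
DEFAULT_DPI = 300
DEFAULT_FIG_SCALE = 1.0


def build_parser():
    parser = argparse.ArgumentParser(
        description="hypothetical resolution options for the summary report"
    )
    parser.add_argument("--dpi", type=int, default=DEFAULT_DPI,
                        help="DPI used for figures affected by the resolution limit")
    parser.add_argument("--fig-scale", type=float, default=DEFAULT_FIG_SCALE,
                        help="scaling factor applied to figure dimensions")
    return parser


defaults = build_parser().parse_args([])
custom = build_parser().parse_args(["--dpi", "600", "--fig-scale", "2"])
print(defaults.dpi, defaults.fig_scale, custom.dpi, custom.fig_scale)
```

Keeping the defaults unchanged means automated pipelines would not see larger files unless someone explicitly opts in.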

We could then add a function for checking the resolution using nprocs and the figure plotting area height (like in the diff above), check a single DXT heatmap, and if it's triggered a flag gets registered in the report. I think we want to follow the format used by the partial data flags with a simple warning sign and text.

As far as where to put the flag, I think we already have issues with redundant captions so we would want to avoid adding redundant flags too. If 1 DXT heatmap has an error, the others will too. Maybe that encourages the addition of a flag/warning box at the top of the report to place warnings like this, as has been discussed previously. I would probably need to see an example of what we want that to look like before diving in though. I'm guessing a constant sized box with a scrollbar would be good, with a default message like "No warnings/errors to report".

tylerjereddy commented 2 years ago

Since this issue focuses on the vertical bars, it is also worth noting that the horizontal bar issues still persist on pydarshan-devel after merging gh-622, e.g., for the snyder_acme.. file:

image

nawtrey commented 2 years ago

Thanks @tylerjereddy. I've updated the issue title to better reflect the more general nature of this issue.