tylerjereddy opened 1 year ago:
Two other things to check here:

1) Separate out the `interpolate` reproducer and drill down in pandas, or possibly SciPy `interp1d` under the hood, to see if there's any room for perf bumping there.
2) What if we didn't interpolate at all, and instead used a series of NumPy vectorized ops to find the locations to fill in with 1s? Some NumPy operations are SIMD-accelerated, for example, so it could be worth checking on that (a rough sketch of the idea follows below).
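As a loose illustration of that second idea (not the actual PyDarshan code; the `fill_between_ones` name and the 0/1 mask layout are assumptions here), a vectorized fill might look something like this:

```python
import numpy as np

def fill_between_ones(mask):
    """Fill 1s between the first and last 1 in each row of a 2D 0/1 mask."""
    n_cols = mask.shape[1]
    first = mask.argmax(axis=1)                        # index of first 1 per row
    last = n_cols - 1 - mask[:, ::-1].argmax(axis=1)   # index of last 1 per row
    cols = np.arange(n_cols)
    between = (cols >= first[:, None]) & (cols <= last[:, None])
    # rows with no 1s at all should stay untouched
    return np.where(between & mask.any(axis=1)[:, None], 1, mask)
```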
The example below the fold appears to be 100X faster than pandas interpolation for our use case with 80,000 rows x 200 columns, and it doesn't even use Cython, just NumPy and Python. Let's see if I can do two things now: (1) demonstrate a genuine speedup for the `e3sm_io_heatmap_and_dxt.darshan` log, without slowing other stuff down; (2) perhaps touch base with pandas upstream to see if this "interpolation" use case merits a fast path, or if it is too specialized for that?
For the upstream query, see https://github.com/pandas-dev/pandas/issues/48236.
I should also note that the faster interpolation algorithm is probably a route to avoiding early conversion to the `float64` datatype: without the need to use the pandas `interpolate`, we can just place 1s between the start/end in each row without needing a float-based NaN sentinel, which should mean that a smaller unsigned integer byte type (`np.uint8`) could suffice to hold the "masks." Not sure how much that helps memory/performance overall, since we'd likely eventually have to produce an array of doubles down the line anyway, but doing the manual interpolation op on a smaller data type/array could keep the calculation closer to the processor (i.e., more of the data could fit in a cache level closer to the CPU).
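For a rough sense of the memory side at the 80,000 x 200 shape mentioned above (a back-of-the-envelope comparison, not a measurement from the actual report):

```python
import numpy as np

rows, cols = 80_000, 200
print(np.zeros((rows, cols), dtype=np.float64).nbytes / 1e6)  # 128.0 MB
print(np.zeros((rows, cols), dtype=np.uint8).nbytes / 1e6)    # 16.0 MB
```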
As a quick experiment, if we use more than 1 processor in `get_heatmap_df()` we can save ~8 seconds in the generation of the HTML report for `e3sm_io_heatmap_and_dxt.darshan`. This is in branch `treddy_dxt_html_speedup`, which itself branches off of gh-784. It has no effect on the processing time of `snyder_acme.exe_id1253318_9-27-24239-1515303144625770178_2.darshan` though, and the drop from 42 to 34 seconds isn't enough to justify the complexity that would be required to handle the concurrency properly (a heuristic for the size at which to use > 1 core, how much work to give each process, testing the concurrent vs. serial code paths, having concurrency off by default/opt-in, etc.). Nonetheless, I'll note it here for now since the interpolation takes up 27 of the 42 total seconds for `e3sm...`.
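For reference, a minimal sketch of what chunked multiprocessing over the mask rows could look like; the worker function, the `min_rows` heuristic, and the chunking scheme are all assumptions for illustration, not what the branch actually does:

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor

def _fill_chunk(chunk):
    """Same vectorized fill as the earlier sketch, applied to a row chunk."""
    n_cols = chunk.shape[1]
    first = chunk.argmax(axis=1)
    last = n_cols - 1 - chunk[:, ::-1].argmax(axis=1)
    cols = np.arange(n_cols)
    between = (cols >= first[:, None]) & (cols <= last[:, None])
    return np.where(between & chunk.any(axis=1)[:, None], 1, chunk)

def fill_parallel(mask, workers=4, min_rows=50_000):
    # serial fallback below an assumed size heuristic, since process
    # startup/IPC overhead would dominate for small inputs
    if mask.shape[0] < min_rows:
        return _fill_chunk(mask)
    chunks = np.array_split(mask, workers, axis=0)
    with ProcessPoolExecutor(max_workers=workers) as ex:
        return np.vstack(list(ex.map(_fill_chunk, chunks)))
```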