darshan-hpc / darshan

Darshan I/O characterization tool

WIP, ENH: parquet demo #929

Open tylerjereddy opened 1 year ago

tylerjereddy commented 1 year ago

We've had some discussions about supporting a parquet or arrow-like format to avoid the various idiosyncrasies and performance issues of the in-house binary format, possibly via a converter from the binary format to parquet. This is more of a demo than something meant for serious code review, for now...

1) It may help spark some discussion about how this should work, because I already made some potentially controversial decisions, like concatenating along the columns to fuse counters and fcounters.
2) The various TODO comments I added around try/except blocks should give a good indication of how many places in the code would need changes to produce a more complete summary report from parquet input.
3) Sometimes it is easier to develop from a (crude) prototype than from scratch, if a summer student picks this up.
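For readers unfamiliar with the counters/fcounters split, here is a minimal pandas sketch of what "concatenating along the columns" could look like. It is not the code from this branch: the toy values and the `fuse_counters` helper are hypothetical, and it assumes both tables share `id` and `rank` columns with rows in the same order.

```python
import pandas as pd

# Toy stand-ins for the POSIX integer counters and floating-point fcounters.
# The values are made up for illustration only.
counters = pd.DataFrame({
    "id": [101, 102],
    "rank": [0, 1],
    "POSIX_OPENS": [4, 2],
})
fcounters = pd.DataFrame({
    "id": [101, 102],
    "rank": [0, 1],
    "POSIX_F_READ_TIME": [0.25, 0.5],
})


def fuse_counters(counters, fcounters, keys=("id", "rank")):
    """Hypothetical helper: fuse two record tables side by side."""
    keys = list(keys)
    # Column-wise concatenation is only meaningful if the rows line up.
    if not counters[keys].equals(fcounters[keys]):
        raise ValueError("counters and fcounters rows are not aligned")
    # Drop the duplicated key columns from one side before concatenating.
    return pd.concat([counters, fcounters.drop(columns=keys)], axis=1)


fused = fuse_counters(counters, fcounters)
print(fused)
```

The trade-off with this layout is that integer and float columns end up in one wide table, which is convenient for a columnar format but differs from how the two tables are kept separate today.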

The example below shows what happens when producing the summary report with the one parquet file I tested with. It correctly reproduces a single table in the report, since that is all I have added support for so far. Perhaps the other notable observation is that the gzipped parquet file is about 7x larger than the native binary file, even though the native binary contains more raw data, because DXT_POSIX is currently excluded from the parquet format. I don't consider file size/compression a priority at this stage of development/consideration though.
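A rough way to reproduce that kind of size comparison locally is sketched below. The `size_ratio` helper and the file names in the usage comment are hypothetical; it only compares bytes on disk and says nothing about how much data each file actually holds.

```python
import os


def size_ratio(converted_path, native_path):
    """Return the on-disk size of converted_path divided by that of native_path."""
    native_size = os.path.getsize(native_path)
    if native_size == 0:
        raise ValueError("native file is empty; ratio is undefined")
    return os.path.getsize(converted_path) / native_size


# Example usage (hypothetical paths):
# print(size_ratio("output.parquet.gzip", "some_log.darshan"))
```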

python -m darshan summary /Users/treddy/rough_work/darshan/test_parquet/runtime_and_dxt_heatmaps_diagonal_write_only.parquet.gzip

image

carns commented 1 year ago

Neat! Can you explain how to run the converter and/or share the example parquet file?

tylerjereddy commented 1 year ago

> Neat! Can you explain how to run the converter and/or share the example parquet file?

A Python script like this should do the trick locally, if you're all set up on this feature branch with the logs repo installed as well, etc.

from darshan.log_utils import get_log_path, convert_to_parquet

log_path = get_log_path("runtime_and_dxt_heatmaps_diagonal_write_only.darshan")
convert_to_parquet(log_path, "output.parquet.gzip")

Then produce the HTML report as usual with the parquet file. Obviously only POSIX is handled at the moment.