WIP, ENH: parquet demo

tylerjereddy commented 1 year ago

We've had some discussions about supporting a parquet or arrow-like format to avoid the various idiosyncrasies and performance issues related to the in-house binary format, possibly through a converter from the binary format to parquet. This is more of a demo than something that is meant for serious code review, for now...

a quick demo with POSIX-only support for a single summary report table for parquet input, and a converter that only supports POSIX and was only tested on a single log file
this does appear to allow the full test suite to pass while adding incredibly crude summary report support for working with a parquet file that has POSIX counter/fcounter data
there are a few reasons to demo this:

1) It may help spark some discussion about how this should work, because I already made some potentially controversial decisions like concatenating along the columns to fuse counters and fcounters
2) The various TODO comments I added around try/except blocks should give a good indicator of the number of places in the code where changes would be needed to produce a more complete summary report from parquet input
3) Sometimes it is easier to develop from a (crude) prototype if a summer student picks this up (vs. from scratch)
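
To make the column-wise fusing in point 1 concrete, here is a minimal sketch using pandas. It is not the code from this branch: the two small tables, their values, and the helper name `fuse_counters` are hypothetical stand-ins, and the sketch assumes both tables carry matching `id` and `rank` columns in the same row order.

```python
import pandas as pd

# Hypothetical stand-ins for the POSIX integer and floating-point counter
# tables; the values are made up for illustration only.
counters = pd.DataFrame({
    "id": [101, 102],
    "rank": [0, 1],
    "POSIX_OPENS": [1, 1],
    "POSIX_BYTES_WRITTEN": [4096, 8192],
})
fcounters = pd.DataFrame({
    "id": [101, 102],
    "rank": [0, 1],
    "POSIX_F_WRITE_TIME": [0.25, 0.5],
})


def fuse_counters(counters, fcounters):
    """Concatenate counters and fcounters along the columns (hypothetical helper)."""
    # Assumption: rows in both tables describe the same records in the same
    # order, so we verify the key columns agree before gluing them together.
    keys = ["id", "rank"]
    if not counters[keys].equals(fcounters[keys]):
        raise ValueError("counter and fcounter rows are not aligned")
    # Drop the duplicate key columns from fcounters, then join column-wise.
    return pd.concat([counters, fcounters.drop(columns=keys)], axis=1)


fused = fuse_counters(counters, fcounters)
print(fused.columns.tolist())
```

The result is one wide table per module with a single row per record, which is convenient for a columnar format but does mix integer and floating-point columns in the same table.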

The example below shows what happens when producing the summary report with the one parquet file I tested with. It correctly reproduces a single table in the report, since that is all I have added support for so far. Perhaps the other notable observation is that the gzipped parquet file is about 7X larger than the native binary file, even though the native binary contains more raw data, because we're currently excluding DXT_POSIX from the parquet format. I don't consider file size/compression a priority at this stage of development/consideration though.

python -m darshan summary /Users/treddy/rough_work/darshan/test_parquet/runtime_and_dxt_heatmaps_diagonal_write_only.parquet.gzip

carns commented 1 year ago

Neat! Can you explain how to run the converter and/or share the example parquet file?

tylerjereddy commented 1 year ago

> Neat! Can you explain how to run the converter and/or share the example parquet file?

A Python script like this should do the trick locally, if you're all set up on this feature branch with the logs repo installed as well, etc.

from darshan.log_utils import get_log_path, convert_to_parquet

log_path = get_log_path("runtime_and_dxt_heatmaps_diagonal_write_only.darshan")
convert_to_parquet(log_path, "output.parquet.gzip")

Then produce the HTML report as usual with the parquet file. Obviously only POSIX is handled at the moment.

darshan-hpc / darshan

WIP, ENH: parquet demo #929