darshan-hpc / darshan

Darshan I/O characterization tool

ENH: rm repeated calls to log_get_generic_record #451

Open · tylerjereddy opened 3 years ago

tylerjereddy commented 3 years ago

The C API currently only exposes a way to read one row of a "data structure"/"record table" at a time, via libdutil.darshan_log_get_record.
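For context, the per-record pattern on the Python side looks roughly like the sketch below. The names (`ffi`, `libdutil`, `log`, `mod_idx`, the POSIX struct) are illustrative of the CFFI backend rather than exact, and the C prototype is paraphrased:

```python
# One-record-at-a-time loop via CFFI (rough sketch). Assumes an
# ffi/libdutil pair built against darshan-logutils.h and an
# already-opened log handle; the C prototype is approximately:
#   int darshan_log_get_record(darshan_fd fd, int mod_idx, void **buf);
buf = ffi.new("void **")
records = []
while True:
    ret = libdutil.darshan_log_get_record(log, mod_idx, buf)
    if ret < 1:
        break  # 0 == no more records in this module, < 0 == error
    rec = ffi.cast("struct darshan_posix_file **", buf)[0]
    records.append((rec.base_rec.id, rec.base_rec.rank))
```

Every record costs one FFI crossing plus Python-level struct access, which is where the per-record overhead accumulates.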

Although Shane mentioned there may be plans to add a way to return multiple rows/records at some point in the future, we probably also have the option of moving the Python loops/calls to "effective C" via CFFI or, if we're open to other dependencies, Cython, Numba, and so on...
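For illustration, moving the loop itself into C with CFFI's API mode might look something like the sketch below. `read_all_counters` is a hypothetical helper, not an existing darshan API, and the build assumes `darshan-logutils.h` and `libdarshan-util` are discoverable by the compiler:

```python
from cffi import FFI

ffibuilder = FFI()
ffibuilder.cdef("""
int read_all_counters(const char *path, int mod_idx,
                      int64_t *out, int max_records, int n_counters);
""")
ffibuilder.set_source(
    "_darshan_batch",
    r"""
    #include <stdlib.h>
    #include <string.h>
    #include "darshan-logutils.h"

    /* Hypothetical helper: iterate one module's records entirely in C,
       copying each record's counter array into a flat int64 buffer the
       caller owns (e.g. a numpy array). Returns records read, or -1. */
    int read_all_counters(const char *path, int mod_idx,
                          int64_t *out, int max_records, int n_counters)
    {
        darshan_fd fd = darshan_log_open(path);
        void *buf = NULL;
        int n = 0;
        if (!fd)
            return -1;
        while (n < max_records &&
               darshan_log_get_record(fd, mod_idx, &buf) > 0) {
            /* assumes a generic layout: base record followed by counters */
            struct darshan_posix_file *rec = buf;
            memcpy(&out[(size_t)n * n_counters], rec->counters,
                   n_counters * sizeof(int64_t));
            n++;
        }
        free(buf);
        darshan_log_close(fd);
        return n;
    }
    """,
    libraries=["darshan-util"],
)

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)
```

The point is that only one call crosses the FFI boundary per module, with the per-record loop running at C speed; Python could hand in a preallocated numpy buffer via `ffi.from_buffer` and get all counters back in one call.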

We may want to justify the time investment pretty clearly before doing that, though.

jakobluettgau commented 3 years ago

So I think before deciding this, it might make sense to exchange ideas and brainstorm about what we should aim for as an internal representation and how we would like to handle memory management. I think this will also resolve some of the issues you pointed out about needing deeply nested loops for some analyses. You probably have a lot of input on this from the various other packages you are involved with.

A few things to keep in mind: a related issue is that it currently takes some workarounds to fetch from the same log in parallel. E.g., you cannot fetch individual records while alternating between different modules (a side effect of compression). A naive workaround just opens the file twice (sketched below), but it may be more appropriate to keep the multiple buffers that are needed on the C side. At the same time, iterating through 100k records completes in under a second in C but may take 1-2 minutes from Python as is. So I think this is something to get out of the way before thinking about taking advantage of things like Numba.
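For concreteness, the "open it twice" workaround amounts to something like the sketch below, hedged against the pydarshan CFFI backend (exact function signatures have varied across versions):

```python
from darshan.backend import cffi_backend as backend

# Two independent handles on the same log file: each keeps its own
# decompression state on the C side, so records from two modules can be
# fetched in an alternating fashion without clobbering a shared stream.
log_a = backend.log_open("example.darshan")
log_b = backend.log_open("example.darshan")

posix_rec = backend.log_get_generic_record(log_a, "POSIX")
mpiio_rec = backend.log_get_generic_record(log_b, "MPI-IO")
while posix_rec is not None or mpiio_rec is not None:
    # ... interleaved processing of the two record streams ...
    if posix_rec is not None:
        posix_rec = backend.log_get_generic_record(log_a, "POSIX")
    if mpiio_rec is not None:
        mpiio_rec = backend.log_get_generic_record(log_b, "MPI-IO")

backend.log_close(log_a)
backend.log_close(log_b)
```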

Similarly, individual records or ranges of records cannot easily be seeked to at will, which prevents us from exposing efficient iterators (as is, it is necessary to complete an iteration through a module before the C side is in a safe state again); for very large logs, or for some sampling tasks, this might become necessary. There is a variety of strategies to deal with this, and while it seems possible to work around it from Python alone, it feels cleaner to have cooperating C and Python changes. E.g., to get around the compression messing with record offsets, an index could be built that memorizes some compressor state for offset x, and parallel fetching could then be implemented on top of it (see the sketch below). But these are no longer incremental, isolated changes, unfortunately.
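One possible shape for that index, purely as a hypothetical sketch (the `zstate` capture would require cooperating C-side changes, e.g. snapshotting decompressor state at flush points; none of this exists today):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Checkpoint:
    record_no: int    # index of the record this checkpoint precedes
    file_offset: int  # byte offset into the compressed region
    zstate: bytes     # opaque serialized decompressor state (C-side)

@dataclass
class ModuleIndex:
    stride: int = 1000  # take a checkpoint every N records
    checkpoints: List[Checkpoint] = field(default_factory=list)

    def nearest(self, record_no: int) -> Checkpoint:
        """Checkpoint to resume from so that reaching `record_no` needs
        at most `stride` records of forward decompression. Assumes the
        index pass always records a checkpoint at record 0."""
        eligible = [c for c in self.checkpoints if c.record_no <= record_no]
        return max(eligible, key=lambda c: c.record_no)
```

With such an index built in one full pass, random access and parallel fetching reduce to "seek to the nearest checkpoint, then decompress forward", at the cost of the index pass and the memory for the saved states.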

As for moving from CFFI to a CPython extension: since we already take advantage of an empty extension to automatically include libdarshan-util with the binary wheels, a CPython extension could be a good path forward.
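For reference, the trick being alluded to is roughly the following (paraphrased, not the exact pydarshan setup.py): a stub extension forces platform-specific wheels, so the shared library can legitimately be bundled as package data:

```python
from setuptools import setup, Extension

setup(
    name="pydarshan-sketch",  # illustrative, not the real package metadata
    packages=["darshan"],
    # A near-empty extension (stub .c file) makes the wheel
    # platform-specific, so shipping a compiled shared library is valid.
    ext_modules=[Extension("darshan.extension",
                           sources=["darshan/extension.c"])],
    package_data={"darshan": ["*.so"]},  # bundle libdarshan-util.so
)
```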

At the same time, there is a certain charm to relying only on a functioning libdarshan-util.so (which can easily be overlaid with a site or user/Spack install). Currently, libdarshan-util has almost no dependencies, but detailed HDF5 profiling, and potentially other upcoming APIs with opaque structures that may be logged away, could change that. I am not sure we really want those to become hard dependencies for pydarshan.