MazinLab / MKIDPipeline

The MKID Data Pipeline
http://web.physics.ucsb.edu/~bmazin/
Deprecate `pytables` in favor of `h5py` or a new format #98

Open ld-cd opened 3 months ago

ld-cd commented 3 months ago

pytables is a pretty consistent packaging issue and does not package all of its dependencies (namely the HDF5 library). It should likely be replaced with h5py to maintain file compatibility. If in-kernel queries are really needed, we should switch to a more modern data format going forward, such as Parquet, and use either pandas or polars for queries.

The other alternative is vendoring pytables and committing to maintaining Python compatibility and functional packaging going forward, but this is not something I have time to do.
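As a rough illustration of the h5py route, here is a minimal sketch of how a pytables in-kernel query could be approximated by reading a dataset with h5py and filtering it with a NumPy boolean mask. The dataset name `photons`, the column names (`resID`, `time`, `wavelength`), and the helper functions are hypothetical and not taken from the pipeline; the sketch also assumes the table fits in memory, which may not hold for real observations.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical photon record layout; the real pipeline's columns may differ.
photon_dtype = np.dtype([("resID", "u4"), ("time", "u4"), ("wavelength", "f4")])


def write_photons(path, photons):
    """Store a structured array as a single HDF5 compound dataset."""
    with h5py.File(path, "w") as f:
        f.create_dataset("photons", data=photons)


def query_wavelength(path, lo, hi):
    """Return photons with lo <= wavelength < hi.

    Loads the whole dataset, then filters in memory with a boolean mask.
    This stands in for a pytables in-kernel query.
    """
    with h5py.File(path, "r") as f:
        data = f["photons"][:]
    mask = (data["wavelength"] >= lo) & (data["wavelength"] < hi)
    return data[mask]


# Toy data, for illustration only.
photons = np.array(
    [(1, 10, 450.0), (1, 20, 900.0), (2, 30, 1200.0), (3, 40, 700.0)],
    dtype=photon_dtype,
)
path = os.path.join(tempfile.mkdtemp(), "photons.h5")
write_photons(path, photons)
selected = query_wavelength(path, 600.0, 1000.0)
```

For files larger than memory, the read would need to be done in slices rather than with a single `[:]`, which is part of the work pytables currently hides.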

bmazin commented 3 months ago

I think getting rid of pytables would be a ton of work. This should be a low priority unless there is a real show stopper.

ld-cd commented 3 months ago

Options for OLAP databases:

DuckDB publishes a perf comparison (so keep the bias in mind): https://duckdblabs.github.io/db-benchmark/

The 50 GB group-by benchmark is likely the most representative of the hot path in our current workload.
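For reference, the shape of query that benchmark exercises looks roughly like the sketch below, written with pandas on toy data. The column names (`resID`, `wavelength`) and the choice of aggregates are assumptions for illustration, not the pipeline's actual hot path.

```python
import pandas as pd

# Toy photon list; column names are hypothetical.
photons = pd.DataFrame(
    {
        "resID": [1, 1, 2, 3, 3, 3],
        "wavelength": [450.0, 900.0, 1200.0, 700.0, 800.0, 900.0],
    }
)

# Group photons by pixel and compute a count and a mean per group.
per_pixel = photons.groupby("resID").agg(
    count=("wavelength", "size"),
    mean_wavelength=("wavelength", "mean"),
)
```

The same grouping could be expressed in polars or DuckDB; the benchmark linked above compares how such engines scale when the input is tens of gigabytes.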

ld-cd commented 2 months ago

For a new format, binney will serialize into Parquet files (https://mazinlab.github.io/binney/binney.html#BinDirectory) and has optional polars support (https://mazinlab.github.io/binney/binney.html#BinDirectoryDF).