ODM2 / ODM2-performance-optimization

A repo to share and discuss methods for optimizing performance of ODM2 databases and software.

Try HDF5 / PyTables / Pandas integration to speed time series I/O #6

Open aufdenkampe opened 7 years ago

aufdenkampe commented 7 years ago

One of the ideas I had back in 2012-2013 when we were developing ODM2 was to use the HDF5 file format in certain cases to improve performance, because of the benefits of HDF5:

The two use cases I had in mind were:

Given that the ODM2PythonAPI uses the Pandas library, I think we could tap into HDF5 very easily for one or more of these uses. Here are a few links to information:

In writing this, I came across some recent posts from people who are not happy with HDF5 (such as http://cyrille.rossant.net/moving-away-hdf5/ and https://www.rustprooflabs.com/2014/11/data-processing-in-python). I read the comments on the first of these two articles, and it sounds like most of the issues have been with improper direct use of the C library (and with the lack of a JavaScript library).

People who use the Python libraries (h5py and PyTables) seem to have very positive experiences. The facts that PyTables is actively supported by ContinuumIO and NumFOCUS, that it is an important package in the SciPy ecosystem, and that HDF5 underlies Matlab's binary file format all suggest that HDF5 is still well loved and useful to many.

If we approached the use of HDF5, we would want to use refined libraries (such as the Pandas/PyTables integration) and to take small steps, such as the time-series data caching work that Jeff was just describing to me for improving the CSV delivery of EnviroDIY datasets.
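As one example, here is a minimal sketch of that kind of caching via the Pandas/PyTables integration (`pandas.HDFStore`). The file name, key, and column below are hypothetical placeholders, not existing ODM2 code:

```python
import numpy as np
import pandas as pd

# Stand-in for one ODM2 Result's values: 5-minute data indexed by datetime.
idx = pd.date_range('2017-01-01', periods=100000, freq='5min')
df = pd.DataFrame({'datavalue': np.random.randn(len(idx))}, index=idx)

# Cache the series to HDF5; format='table' makes it queryable on disk.
df.to_hdf('site_cache.h5', key='result_42', format='table')

# Fetch a single month without loading the whole series into memory.
subset = pd.read_hdf('site_cache.h5', 'result_42',
                     where="index >= '2017-03-01' & index < '2017-04-01'")
```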

I'm interested in your thoughts!

cc: @horsburgh, @sreeder, @emiliom, @lsetiawan, @miguelcleon

aufdenkampe commented 7 years ago

I recently came across the TsTables library, "a Python package to store time series data in HDF5 files using PyTables". It acts as a simple extension to PyTables and works directly with Pandas DataFrames.

For documentation, see http://andyfiedler.com/projects/tstables/

Its target use case is minute-resolution data, and it boasts sub-second benchmarks for appending or fetching a random month of data out of a 32-million-row file.
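As a rough sketch of how we might use it (untested against ODM2; the class, file, and column names below are hypothetical), usage along the lines of the TsTables documentation would look like this:

```python
from datetime import datetime

import pandas as pd
import pytz
import tables
import tstables  # importing tstables adds create_ts() to tables.File

# Hypothetical row layout for one Result's values.
class ResultValue(tables.IsDescription):
    timestamp = tables.Int64Col(pos=0)  # TsTables stores times as int64 epoch values
    datavalue = tables.Float64Col(pos=1)

f = tables.open_file('result_42.h5', mode='a')
ts = f.create_ts('/', 'values', ResultValue)

# Append a DataFrame that has a timezone-aware DatetimeIndex.
idx = pd.date_range('2017-01-01', periods=1000000, freq='min', tz=pytz.utc)
df = pd.DataFrame({'datavalue': 0.0}, index=idx)
ts.append(df)

# Fetch one month; TsTables partitions rows by date, so only the needed
# chunks are read from disk.
month = ts.read_range(datetime(2017, 3, 1, tzinfo=pytz.utc),
                      datetime(2017, 4, 1, tzinfo=pytz.utc))
f.close()
```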

I'm imagining that all the results from a given site could be mirrored from ODM2 into a single HDF5 file, and that requests to fetch and subset the data would go to that file rather than to the ODM2 database. Clients, however, would only write to the database.

Thoughts? This sounds less clunky than mirroring to an InfluxDB instance and fetching from there.

aufdenkampe commented 7 years ago

Here are the notes on how @roelofversteeg uses HDF5, from our Aug. 29 ODM2 team call.

Integrate with HDF5 for DTS (Distributed temperature sensing)

horsburgh commented 7 years ago

@aufdenkampe - It isn't just the speed of file access. We also have to maintain the integrity of transactions and data across multiple simultaneous users reading and writing in a web environment. Individual files are not well suited for this.

aufdenkampe commented 7 years ago

@horsburgh, the approach that I've read about uses a hybrid of an RDBMS (mostly PostgreSQL) with HDF5 files. The idea is that PostgreSQL would handle all the writes from multiple users, and also most (if not all) of the reads, so transactional integrity is maintained. I see two approaches to integrating with HDF5 files. For both, there would be a 1:1 map between a Result (or a Dataset) and its specific HDF5 file.

  1. A Result's values table is automatically mirrored to a corresponding binary HDF5 file, with the primary copy kept in the RDBMS. The HDF5 file would only ever be read by a user action.
  2. A Result points to (or actually stores) a corresponding binary HDF5 file, with the transactional integrity of modifying that file managed by PostgreSQL (I think). See https://www.postgresql.org/docs/9.6/static/datatype-binary.html. A rough sketch of this option is below.
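For the second option, here is a minimal sketch of what the write path might look like, assuming a hypothetical `resulthdf5cache` table with a `bytea` column (none of these names exist in ODM2; illustration only):

```python
import psycopg2

# Replace a Result's cached HDF5 file inside an ordinary PostgreSQL
# transaction, so the swap is atomic with any other writes.
conn = psycopg2.connect(dbname='odm2')
with conn, conn.cursor() as cur:
    with open('result_42.h5', 'rb') as fh:
        cur.execute(
            "UPDATE resulthdf5cache SET hdf5file = %s WHERE resultid = %s",
            (psycopg2.Binary(fh.read()), 42),
        )
conn.close()
```

Whether storing multi-megabyte files as `bytea` performs well enough would need testing.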

For the EnviroDIY (and perhaps most other) use cases, only one user has the privileges to write data to a Result's values table, and the writing happens via our applications (POST requests), so simultaneous writes from different users are not possible on those systems.

aufdenkampe commented 7 years ago

@horsburgh, for the record, and based on Roelof's experience, it appears that there are other approaches -- such as Materialized views, storing the data as array objects, indexing, etc. (as described in issue #7) -- that would be worth trying before HDF5 integration.

horsburgh commented 7 years ago

@aufdenkampe - agreed, these may initially be more promising just because they use the existing machinery of PostgreSQL.