LimnoTech / HSPsquared

Hydrologic Simulation Program Python (HSPsquared)
GNU Affero General Public License v3.0

Write input data to Parquet file #23

Open aufdenkampe opened 3 years ago

aufdenkampe commented 3 years ago

We've decided to write all input data to a Parquet file, a high-performance binary storage format designed for big data and cloud computing.

Parquet is tightly integrated with pandas and, like HDF5, is designed to manage complex hierarchical, nested data structures.

Our intent is to support both HDF5 and Parquet for storage of input and output data.
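
For reference, the round trip through pandas is minimal. This is only a sketch: the file path and column names are illustrative, and it assumes the pyarrow engine is installed.

```python
import pandas as pd

# Minimal sketch of round-tripping a timeseries DataFrame through Parquet
# (file path and column names are illustrative; requires the pyarrow engine).
ts = pd.DataFrame(
    {"PREC": [0.0, 0.1, 0.0], "PETINP": [0.02, 0.03, 0.02]},
    index=pd.date_range("2001-01-01", periods=3, freq="h"),
)

ts.to_parquet("timeseries.parquet", engine="pyarrow")

# The DatetimeIndex is preserved via pandas metadata stored in the file.
ts_roundtrip = pd.read_parquet("timeseries.parquet", engine="pyarrow")
```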

ptomasula commented 3 years ago

@aufdenkampe @steveskrip @htaolimno Switching the code to support Parquet files may prove to be a larger undertaking than initially expected. It seems the main method takes an HDF5 file as an argument and then opens it as an HDFStore, which is passed from the main method down to the various sub-processes. Making the switch will require updating all of those methods to use a different file format as well.
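
Roughly the pattern in question (the function names and key below are illustrative, not the actual HSP2 signatures):

```python
import pandas as pd

def main(hdfname):
    # The open HDFStore object itself is threaded through to every sub-process.
    with pd.HDFStore(hdfname, mode="r") as store:
        run_activity(store, "/TIMESERIES/TS001")

def run_activity(store, key):
    ts = store[key]   # HDFStore-specific access; a Parquet backend needs a different call here
    return ts.mean()  # placeholder for the actual hydrology computation
```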

I'm also not super keen on having a single file format be the only one supported by the business logic. I think there's a strong argument for making a pandas DataFrame the level at which the main code interfaces with the input data, and writing the appropriate utilities to read other file formats and return them as uniformly formatted pandas DataFrames. I'd like to think that through some more. One immediate issue that comes to mind is that holding all of the input data in memory as a single DataFrame could be a problem. Maybe the solution is to write a method that can pull out just the time series (TS) necessary for the specific operation (similar to this line) but without being specific to a file format, as sketched below. Open to other suggestions on how best to handle this.
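
One possible shape for such a utility. This is purely a sketch: the function name, file layout, and key handling are all assumptions, not existing HSP2 code.

```python
from pathlib import Path
import pandas as pd

def get_timeseries(source, key):
    """Return only the requested timeseries as a DataFrame, hiding the
    storage format from the business logic. (Hypothetical helper; the
    name, file layout, and key handling are assumptions.)"""
    source = Path(source)
    if source.suffix.lower() in (".h5", ".hdf5"):
        # HDF5: read a single node by key, e.g. "/TIMESERIES/TS001",
        # without loading the rest of the store.
        return pd.read_hdf(source, key=key)
    if source.is_dir():
        # Parquet laid out as one file per timeseries in a directory;
        # a partitioned dataset or row-group filter would work similarly.
        return pd.read_parquet(source / (key.strip("/").replace("/", "_") + ".parquet"))
    raise ValueError(f"Don't know how to read timeseries from {source}")
```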