LimnoTech / HSPsquared

Hydrologic Simulation Program Python (HSPsquared)
GNU Affero General Public License v3.0

Refactor I/O to rely on DataFrames & provide storage options #27

Open aufdenkampe opened 3 years ago

aufdenkampe commented 3 years ago

Implement the plan we've discussed to abstract out model I/O. This will enable us to read/write to Parquet (#23), Zarr, and other high-performance file formats, in addition to HDF5.

For performance, consider using Dask DataFrames, which are very similar to and integrated with pandas DataFrames, but add built-in parallelization and chunking for performant manipulation of very large datasets that are bigger than RAM. We might want to use Dask DataFrames instead of pandas DataFrames throughout the repo (or at least make it easy to switch between the two).

Dask Tutorial: http://gallery.pangeo.io/repos/pangeo-data/pangeo-tutorial-gallery/dask.html
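To make the idea concrete, here is a minimal, hypothetical sketch of such an I/O abstraction (the class and key names are invented for illustration, not taken from the repo): simulation code only ever exchanges DataFrames, and the backend decides how and where they are stored, so HDF5/Parquet/Zarr backends, or a Dask-backed frame library, could be swapped in later.

```python
import pandas as pd

class DataFrameIO:
    """Minimal in-memory backend; real backends would target HDF5, Parquet, or Zarr."""

    def __init__(self):
        self._store = {}

    def write(self, key: str, df: pd.DataFrame) -> None:
        self._store[key] = df

    def read(self, key: str) -> pd.DataFrame:
        # A Dask-backed implementation could instead return a
        # dask.dataframe and defer computation until .compute().
        return self._store[key]

io = DataFrameIO()
io.write("PERLND/P001/PWATER", pd.DataFrame({"PERO": [0.1, 0.2, 0.3]}))
print(io.read("PERLND/P001/PWATER")["PERO"].sum())
```

Because callers never see the storage layer, switching file formats (or swapping pandas for Dask) would only require a new backend class with the same `read`/`write` interface.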

htaolimno commented 3 years ago

Forked a branch develop_WDMclass for development of a WDM class and to separate HDF5 handling from the reader.

htaolimno commented 3 years ago

@ptomasula I encountered some issues in the code from the branch develop_readWDM that I forked from. When I commented out the numba decorators for debugging purposes, the code ran into an infinite loop. It took me a while to figure out that the default NumPy integer is only 4 bytes. This short length caused bits to be lost in your datetime encoding function. The code seems to run fine when the numba decorators are restored; I guess the compiled code may allocate a longer integer size, and therefore avoids the issue. Any comments?

ptomasula commented 3 years ago

@htaolimno, I also encountered that behavior when I was developing the datetime functions. The datetime integer uses 26 bits for the month through second components, and any remaining bits store the year information. As you noted, any 4-byte integer will cause problems, because we will lose bits related to the year component.
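For illustration, the truncation can be reproduced in plain Python, assuming the packed layout described above (this `encode` function is a stand-in for the repo's actual encoder, not a copy of it):

```python
# Assumed layout: low 26 bits hold month..second, remaining bits hold the year.
def encode(year: int, md_sec: int) -> int:
    return (year << 26) | md_sec

packed = encode(2021, 12345)        # needs more than 32 bits
truncated = packed & 0xFFFFFFFF     # simulate storage in a 4-byte integer

print(packed >> 26)     # 2021 — year survives at full width
print(truncated >> 26)  # 37 — high bits of the year are gone
```

Only the high (year) bits are lost; the low 26 bits survive truncation intact, which is why the symptom shows up as corrupted years rather than corrupted timestamps within a day.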

I thought I had solved that with the conversion to a Python int object on these lines. Without numba, I would expect whatever integer was input to be converted to a Python int, which is an arbitrary-precision integer. It should expand the number of bits used to accommodate the value, even if the initial type of the integer was only 32-bit. So I'm surprised you encountered this issue when you commented out the numba decorators.

However, I did some testing, and there is also some problematic behavior in how the int conversion works when numba is applied. Under numba, if I explicitly define a numpy.int32 object and try to convert it to a Python int, I can still get a numpy.int32 object back (i.e. type(int(numpy.int32)) == numpy.int32), so numba is taking some liberties in interpreting the type for that conversion. Anyway, I think the solution to both of these problems is the same: let's explicitly define the integers used in these datetime functions as 64-bit, using the numpy.int64 object. You can pull in commit 1a62f38, which should solve the issue you encountered.
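A quick sketch of the proposed fix as it behaves outside numba, where plain-Python semantics apply (the shift amount of 26 matches the layout described above; the rest is illustrative):

```python
import numpy as np

# Without numba, int() yields an arbitrary-precision Python int...
value = int(np.int32(2021))
print(type(value))  # <class 'int'>

# ...and widening to int64 before shifting preserves the year bits.
packed = np.int64(2021) << 26
print(int(packed) >> 26)  # 2021
```

Under numba's compiled semantics the `int()` call above is exactly what cannot be relied on, which is why pinning the working type to numpy.int64 is the more robust fix.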

htaolimno commented 3 years ago

@ptomasula I hesitate to pull in your most recent changes, since I am further along in my development. Instead, I've developed a simple solution for the time being: I kept the numpy.int32, but used .item() to pass the underlying Python integer to the datetime encoding subroutine.
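That interim workaround can be illustrated as follows (the encoder here is a stand-in for the repo's subroutine):

```python
import numpy as np

def encode_datetime(year: int, md_sec: int) -> int:
    # Stand-in for the repo's datetime encoding subroutine.
    return (year << 26) | md_sec

dtpos = np.int32(2021)
# .item() extracts the underlying arbitrary-precision Python int,
# so the shift below cannot overflow a 32-bit container.
packed = encode_datetime(dtpos.item(), 12345)
print(packed >> 26)  # 2021
```

Unlike relying on `int()` under numba, `.item()` is an explicit NumPy scalar method with well-defined behavior, so the 32-bit arrays elsewhere in the reader can stay as-is.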

aufdenkampe commented 3 years ago

We tried to merge in Paul's fixes with 22b457abef2f9fd9c2c213af078a021c579894fd, but resolving the merge conflict stripped out most of the fix.

@ptomasula, can you fix?

ptomasula commented 3 years ago

@aufdenkampe @htaolimno I reset the branch prior to the merge, and then performed a new merge where I manually resolved the merge conflicts. See commit d64c25a.

This also gave me a chance to review the code, so I can provide some feedback on what I think the next steps are. I'll open a pull request shortly so I can type up the notes in the code-review format.

aufdenkampe commented 3 years ago

@ptomasula, awesome! That's a lot cleaner and much easier to understand. Thanks.

ptomasula commented 3 years ago

@htaolimno I did a code review of the WDMReader class. Thank you so much for the solid work on this first phase. I think the class is pretty close, and with some restructuring it should be good to merge in. Hua, when you have a chance, please look over the code review and see if you can start addressing some of the comments.

Meanwhile, this week I'll start putting together my thoughts on the I/O abstraction class, and maybe even write an outline of a class that we can build off of.

aufdenkampe commented 3 years ago

@ptomasula, thanks for setting up PR #34 and doing an initial code review with suggestions.

As you ponder a new I/O abstraction class, it might be useful to know that I did some further exploration of HDF5 this weekend, and I've changed my tune regarding PyTables. The short story is that we will continue to rely on PyTables for our HDF5 I/O, which gives us nice integration with pandas and Dask DataFrames.

So let's get the abstraction to a level where we're managing the tables in Pandas (and interchangeably with Dask).