RTIInternational / teehr

Tools for Exploratory Evaluation in Hydrologic Research
https://rtiinternational.github.io/teehr/
MIT License

Move/add loading functions to table classes #259

Open · mgdenno opened this issue 2 weeks ago

mgdenno commented 2 weeks ago

Perhaps, for example, we should have:

ev.timeseries.load_csv() -> convert -> validate -> load
ev.timeseries.load_parquet() -> convert -> validate -> load
ev.timeseries.load_netcdf() -> convert -> validate -> load
ev.timeseries.validate() -> validates both datasets that are being loaded, but can also be called independently
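
A minimal sketch of how this pipeline could hang off a table class (the method and helper names here are assumptions for illustration, not the existing teehr API):

```python
class TimeseriesTable:
    """Hypothetical table class owning its own loading pipeline."""

    def __init__(self, ev):
        self._ev = ev  # parent Evaluation object

    def load_csv(self, path: str, **kwargs):
        df = self._convert_csv(path, **kwargs)  # read and map to the schema
        self.validate(df)                       # schema/field checks
        self._load(df)                          # write into the dataset

    def load_parquet(self, path: str, **kwargs):
        df = self._convert_parquet(path, **kwargs)
        self.validate(df)
        self._load(df)

    def validate(self, df=None):
        # Validate a frame that is being loaded, or, when called with
        # no argument, the dataset that is already on disk.
        ...
```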
mgdenno commented 1 week ago

@samlamont what do you say about this idea?

samlamont commented 1 week ago

@mgdenno Looks good to me. But it makes me want to add ev.timeseries.plot() :) (Maybe we could have plot methods on the timeseries and location table classes and also use the accessor on the metrics results?)

I wonder if we should also do ev.joined_timeseries.create() or .load()? We could also have a .describe() method on each table to summarize its contents (like get_field_statistics()).

So we would have ev.fetch and ev.metrics, and the rest would be driven through the table classes (?)
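
To make that split concrete, a rough sketch of the surface this would give us (all names and arguments hypothetical):

```python
# Hypothetical end-state API under this proposal:
ev.timeseries.load_parquet("obs.parquet")  # loading lives on the table
ev.joined_timeseries.create()              # build the joined table
ev.joined_timeseries.describe()            # summarize field statistics
ev.locations.plot()                        # quick-look plotting per table
metrics_df = ev.metrics.query(...)         # fetch/metrics stay on ev
```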

mgdenno commented 1 week ago

I like the ev.joined_timeseries.create() and ev.joined_timeseries.describe() ideas.

I like ev.timeseries.plot() too, but I feel like we would need to solve the re-collecting issue to really do it. Although I like the clean syntax, I'm afraid the user experience would not be good otherwise. I wonder if there could be a cache/clear mechanism that would prevent re-collecting results from PySpark? It would have to clear every time filter(), order(), or query() was called or chained.
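
One possible shape for such a mechanism, purely as a sketch (the wrapper class and its methods are assumptions, not teehr code): cache the collected pandas frame and invalidate it whenever the underlying query changes.

```python
from pyspark.sql import DataFrame


class CachedTableQuery:
    """Hypothetical wrapper: caches collected results until the query changes."""

    def __init__(self, df: DataFrame):
        self._df = df
        self._collected = None  # locally cached pandas DataFrame

    def filter(self, condition) -> "CachedTableQuery":
        self._df = self._df.filter(condition)
        self._collected = None  # any query change invalidates the cache
        return self

    def order_by(self, *cols) -> "CachedTableQuery":
        self._df = self._df.orderBy(*cols)
        self._collected = None
        return self

    def to_pandas(self):
        # Collect once; reuse for subsequent plot()/describe() calls.
        if self._collected is None:
            self._collected = self._df.toPandas()
        return self._collected

    def plot(self, **kwargs):
        return self.to_pandas().plot(**kwargs)
```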

As far as having different methods on the tables vs. the metrics, maybe as long as they were similar enough it wouldn't be too confusing. In this case, for a timeseries query, would it re-query each time a plot was created? On small datasets this would be fine; on larger ones, I am less sure how performance would be.

Do you envision there being one plot method that takes arguments to specify what to plot and how, or multiple plot methods?

samlamont commented 1 week ago

On the plotting, I was initially thinking of one method that takes arguments, but maybe it could be similar to the pandas pattern, where you can do both df.plot(kind="scatter", ...) and df.plot.scatter(...). It looks like PySpark has a persist() method that could potentially be used to cache the data. It would take some experimentation.
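
For reference, the pandas-style dual interface comes from making the accessor object itself callable; a minimal sketch of the pattern (not tied to any teehr internals):

```python
class PlotAccessor:
    """Callable accessor: supports both .plot(kind=...) and .plot.scatter(...)."""

    def __init__(self, data):
        self._data = data

    def __call__(self, kind="line", **kwargs):
        # df.plot(kind="scatter", ...) dispatches to the named method.
        return getattr(self, kind)(**kwargs)

    def line(self, **kwargs):
        ...  # render a line plot of self._data

    def scatter(self, **kwargs):
        ...  # render a scatter plot of self._data
```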

Maybe it's a moot point, however, because I found that the accessor does actually have access to the dataframe and its contents. The attrs property also seems to let us pass information along with it, so we could include, for example, the list of metric pydantic models (which could carry plotting info) along with other metadata, all of which could then be read by the plotting methods in the accessor class. Maybe that's the way to go. Sorry to derail the original point of this issue!
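
As a sketch of that idea (the accessor name and the attrs key are assumptions, not teehr code): pandas lets you register a custom dataframe accessor, and DataFrame.attrs carries arbitrary metadata along with the frame.

```python
import pandas as pd


@pd.api.extensions.register_dataframe_accessor("teehr")
class TeehrPlotAccessor:
    """Hypothetical accessor reading metadata stashed in DataFrame.attrs."""

    def __init__(self, pandas_obj: pd.DataFrame):
        self._obj = pandas_obj

    def plot(self, **kwargs):
        # Metric pydantic models (with plotting info) would travel in attrs.
        metric_models = self._obj.attrs.get("metric_models", [])
        for model in metric_models:
            ...  # use model-specific plotting info here


# Producer side (illustrative): attach metadata when returning results.
# df = metrics_query_result
# df.attrs["metric_models"] = included_metric_models
# df.teehr.plot()
```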

samlamont commented 1 week ago

@mgdenno I can work on this issue unless you prefer to?