RTIInternational / teehr

Tools for Exploratory Evaluation in Hydrologic Research
https://rtiinternational.github.io/teehr/
MIT License

Investigate building a data pre-processor #67

Closed by samlamont 2 months ago

samlamont commented 1 year ago

Summary: Brainstorming notes on building a pre-processor to convert raw data to the TEEHR data model, decoupling the processes of downloading raw data and importing to TEEHR. This would allow us to optimize each process independently.

Benefits

• By skipping the steps to format to the TEEHR data model, NWM point data can be downloaded much more efficiently (ex. 2 days of medium range forecasts for all reaches in < 10 mins on a large 2i2c instance; single jsons --> multizarrtozarr --> dask dataframe --> partitioned parquet)
• The pre-processor can import raw data directly into a duckdb database, potentially improving query performance (as opposed to querying parquet files)
• The pre-processor can include methods to calculate additional fields (e.g., binary exceedance flags) to further optimize query performance
• A standalone tool will provide more flexibility for users to import raw data to the TEEHR data model (likely necessary for testbeds)
• Components (see the sketch below):
-- Field mapping to connect raw data attributes to the TEEHR data model
-- Methods to convert local files (parquet, csv, etc.) to the TEEHR data model
-- Pydantic models of the TEEHR data model tables
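
A minimal sketch of what the Pydantic models and field mapping might look like (the class, field names, and mapping keys here are illustrative assumptions, not an existing TEEHR API):

from datetime import datetime

from pydantic import BaseModel

# Hypothetical Pydantic model of a row in a TEEHR timeseries table.
class TimeseriesRow(BaseModel):
    reference_time: datetime
    value_time: datetime
    location_id: str
    value: float
    measurement_unit: str
    configuration: str
    variable_name: str

# Hypothetical mapping from raw NWM point-output attributes to TEEHR fields.
NWM_FIELD_MAP = {
    "feature_id": "location_id",
    "time": "value_time",
    "streamflow": "value",
}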

cc. @mgdenno

samlamont commented 1 year ago

General notes

Another thought:

In the last CIROH call, Andy asked whether the user could specify their own metrics to calculate on the query side. Could we build functionality to pass a function in the query to calculate a custom metric? (Or does this already exist?)
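
A rough pandas-based sketch of what passing a user-defined metric could look like (get_custom_metrics, peak_error, and the column names are illustrative assumptions, not an existing TEEHR API):

import pandas as pd

def peak_error(group: pd.DataFrame) -> float:
    # Hypothetical custom metric: difference between simulated and observed peaks.
    return group["secondary_value"].max() - group["primary_value"].max()

def get_custom_metrics(joined: pd.DataFrame, group_by: list, metrics: dict) -> pd.DataFrame:
    # Apply each user-supplied callable to every group and collect the results.
    return joined.groupby(group_by).apply(
        lambda g: pd.Series({name: fn(g) for name, fn in metrics.items()})
    )

# Usage: get_custom_metrics(joined_df, ["primary_location_id"], {"peak_error": peak_error})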

cc. @mgdenno

samlamont commented 1 year ago

Potential approach:

dataset = TeehrDataset(
    primary_filepath="",
    secondary_filepath="",
    locations="",
    # etc.
)

dataset.join()
dataset.validate()
dataset.add_field(
    name="",
    func=some_function,
)
dataset.get_metrics(filters=filters)  # plus group_by, order_by, etc.
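
For example, add_field could accept a callable that derives a new column, such as the binary exceedance flag mentioned above (a hypothetical usage sketch; the field name, threshold column, and lambda are illustrative):

# Hypothetical: flag timesteps where the primary value exceeds a per-location threshold.
dataset.add_field(
    name="exceeds_threshold",
    func=lambda df: (df["primary_value"] > df["threshold"]).astype(int),
)
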
samlamont commented 1 year ago

Initial comparison of the DB-based query vs. the parquet-based query, e.g.:

order_by = ["lead_time", "primary_location_id"]
group_by = ["lead_time", "primary_location_id"]

# tqu is assumed here to be TEEHR's duckdb-based query module;
# DATABASE_FILEPATH points to the persisted duckdb database.
df1 = tqu.get_metrics(DATABASE_FILEPATH,
                      group_by=group_by,
                      order_by=order_by,
                      include_metrics="all")

Initial results show a ~2x speedup with the database approach at the expense of file size (the DB is ~10x the size of the raw parquet); more testing is needed.
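
For reference, the two query paths being compared look roughly like this (a sketch assuming a joined_timeseries table and illustrative paths and columns, not the exact benchmark code):

import duckdb

# Query the persisted duckdb database file.
con = duckdb.connect("teehr.db")
db_df = con.sql("""
    SELECT lead_time, primary_location_id, AVG(primary_value) AS avg_value
    FROM joined_timeseries
    GROUP BY lead_time, primary_location_id
    ORDER BY lead_time, primary_location_id
""").df()

# Query the raw parquet files directly.
pq_df = duckdb.sql("""
    SELECT lead_time, primary_location_id, AVG(primary_value) AS avg_value
    FROM read_parquet('joined/*.parquet')
    GROUP BY lead_time, primary_location_id
    ORDER BY lead_time, primary_location_id
""").df()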

This branch also addresses issues #69 and #42.

samlamont commented 1 year ago

Next steps for TeehrDataset class:

mgdenno commented 1 year ago

Some notes while I am looking at this branch (before it is complete):

samlamont commented 8 months ago

@mgdenno I think this issue about the TeehrDataset class can be closed; let me know if you'd like to keep it open.

mgdenno commented 2 months ago

Closing as superseded by v0.4 plans.