RTIInternational / teehr

Tools for Exploratory Evaluation in Hydrologic Research
https://rtiinternational.github.io/teehr/
MIT License

Investigate building a data pre-processor #67

Closed by samlamont 2 months ago

samlamont commented 1 year ago

Summary: Brainstorming notes on building a pre-processor to convert raw data to the TEEHR data model, decoupling the processes of downloading raw data and importing to TEEHR. This would allow us to optimize each process independently.

Benefits

• By skipping the steps to format to the TEEHR data model, NWM point data can be downloaded much more efficiently (ex. 2 days of medium range forecasts for all reaches in < 10 mins on a large 2i2c instance; single jsons --> multizarrtozarr --> dask dataframe --> partitioned parquet)
• The pre-processor can import raw data directly into a duckdb database, potentially improving query performance (as opposed to querying parquet files)
• The pre-processor can include methods to calculate additional fields (e.g., binary exceedance flags) to further optimize query performance
• A standalone tool will provide more flexibility for users to import raw data to the TEEHR data model (likely necessary for testbeds)
• Components (see the sketch below):
-- Field mapping to connect raw data attributes to the TEEHR data model
-- Methods to convert local files (parquet, csv, etc.) to the TEEHR data model
-- Pydantic models of the TEEHR data model tables
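
A minimal sketch of what the Pydantic models and field mapping might look like (the class, field names, and mapping keys here are illustrative assumptions, not an existing TEEHR API):

from datetime import datetime

from pydantic import BaseModel

# Hypothetical Pydantic model of a row in a TEEHR timeseries table.
class TimeseriesRow(BaseModel):
    reference_time: datetime
    value_time: datetime
    location_id: str
    value: float
    measurement_unit: str
    configuration: str
    variable_name: str

# Hypothetical mapping from raw NWM point-output attributes to TEEHR fields.
NWM_FIELD_MAP = {
    "feature_id": "location_id",
    "time": "value_time",
    "streamflow": "value",
}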

cc. @mgdenno

samlamont commented 1 year ago

General notes

Another thought:

In the last CIROH call, Andy asked whether the user could specify their own metrics to calculate on the query side. Could we build functionality to pass a function in the query to calculate a custom metric? (Or does this already exist?)
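
A rough pandas-based sketch of what passing a user-defined metric could look like (get_custom_metrics, peak_error, and the column names are illustrative assumptions, not an existing TEEHR API):

import pandas as pd

def peak_error(group: pd.DataFrame) -> float:
    # Hypothetical custom metric: difference between simulated and observed peaks.
    return group["secondary_value"].max() - group["primary_value"].max()

def get_custom_metrics(joined: pd.DataFrame, group_by: list, metrics: dict) -> pd.DataFrame:
    # Apply each user-supplied callable to every group and collect the results.
    return joined.groupby(group_by).apply(
        lambda g: pd.Series({name: fn(g) for name, fn in metrics.items()})
    )

# Usage: get_custom_metrics(joined_df, ["primary_location_id"], {"peak_error": peak_error})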

cc. @mgdenno

samlamont commented 1 year ago

Potential approach:

dataset = TeehrDataset(
    primary_filepath="",
    secondary_filepath="",
    locations="",
    # etc.
)

dataset.join()
dataset.validate()
dataset.add_field(
    name="",
    func=some_function,
)
dataset.get_metrics(filters=filters)  # plus group_by, order_by, etc.
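
For example, add_field could accept a callable that derives a new column, such as the binary exceedance flag mentioned above (a hypothetical usage sketch; the field name, threshold column, and lambda are illustrative):

# Hypothetical: flag timesteps where the primary value exceeds a per-location threshold.
dataset.add_field(
    name="exceeds_threshold",
    func=lambda df: (df["primary_value"] > df["threshold"]).astype(int),
)
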
samlamont commented 1 year ago

Initial comparison of the DB-based query vs. the parquet-based query, e.g.:

order_by = ["lead_time", "primary_location_id"]
group_by = ["lead_time", "primary_location_id"]

# tqu is assumed here to be TEEHR's duckdb-based query module;
# DATABASE_FILEPATH points to the persisted duckdb database.
df1 = tqu.get_metrics(DATABASE_FILEPATH,
                      group_by=group_by,
                      order_by=order_by,
                      include_metrics="all")

Initial results show a ~2x speedup with the database approach at the expense of file size (the DB is ~10x the size of the raw parquet); more testing is needed.
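
For reference, the two query paths being compared look roughly like this (a sketch assuming a joined_timeseries table and illustrative paths and columns, not the exact benchmark code):

import duckdb

# Query the persisted duckdb database file.
con = duckdb.connect("teehr.db")
db_df = con.sql("""
    SELECT lead_time, primary_location_id, AVG(primary_value) AS avg_value
    FROM joined_timeseries
    GROUP BY lead_time, primary_location_id
    ORDER BY lead_time, primary_location_id
""").df()

# Query the raw parquet files directly.
pq_df = duckdb.sql("""
    SELECT lead_time, primary_location_id, AVG(primary_value) AS avg_value
    FROM read_parquet('joined/*.parquet')
    GROUP BY lead_time, primary_location_id
    ORDER BY lead_time, primary_location_id
""").df()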

This branch also addresses issues #69 and #42.

samlamont commented 1 year ago

Next steps for TeehrDataset class:

mgdenno commented 1 year ago

Some notes while I am looking at this branch (before it is complete):

samlamont commented 8 months ago

@mgdenno I think this issue about the TeehrDataset class can be closed; let me know if you'd like to keep it open.

mgdenno commented 2 months ago

Closing as superseded by v0.4 plans.