In the last CIROH call, Andy asked whether users could specify their own metrics to calculate on the query side -- could we build functionality to pass a function into the query to calculate a custom metric? (Or does this already exist?)
cc. @mgdenno
Potential approach:
```python
dataset = TeehrDataset(
    primary_filepath="",
    secondary_filepath="",
    locations="",
    # etc.
)
dataset.join()
dataset.validate()
dataset.add_field(
    name="",
    func=some_function,  # user-defined callable that computes the new field
)
dataset.get_metrics(filters, ...)
```
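As a rough illustration, a user-defined metric could be an ordinary callable applied per group. This is a sketch only -- the column names and the idea of mixing callables into include_metrics are assumptions, not existing API:

```python
import pandas as pd

# Hypothetical custom metric: fraction of timesteps where the secondary
# (e.g., simulated) value exceeds the primary (e.g., observed) value.
def exceedance_fraction(group: pd.DataFrame) -> float:
    return (group["secondary_value"] > group["primary_value"]).mean()

# Illustrative usage against the proposed API above (not implemented):
# dataset.get_metrics(
#     group_by=["primary_location_id"],
#     include_metrics=["bias", exceedance_fraction],  # built-ins plus a callable
# )
```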
Initial comparisons of a DB-based query vs. a parquet-based query, ex.:

```python
# Assumes tqu is the TEEHR query module imported earlier in the notebook.
order_by = ["lead_time", "primary_location_id"]
group_by = ["lead_time", "primary_location_id"]

df1 = tqu.get_metrics(
    DATABASE_FILEPATH,
    group_by=group_by,
    order_by=order_by,
    include_metrics="all",
)
```
Initial results show a ~2x speedup with the database approach at the expense of file size (the DB is ~10x the size of the raw parquet); more testing is needed.
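To make the comparison repeatable, a small generic timing harness (plain Python; only the tqu.get_metrics call above is assumed) could be run against both the DB and parquet variants:

```python
import time

def time_query(fn, *args, n=3, **kwargs):
    """Run a query n times and return (result, best wall-clock seconds)."""
    best = float("inf")
    result = None
    for _ in range(n):
        t0 = time.perf_counter()
        result = fn(*args, **kwargs)
        best = min(best, time.perf_counter() - t0)
    return result, best

df1, seconds = time_query(
    tqu.get_metrics,
    DATABASE_FILEPATH,
    group_by=group_by,
    order_by=order_by,
    include_metrics="all",
)
print(f"DB-based query, best of 3: {seconds:.2f}s")
```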
This branch also addresses issues #69 and #42.
Next steps for TeehrDataset class:

Some notes while I am looking at this branch (before it is complete):

- create_joined_timeseries_table() should rather be an insert_joined_timeseries() function and take the paths to the relevant parquet files. This would allow it to be run multiple times for different configurations against the same database. It would also make initializing with an existing database easier -- the initialization would be the same, taking only the path to the database file.
- field_dtype would be good in _add_field_name_to_joined_timeseries()
- A drop_first option to drop a calculated field before adding it
- A get_unique_field_values(field_name) method that would return all the distinct values for a field
- get_timeseries() methods. Maybe separate methods for primary and secondary, or an argument to indicate which.
- In the queries/duckdb_database.py file, some methods take the database_filepath and some take a duckdb.connection(); some return a dict and some return a pd.DataFrame. The less consistent it is, the harder it is to reason about what the code is doing. I think we need to make the way we connect to the database, and construct and execute queries, consistent across all methods (see the sketch after this comment).
- join_and_save_timeseries(include_geometry, geometry_filepath) are not in use. The related parts of the query can also be removed.
- __main__ is OK during initial testing, but tests should be in tests.

@mgdenno I think this issue about the TEEHRDataset class can be closed, let me know if you'd like to keep it open.
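On the consistency point above, a minimal sketch of one option: every query helper takes the database filepath and returns a pd.DataFrame through a single entry point. This assumes a joined_timeseries table and a duckdb version whose connections support context management; it is not the current code:

```python
import duckdb
import pandas as pd

def run_query(database_filepath: str, query: str) -> pd.DataFrame:
    """Open a connection, execute the query, and return a DataFrame."""
    with duckdb.connect(database_filepath) as con:
        return con.execute(query).df()

# A get_unique_field_values() built on the same pattern
# (field_name is assumed to be validated upstream):
def get_unique_field_values(database_filepath: str, field_name: str) -> pd.DataFrame:
    query = f"SELECT DISTINCT {field_name} FROM joined_timeseries ORDER BY {field_name};"
    return run_query(database_filepath, query)
```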
Closing as superseded by v0.4 plans.
Summary

Brainstorming notes on building a pre-processor to convert raw data to the TEEHR data model (de-coupling the processes of downloading raw data and importing to TEEHR). This would allow us to optimize each process independently.

Benefits

• By skipping the steps to format to the TEEHR data model, NWM point data can be downloaded much more efficiently (ex. 2 days of medium range forecasts for all reaches in < 10 mins on a large 2i2c instance; single jsons --> multizarrtozarr --> dask dataframe --> partitioned parquet)
• The pre-processor can import raw data directly into a duckdb database, potentially improving query performance (as opposed to querying parquet files)
• The pre-processor can include methods to calculate additional fields (e.g. binary exceedance flags) to further optimize query performance
• A standalone tool will provide more flexibility for users to import raw data to the TEEHR data model (likely necessary for testbeds)
• Components:
-- Field mapping to connect raw data attributes to the TEEHR data model
-- Methods to convert local files (parquet, csv, etc.) to the TEEHR data model
-- Pydantic models of TEEHR data model tables (a rough sketch follows below)
cc. @mgdenno
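Along those lines, a rough sketch of what one Pydantic table model could look like -- field names here are illustrative guesses, not the actual TEEHR data model spec:

```python
from datetime import datetime
from pydantic import BaseModel

class Timeseries(BaseModel):
    """Illustrative model of a single TEEHR timeseries record."""
    reference_time: datetime
    value_time: datetime
    value: float
    variable_name: str
    measurement_unit: str
    location_id: str
    configuration: str

# Validate a raw record during pre-processing (ISO strings are coerced
# to datetime by pydantic):
row = Timeseries(
    reference_time="2023-01-01T00:00:00",
    value_time="2023-01-01T01:00:00",
    value=12.3,
    variable_name="streamflow",
    measurement_unit="m^3/s",
    location_id="nwm-101",
    configuration="medium_range",
)
```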