RTIInternational / teehr

Tools for Exploratory Evaluation in Hydrologic Research
https://rtiinternational.github.io/teehr/
MIT License

Look into ways to calculate more complex aggregates #185

Closed: mgdenno closed this issue 3 months ago

mgdenno commented 4 months ago

TEEHR currently uses SQL to calculate all aggregates (i.e., metrics). This has proven effective and fairly fast. However, as the list of required metrics grows and the metrics themselves become more complex (ensemble, signature, event, etc.), the difficulty of implementing and maintaining them in SQL should be considered.
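For context, here is a minimal sketch of the SQL-based approach, using the duckdb Python package to compute a simple aggregate metric (relative bias) over a joined-timeseries Parquet file. The file path, column names, and metric choice are illustrative assumptions, not TEEHR's actual schema or code:

```python
# A minimal sketch of the SQL-based approach. Assumptions: the file path
# and column names are illustrative, not TEEHR's actual schema.
import duckdb

# Relative bias per location: sum(secondary - primary) / sum(primary)
result = duckdb.sql("""
    SELECT
        primary_location_id,
        SUM(secondary_value - primary_value) / SUM(primary_value) AS relative_bias
    FROM 'joined_timeseries.parquet'
    GROUP BY primary_location_id
""").df()
print(result.head())
```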

The current list of options seems to be:

- **DuckDB extension.** Writing a DuckDB extension would likely be fast, but it would not make it easy for users/developers to add new metrics or to take advantage of existing packages (a bootstrapping package, for example). It would also have to be written in C++.
- **PySpark.** PySpark seems to have many qualities that make it a good candidate for complex analytics on large and very large datasets (i.e., many billions of rows) in a horizontally scalable way. Initial tests have shown that it works well, but it is not as fast as something like DuckDB on datasets that fit in memory, so at smaller data volumes it would be slower than the current approach; whether that trade-off is acceptable depends on how much data we need to handle. Additionally, PySpark allows user-defined aggregates that take multiple arguments (e.g., primary and secondary timeseries values) and appears to execute them efficiently: something like `SELECT kge(prim, sec) FROM joined;`, where `kge()` is a Python function. A sketch of this approach follows below.
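As referenced above, a minimal sketch of what a Python-defined aggregate could look like in PySpark, using a grouped-aggregate pandas UDF to compute the Kling-Gupta Efficiency (Gupta et al., 2009). The DataFrame schema, column names, and file path are assumptions for illustration, not TEEHR's actual layout:

```python
# A minimal PySpark sketch. Assumptions: schema, column names, and the
# Parquet path are illustrative, not TEEHR's actual layout.
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("kge-sketch").getOrCreate()

@pandas_udf("double")
def kge(prim: pd.Series, sec: pd.Series) -> float:
    # Kling-Gupta Efficiency: 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2)
    r = np.corrcoef(prim, sec)[0, 1]     # linear correlation
    alpha = np.std(sec) / np.std(prim)   # variability ratio
    beta = np.mean(sec) / np.mean(prim)  # bias ratio
    return float(1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))

joined = spark.read.parquet("joined_timeseries.parquet")  # hypothetical path
result = joined.groupBy("primary_location_id").agg(
    kge("primary_value", "secondary_value").alias("kge")
)
result.show()
```

A pandas UDF of this form should also be registerable for Spark SQL (e.g., via `spark.udf.register`), which would preserve the `SELECT kge(prim, sec) FROM joined` style shown above.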

More to come...

mgdenno commented 3 months ago

We decided to use Spark.