TEEHR currently uses SQL to calculate all aggregates (i.e., metrics). This has proven effective and fairly fast. However, as the required list of metrics grows and the metrics themselves become more complex (ensemble, signature, event, etc.), the difficulty of implementing and maintaining them in SQL should be considered.
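For context, a metric such as the Kling-Gupta efficiency (KGE) can be expressed entirely with DuckDB's built-in SQL aggregates, but the expressions become unwieldy as the metric list grows. Below is a minimal sketch of that style, assuming a hypothetical joined table with `location_id`, `primary_value` (observed), and `secondary_value` (simulated) columns; the names and layout are illustrative, not TEEHR's actual schema.

```python
import duckdb
import numpy as np
import pandas as pd

# Hypothetical joined timeseries: one row per (location, time) with both values.
joined = pd.DataFrame({
    "location_id": ["gage-A"] * 100 + ["gage-B"] * 100,
    "primary_value": np.random.rand(200),
    "secondary_value": np.random.rand(200),
})

# KGE = 1 - sqrt((r - 1)^2 + (alpha - 1)^2 + (beta - 1)^2), built from
# DuckDB's corr(), stddev_pop(), and avg() aggregates.
query = """
SELECT
    location_id,
    1 - sqrt(
        pow(corr(secondary_value, primary_value) - 1, 2)
        + pow(stddev_pop(secondary_value) / stddev_pop(primary_value) - 1, 2)
        + pow(avg(secondary_value) / avg(primary_value) - 1, 2)
    ) AS kge
FROM joined
GROUP BY location_id;
"""

# DuckDB can scan the local pandas DataFrame 'joined' by name.
print(duckdb.query(query).to_df())
```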
The current list of options seems to be:
1. Continue to build out increasingly complex SQL using DuckDB.
2. Continue to use DuckDB, but write an extension containing a suite of water-related aggregates (i.e., metrics).
3. Use PySpark, which supports Python-based aggregate functions in SQL queries.
4. Something else (e.g., Dask)?
Writing a DuckDB extension would likely be fast, but it would have to be written in C++, which would make it difficult for users/developers to add new metrics or to take advantage of existing Python packages (e.g., a bootstrapping package).
PySpark seems to have many qualities that make it a good candidate for complex analytics on large and very large datasets (i.e., many billions of rows) in a horizontally scalable way. Initial tests have shown that it works well but is not as fast as DuckDB on datasets that fit in memory, so there is a trade-off: at smaller data volumes it would be slower than the current approach. How much data do we need to handle? Additionally, PySpark allows user-defined aggregates that take multiple arguments (e.g., primary and secondary timeseries values), and appears to execute them efficiently, for example `SELECT kge(prim, sec) FROM joined;` where `kge()` is a Python function.
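A minimal sketch of that pattern using a grouped-aggregate pandas UDF follows; the `joined` view, the `prim`/`sec` column names, and the `kge` implementation are illustrative assumptions, not TEEHR code.

```python
import numpy as np
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("teehr-kge-sketch").getOrCreate()

@pandas_udf("double")
def kge(prim: pd.Series, sec: pd.Series) -> float:
    """Kling-Gupta efficiency as a grouped-aggregate (Series-to-scalar) pandas UDF."""
    r = np.corrcoef(sec, prim)[0, 1]
    alpha = sec.std() / prim.std()
    beta = sec.mean() / prim.mean()
    return float(1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2))

# Hypothetical joined timeseries: one row per (location, time) with both values.
joined = spark.createDataFrame(
    pd.DataFrame({
        "location_id": ["gage-A"] * 50 + ["gage-B"] * 50,
        "prim": np.random.rand(100),
        "sec": np.random.rand(100),
    })
)
joined.createOrReplaceTempView("joined")

# Register the Python aggregate so it can be called directly from SQL.
spark.udf.register("kge", kge)

spark.sql(
    "SELECT location_id, kge(prim, sec) AS kge FROM joined GROUP BY location_id"
).show()
```

Note that grouped-aggregate pandas UDFs do not support partial aggregation: all rows for a group are loaded into memory on one executor as pandas Series. That should be acceptable here since each location's timeseries is small relative to the full dataset, while the groups themselves are still distributed across the cluster.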
More to come...