chaoss / augur

Python library and web service for Open Source Software Health and Sustainability metrics & data collection. You can find our documentation and new contributor information easily here: https://oss-augur.readthedocs.io/en/main/ and learn more about Augur at our website https://augurlabs.io
MIT License
583 stars, 844 forks

Optimize the time-series storage #2240

Closed: jonatas closed this issue 1 year ago

jonatas commented 1 year ago

I went to an amazing workshop with @sgoggins, and I'm very happy to now understand how to use Augur and the metrics available from CHAOSS.

I'm interested in knowing how useful it would be to adopt the TimescaleDB extension for all the time-series data. Both the commits and the repository timelines could benefit from the automatic partitioning that TimescaleDB hypertables provide. Compression would also bring better storage efficiency, and partitioning can give quick performance gains for time-series data, since many queries over partitions can be parallelized.
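To make the idea concrete, here is a minimal sketch of the statements such a migration might involve. It only builds SQL strings and never touches a database. The table name `commits`, the time column `cmt_committer_timestamp`, and the `repo_id` segment column are illustrative assumptions, not a confirmed Augur schema. `create_hypertable`, the `timescaledb.compress` table options, and `add_compression_policy` are TimescaleDB features; the 30-day threshold is an arbitrary example value.

```python
def hypertable_statements(table, time_column, segment_by, compress_after="30 days"):
    """Build SQL that converts a table into a hypertable and enables compression.

    All identifiers are hypothetical examples. The strings are built with plain
    formatting, so this is only suitable for trusted, hard-coded names.
    """
    return [
        "CREATE EXTENSION IF NOT EXISTS timescaledb;",
        # migrate_data => true moves rows that already exist into chunks.
        f"SELECT create_hypertable('{table}', '{time_column}', migrate_data => true);",
        # Segmenting by repository keeps each repository's rows together when compressed.
        f"ALTER TABLE {table} SET (timescaledb.compress, "
        f"timescaledb.compress_segmentby = '{segment_by}');",
        # Compress chunks once they are older than the given interval.
        f"SELECT add_compression_policy('{table}', INTERVAL '{compress_after}');",
    ]


statements = hypertable_statements("commits", "cmt_committer_timestamp", "repo_id")
for sql in statements:
    print(sql)
```

One caveat worth checking before trying this on a real table: TimescaleDB requires any primary key or unique index on a hypertable to include the time column, so existing constraints may need adjusting first.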

Continuous aggregates could replace some of the materialized views and provide a better interface for hierarchical aggregations over multiple timeframes.
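As a rough illustration, a daily commit count per repository could be expressed as a continuous aggregate along these lines. This is a sketch under assumptions: the view name `daily_commits`, the source table `commits`, and the columns `cmt_committer_timestamp` and `repo_id` are hypothetical, and the source table is assumed to already be a hypertable. `time_bucket`, the `timescaledb.continuous` view option, and `add_continuous_aggregate_policy` are TimescaleDB features; the offsets and schedule are example values only.

```python
def daily_count_cagg(view, table, time_column, group_column):
    """Return (create_view_sql, refresh_policy_sql) for a daily count rollup.

    Identifiers are hypothetical and must be trusted, hard-coded names,
    because they are interpolated directly into the SQL text.
    """
    create_view = (
        f"CREATE MATERIALIZED VIEW {view}\n"
        "WITH (timescaledb.continuous) AS\n"
        f"SELECT time_bucket(INTERVAL '1 day', {time_column}) AS bucket,\n"
        f"       {group_column},\n"
        "       count(*) AS commit_count\n"
        f"FROM {table}\n"
        f"GROUP BY bucket, {group_column};"
    )
    # Refresh the window between 3 days ago and 1 hour ago, once per hour.
    refresh_policy = (
        f"SELECT add_continuous_aggregate_policy('{view}',\n"
        "    start_offset => INTERVAL '3 days',\n"
        "    end_offset => INTERVAL '1 hour',\n"
        "    schedule_interval => INTERVAL '1 hour');"
    )
    return create_view, refresh_policy


create_view, refresh_policy = daily_count_cagg(
    "daily_commits", "commits", "cmt_committer_timestamp", "repo_id"
)
print(create_view)
print(refresh_policy)
```

A weekly or monthly rollup could be defined the same way with a different bucket width. Recent TimescaleDB releases also allow a continuous aggregate to be built on top of another one, which is what makes the hierarchical case attractive.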

I also see TimescaleDB as a great place to start more research by leveraging the toolkit library and moving part of the data analysis into the database. Data locality could avoid round trips to fetch and process the data, and would let Augur provide not only the raw data but also pre-processed indicators at the database level.

The toolkit could potentially support several data science research tasks that today are tied to Python and depend on third-party libraries.
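For example, a per-repository median that would otherwise be computed client-side could be pushed into a single query. The sketch below only builds the query text. It assumes a hypothetical table or view (here called `daily_commits`) with `repo_id` and `commit_count` columns, and it assumes the `timescaledb_toolkit` extension is installed. `percentile_agg` and `approx_percentile` are toolkit functions that return an approximate percentile, not an exact one.

```python
def median_per_repo_query(source, value_column="commit_count", group_column="repo_id"):
    """Build a query computing an approximate median per group in the database.

    `source` and the column names are hypothetical, trusted identifiers.
    """
    return (
        f"SELECT {group_column},\n"
        # percentile_agg builds a sketch; approx_percentile(0.5, ...) reads the median.
        f"       approx_percentile(0.5, percentile_agg({value_column}::double precision))"
        " AS median_value\n"
        f"FROM {source}\n"
        f"GROUP BY {group_column};"
    )


query = median_per_repo_query("daily_commits")
print(query)
```

Only the aggregated rows would then travel back to the Python side, instead of every underlying data point.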

I'd love to hear whether you're interested in adding such a dependency, as I see great potential in this addition.

I can help introduce it by providing code and documentation, or by supporting and mentoring anyone who is interested in learning more about time-series data inside PostgreSQL.

sgoggins commented 1 year ago

Hi @jonatas: It was great meeting you at Scale last week! Do you have recommended links or places where we might begin to explore time-series storage using TimescaleDB (a PostgreSQL plugin, if I recall correctly)?

jonatas commented 1 year ago

Yes, likewise @sgoggins!

Here are some links: Timescaledb Overview.

The main extension repo is here https://github.com/timescale/timescaledb

The toolkit is a Rust library that implements several data science functions, all of which work well with vanilla PostgreSQL as well as with time-series data: https://github.com/timescale/timescaledb-toolkit