feat: Support for persisting metrics into a scalable time series database

brightsparc commented 2 years ago

🚀 Feature

This feature request is to support a native time series database that offers the same rich query interface but persists data in an efficient format that can also be rolled up and optionally offloaded or archived over time.

Motivation

Aim currently writes metrics to RocksDB, which makes scaling out a tracking API challenging. It is also unlikely to support a multi-tenant environment, which could have thousands or millions of different experiments.

Pitch

Explore open source time series options that have demonstrated the ability to capture and report on metrics at scale, for example:

Grafana Loki, which supports metric queries and the LogQL query language
M3, a distributed TSDB that is the storage layer for Chronosphere and supports the PromQL query language
Redis TimeSeries, a lower-level solution that provides fast in-memory filtering of time series, including support for downsampling and aggregation
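
To make the roll-up idea concrete, here is a minimal sketch of downsampling in plain Python. It is not tied to any of the databases listed above; the function name `downsample_mean` and the `(step, value)` tuple layout are assumptions made for illustration only.

```python
from collections import defaultdict
from statistics import fmean


def downsample_mean(points, bucket_size):
    """Roll up (step, value) points into fixed-width buckets, averaging each bucket.

    Hypothetical helper for illustration; not part of Aim or any TSDB client.
    """
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    buckets = defaultdict(list)
    for step, value in points:
        # Integer division maps each step to the start of its bucket.
        buckets[(step // bucket_size) * bucket_size].append(value)
    return [(start, fmean(values)) for start, values in sorted(buckets.items())]


raw = [(0, 1.0), (1, 3.0), (2, 5.0), (3, 7.0), (4, 9.0)]
rolled_up = downsample_mean(raw, bucket_size=2)
print(rolled_up)
```

A time series database would typically do this server side and keep the coarser series while archiving or dropping the raw points.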

Alternatives

Alternatively, you could explore using a SaaS solution such as InfluxDB or Aiven.

At the other end of the spectrum, you could roll your own custom distributed TSDB on top of something like Apache BookKeeper, which provides an efficient write-ahead log and native offloading to cloud object stores.

Additional context

This feature would enable the solution to scale to many thousands or millions of different experiments, which would differentiate it from MLflow.

brightsparc commented 2 years ago

Another option could be ClickHouse, an open source, fast OLAP store. It supports a number of different log engines, which could be a good fit for metrics. It also supports an embedded RocksDB engine: https://clickhouse.com/docs/en/engines/table-engines/integrations/embedded-rocksdb/

See this post for more on engine types: https://www.alibabacloud.com/blog/selecting-a-clickhouse-table-engine_597726

alberttorosyan commented 2 years ago

@brightsparc thanks for the comment. Currently we are focused on a centralized tracking server implementation, which will allow us to have a more flexible setup on the server side. ClickHouse is a good choice for large volumes of data, and it has powerful features for data aggregation and sampling. Deleting data is somewhat problematic, but those issues are solvable. Given the requirements Aim has for its database, time series databases are worth considering. Once we start working on this, we'll make sure to redesign Aim so that the storage backend can be changed without breaking the SDK and the UIs.
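
As a rough illustration of what a swappable storage backend could look like, here is a minimal sketch. Every name in it (`MetricStore`, `InMemoryMetricStore`, `append`, `query`, `track_loss`) is hypothetical and does not reflect Aim's actual SDK or storage code.

```python
from abc import ABC, abstractmethod
from collections import defaultdict


class MetricStore(ABC):
    """Hypothetical storage interface; names are illustrative, not Aim's API."""

    @abstractmethod
    def append(self, run_id, name, step, value):
        """Record one metric value for a run."""

    @abstractmethod
    def query(self, run_id, name, start=0, stop=None):
        """Return (step, value) pairs with start <= step < stop, ordered by step."""


class InMemoryMetricStore(MetricStore):
    """Reference backend for tests; a TSDB-backed class would implement the same methods."""

    def __init__(self):
        self._series = defaultdict(list)

    def append(self, run_id, name, step, value):
        self._series[(run_id, name)].append((step, value))

    def query(self, run_id, name, start=0, stop=None):
        points = self._series.get((run_id, name), [])
        selected = [(s, v) for s, v in points if s >= start and (stop is None or s < stop)]
        return sorted(selected)


def track_loss(store: MetricStore):
    # Caller code depends only on the interface, so backends can be swapped.
    for step, loss in enumerate([0.9, 0.5, 0.3]):
        store.append("run-1", "loss", step, loss)
    return store.query("run-1", "loss", start=1)


result = track_loss(InMemoryMetricStore())
print(result)
```

With an interface along these lines, a RocksDB, ClickHouse, or time series backend could be contributed as a separate class without touching the calling code.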

pablete commented 2 years ago

Are there any updates on this goal? It would be very useful to have one implementation of a storage backend so that people can start contributing backends for different time series databases and comparing their performance.

gerilya commented 1 year ago

I think that adding support for multiple backends would help adoption enormously. I learned about this project just a few days ago and was very impressed by its UI, but without this kind of interoperability, integrating it into the existing infrastructure of a mature project seems like it would require too much effort.

aimhubio / aim