aimhubio / aim

Aim 💫 — An easy-to-use & supercharged open-source experiment tracker.
https://aimstack.io
Apache License 2.0

Polars Dataframe Backend #2290

Open JesseFarebro opened 1 year ago

JesseFarebro commented 1 year ago

🚀 Feature

Provide a Dataframe API using Polars to improve query performance.

Motivation

Querying and working with large amounts of data in Aim via Pandas DataFrames can be slow. Aim appears to process queries sequentially from RocksDB (perhaps this matters less since RocksDB is optimized for indexed lookups, so no full scan is needed?), and if you want to convert a large amount of data for processing with Pandas, this workflow can break down quickly.

Pitch

Polars is a "next-generation" DataFrame library that could help mitigate some of these issues. Polars already has backend implementations to read from a variety of sources, including CSV, Apache IPC, Apache Parquet, SQL, JSON, Apache Avro, etc.

The biggest advantage I'd see for Aim is the ability to lazily construct queries based on the operations users need to perform on the DataFrame. Aim potentially wouldn't need to fetch the entire DataFrame a priori, which could improve both performance and bandwidth usage. Furthermore, once the data is fetched, Polars' performance is much better than Pandas' (many operations run in parallel, the backend is implemented in Rust, etc.), resulting in an improved user experience.

Alternatives

One clear disadvantage is that the Polars DataFrame API isn't compatible with Pandas. That said, Polars has a lot of momentum (8.9k GitHub stars, very active development) and the API isn't hard to pick up. Aim could always support a fallback to Pandas, e.g., Polars allows you to directly convert to a Pandas DataFrame.

gorarakelyan commented 1 year ago

Hey @JesseFarebro. Thanks for the recommendation! Definitely will look into it.

btw, have you experienced perf. issues when working with DataFrames of exported metadata so far? Is there a use case for fetching a large volume of metadata at once? E.g., for metrics, Aim by default fetches 500-1000 uniformly sampled points regardless of the tracked step count. Are there any use cases for fetching many more records?

JesseFarebro commented 1 year ago

@gorarakelyan I haven't really pushed things to their limits. I'll give you my average use case.

In RL, say I just benchmarked 5 methods on 57 environments with 5 seeds each. I'll have 200 points per run that I want to load into a DataFrame to plot with Seaborn. That's 285,000 points alone. In addition, I'll need all the metadata associated with these runs. And this is a pretty light use case: a lot of the time, instead of 200 points, I could have thousands of points per run, resulting in millions of points being returned in the DataFrame.
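The point count above works out as follows:

```python
methods, envs, seeds = 5, 57, 5
points_per_run = 200

runs = methods * envs * seeds          # 1,425 runs
total_points = runs * points_per_run   # 285,000 points
print(total_points)  # 285000
```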

Having all this data in a single Dataframe is important for me as it makes plotting a lot easier (if you haven't seen the new Seaborn Object API it's definitely worth checking out, it's amazing).

gorarakelyan commented 1 year ago

@JesseFarebro is there a use case for retrieving all of the thousands of points at once? Could an alternative approach, like sampling or range-querying the points and exporting a DataFrame of the selected points, be considered here?