OSS-Latam / df-metrics

MIT License

Metrics Computing: DataFrame API #7

Open brayanjuls opened 3 months ago

brayanjuls commented 3 months ago

Design an API that lets us support multiple DataFrame libraries (polars, Spark, pandas, etc.) and convert them to the native DataFrame API of the chosen processing engine.

brayanjuls commented 3 months ago

The initial objective was to support multiple DataFrame APIs on top of a single backend for query execution. From a UX point of view this doesn't make sense: a user who already processes their data with a different execution engine would be forced to run two backends just to use our library. Additionally, to support conversion between, e.g., polars and DataFusion, both would need to implement the Substrait format, which is not the case. Given a recent conversation in the polars project (see issue 7404), Substrait support does not seem to be planned for the near future, and neither Apache Spark (stuck PR) nor Apache Flink supports the format yet.

One alternative is to support multiple backends along with the frontend: if the user works with polars, we would express and compute the metrics in polars, and likewise for every other query engine or tool. This would require more work, but it would provide a better UX for the end user and reduce the complexity of the implementation.

brayanjuls commented 3 months ago

This is an image of what it would look like:

image