elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

Expose the Polars SQL API? #818

Open cigrainger opened 6 months ago

cigrainger commented 6 months ago

https://crates.io/crates/polars-sql
https://docs.rs/polars-sql/0.36.2/polars_sql/fn.sql_expr.html
https://docs.pola.rs/py-polars/html/reference/sql

I'm not a huge fan of the Python Polars approach here, with explicitly registering frames etc. It feels very... Spark-ish to me.

I like the way dplyr allows you to just write sql(...). https://dbplyr.tidyverse.org/articles/sql.html

We could have Explorer.DataFrame.sql(df|lf, "statement").
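A rough sketch of what that call site might look like (purely illustrative: `Explorer.DataFrame.sql/2` does not exist, and the `self` table name is borrowed from how py-polars addresses the current frame):

```elixir
# Hypothetical API sketch -- sql/2 is the proposed function, not a real one.
# The statement would be handed to the backend's SQL engine as-is.
df = Explorer.DataFrame.new(a: [1, 2, 3], b: ["x", "y", "z"])

result =
  Explorer.DataFrame.sql(df, """
  SELECT a, b
  FROM self
  WHERE a > 1
  """)
```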

I'm very open to differing opinions on this!

josevalim commented 6 months ago

At first I am -1, unless we have a very strong use case for this. Explorer is meant to be backend agnostic, and we work hard for this to be the case, but the SQL semantics here would necessarily be tied to Polars.

cigrainger commented 6 months ago

It definitely is a big win for OLAP use cases. I'm not sure I understand why we'd be tied to Polars SQL semantics here -- for an explorer_sql backend it can just be passed through directly. Backends don't necessarily have to support it. And it can be documented that the accepted SQL queries depend on the backend.

josevalim commented 6 months ago

In explorer_sql, SQL is how you load data into memory. Polars SQL is a language for manipulating an existing in-memory dataframe. So they sit at different levels. And if I use Polars SQL to manipulate a Polars DataFrame, I won't be able to use it to manipulate an Explorer SQL dataframe.

cigrainger commented 6 months ago

Ah but that's not how I'm imagining explorer_sql -- the idea is not just to load the data into memory (in which case there's no point -- just use ADBC or whatever else) but to translate Explorer functions to SQL and run the queries on the database itself (compute) until you pull the results down (collect). dbplyr uses temp tables. You can also write to actual tables.

I don't see a problem with the SQL syntax being backend dependent -- since it's passing through it would be handled by the backend anyway. This is also how dbplyr works if I'm not mistaken.

Re: compute and collect from dbplyr:

> compute() stores results in a remote temporary table. collect() retrieves data into a local tibble. collapse() is slightly different: it doesn't force computation, but instead forces generation of the SQL query. This is sometimes needed to work around bugs in dplyr's SQL generation.
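In Explorer terms, the split dbplyr describes might map onto something like this (illustrative only: `compute/1` on a SQL backend is hypothetical, and `remote_frame` stands in for a lazy frame backed by a database table):

```elixir
# Hypothetical explorer_sql flow: operations build SQL; nothing runs locally.
lf =
  remote_frame                         # lazy frame backed by a database table
  |> Explorer.DataFrame.filter(a > 1)  # translated to SQL, not executed
  |> Explorer.DataFrame.compute()      # hypothetical: materialize in a remote temp table

# Only collect/1 pulls the results down into a local dataframe.
local_df = Explorer.DataFrame.collect(lf)
```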

josevalim commented 6 months ago

Interesting. That would indeed close the gap, so I am ok with going down this path, although I wonder if we should tackle Explorer SQL first to have a better idea of how it will all connect. :)

cigrainger commented 6 months ago

I'm cool with that! Though I'm pretty adamant that in explorer_sql this function would basically just act as a transparent 'passthrough' escape hatch, letting you run raw SQL against the database directly.

billylanchantin commented 6 months ago

I'm super +1 on SQL interop, and if either exposing the Polars SQL API or working on Explorer SQL is the path forward, then I'm behind it.

I admit I'm still unclear on:

Wrapping my head around those things is my personal barrier to +1'ing either specific approach. I'm not against either! I just don't think I have all the facts yet.

In particular, I think outlining a few specific use cases with pseudo-code would be instructive. I can't tell, for instance, how lazy frames play with this feature.

josevalim commented 5 months ago

> I don't see a problem with the SQL syntax being backend dependent -- since it's passing through it would be handled by the backend anyway. This is also how dbplyr works if I'm not mistaken.

I have just realized that we don't use exactly the same names as Polars. So if we expose polars_sql, it would start to feel very janky. For example, nil vs null, and probably several others. So ultimately, I believe I am -1 on this approach. If we want to go down this road, it may make more sense to introduce our own SQL dialect that compiles down to dataframe operations.
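To illustrate the mismatch (the `sql/2` call below is the hypothetical function under discussion, not a real API): the same missing-value filter would be spelled with Elixir's `nil` in Explorer's query syntax but with SQL's `NULL` in a passed-through Polars SQL string.

```elixir
# Explorer's own query syntax uses Elixir conventions -- nil, is_nil/1:
Explorer.DataFrame.filter(df, is_nil(a))

# A passed-through Polars SQL statement would use SQL's NULL instead
# (hypothetical sql/2; keyword semantics belong to Polars SQL, not Explorer):
Explorer.DataFrame.sql(df, "SELECT * FROM self WHERE a IS NULL")
```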