MrPowers commented 9 months ago

Description

pandas users currently have to suffer with a difficult interface to query a Delta table. They need to manually perform column pruning and specify predicates in DNF form... before writing the pandas query with the same filtering logic.

Here's an example query:

(
    DeltaTable(f"{Path.home()}/data/deltalake_baseline_G1_1e9_1e2_0_0", version=0)
    .to_pandas(filters=[("id1", "==", "id016")], columns=["id1", "id2", "v1"])
    .astype({"v1": "int64"})
    .query("id1 == 'id016'")
    .groupby("id2")
    .agg({"v1": "sum"})
)

It could be nice to expose a DeltaTable#query method that is powered by DataFusion and gives pandas users a better experience. This could be the alternate logic.

(
    DeltaTable(f"{Path.home()}/data/deltalake_baseline_G1_1e9_1e2_0_0", version=0)
    .query("select id1, sum(v1) where id1 = 'id016' group by id2")
    .to_pandas()
)

This would help the pandas users avoid manually specifying the column pruning and predicate pushdown filtering logic. This syntax would also allow for this syntax, which would also be better:

(
    DeltaTable(f"{Path.home()}/data/deltalake_baseline_G1_1e9_1e2_0_0", version=0)
    .query("select id1, id1, v1 where id1 = 'id016'")
    .to_pandas()
    .groupby("id2")
    .agg({"v1": "sum"})
)

roeap commented 9 months ago

With the gradual move to logical plans, we are also moving closer and closer to being able properly expose a SQL API on the rust level as well. After which exposing this in Python should be straight forward :).

There is however some work to be done before we can get that to work. IF we want something quicker in python we could likely find a way by leveraging some other library. We would of course need to be careful not to loose pushdown optimizations etc...

MrPowers commented 9 months ago

@roeap - no rush on this one. But it would be a real game changer the more I think about it.

roeap commented 9 months ago

it would be a real game changer the more I think about it.

@MrPowers - agreed.

Looked into it a little bit and as it turns out, it is more feasible then I thought. Most critical I think is to come up with an API that we want to commit to.

There are also two related things that we can deal with separately:

1) executing queries 2) specifying expressions (partition filters etc.)

Since we already make use of a full blown SQL engine (datafusion) we already have the capability to execute SQL and just need to expose an API. However executing SQL always (AFAIK) requires some sort of session, which we have tried to avoid so far.

looking at the example above, when we do

DeltaTable(...).query("select * where id1 = 'id016'")

what we are actually executing is

select * from <some relation name> where id1 = 'id016'

which would require us to inject some missing parts into the query. This is quite error prone as users might come up with complex queries.

In polars they introduced SQLContext which seems to be an "ad-hoc" session where you register relations.

Without having thought about it too deeply, just throwing some spontaneous thoughts out there on what abstractions we may offer ... DeltaLog, DeltaTable, DeltaLake.

DeltaLog would be targeted at users that want to inspect the log to answer questions like - "is there a new version available", "what files are included in the log", e.t.c.. Some methods currently on DeltaTable - e.g. files, file_uris, get_add_actions - and some new ones, like peek_next_commit could be exposed there.

This could of course easily be exposed on DeltaTable

class DeltaTable:
  def log(self) -> DeltaLog: ...

but also created separately, to avoid doing an actual scan of the log to get protocol and metadata (some users asked for this specifically).

DeltaLake would be our session-like thing and allow you to register tables and or catalogs as well as execute queries.

class DeltaLake:
  def register_table(self, table: PathLike | DeltaTable): ...
  def register_catalog(self, ...):
  def sql(self, query: str) -> pyarrow.Table: ...

Again, just throwing things out there for discussion :)

djouallah commented 9 months ago

just expose a SQL interface, for Python API no one seems to agree on anything !!!

MrPowers commented 9 months ago

Great points @roeap!

Here's the syntax for querying a table with DataFusion:

from datafusion import SessionContext
ctx = SessionContext()
table = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
ctx.register_dataset("my_dataset", table.to_pyarrow_dataset())

I like the idea of this user interface:

table = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
table.register_table("my_dataset")
table.query("select id1, id1, v1 from my_dataset where id1 = 'id016'")

Let me know what you think!

emcake commented 9 months ago

Other options may be to follow the DuckDB pattern of letting you name the relation:

table = DeltaTable(...)
table.sql("my_name", "select * from my_name where id = 'foo'")

Or kwargs style parsing:

delta_table_a = DeltaTable(...)
delta_table_b = DeltaTable(...)

deltalake.sql("select a.id, a.foo, b.bar from relation_a a join relation_b b on a.id = b.id", relation_a = delta_table_a, relation_b = delta_table_b")

MrPowers commented 9 months ago

@emcake - I don't like the DuckDB pattern of letting you name the relation, but the kwargs style parsing is a great idea. Spark just added this too. This would be really nice:

table = DeltaTable(f"{pathlib.Path.home()}/data/delta/G1_1e8_1e2_0_0")
table.query("select id1, id1, v1 from {my_dataset} where id1 = 'id016'", my_dataset=table)

roeap commented 9 months ago

Personally I like the kwargs-syle idea very much! its clean, and users can make the decision right where they need to ...

The thing that still confuses me though is the hierarchy ... in @emcake's example we are assuming we have to higher level DeltaLake thing, right? since its deltalake.sql(...)

@MrPowers - i very much understand the urge to keep things simple and flat, but this notion would always require a reference to itself, so I could not write something like ...

result = DeltaTable(...).sql("select ...", relation=<what goes here?>)

which would work with @emcake first suggestion

result = DeltaTable(...).sql("my_table", "select ...")

If we brings back catalogues into python (and I hope we do) I think we will need a notion of something above table, since a catalogue usually contains more then one table.

emcake commented 9 months ago

The thing that still confuses me though is the hierarchy ... in @emcake's example we are assuming we have to higher level DeltaLake thing, right? since its deltalake.sql(...)

I was imagining just exposing it as a function in the package. In a sense it's nicer that way than doing it on a particular table, because it allows you to specify a query with multiple tables.

roeap commented 9 months ago

I was imagining just exposing it as a function in the package

ah that actually makes a lot of sense, and is very much in line with what we are doing with write_deltalake ---

djouallah commented 9 months ago

how about write, insert , merge, will those scenarios be supported ?

delta-io / delta-rs

Possibly expose a DeltaTable#query method #2115

Description