apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

Show documentation how to use Delta table #414

Closed djouallah closed 2 weeks ago

djouallah commented 1 year ago

I think Delta rust is using Datafusion internally, but I could not find any documentation on how to use a Delta table with Python datafusion.

wjones127 commented 1 year ago

> I think Delta rust is using Datafusion internally

There are three senses in which we integrate with DataFusion:

  1. We use DataFusion components inside of our own functions.
  2. We have a plugin for Rust DataFusion, but that can only be used from Rust.
  3. We can export PyArrow datasets, which datafusion-python can read.

It's only the third one that applies to this library.

> I could not find any documentation on how to use a Delta table with Python datafusion

Our integration with Python DataFusion works the same way as our integration with DuckDB: create a PyArrow dataset, register it with DataFusion, and query it as desired.

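from datafusion import SessionContext
from deltalake import DeltaTable

# Create a DataFusion context
ctx = SessionContext()

# Open the Delta table and expose it as a PyArrow dataset
delta_table = DeltaTable("path/to/your/table")

# register_dataset takes the table name first, then the dataset
ctx.register_dataset("my_table", delta_table.to_pyarrow_dataset())

# Query the registered table with SQL
df = ctx.sql("SELECT * FROM my_table")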

djouallah commented 1 year ago

I see. I think it was wishful thinking on my side: I imagined datafusion somehow using Delta tables as native storage with full integration. I see that's not the case :(

wjones127 commented 1 year ago

Yeah, to integrate like that we'd have to bundle the compiled delta-rs code within the datafusion-python wheels, which would make them quite large.

djouallah commented 1 year ago

@wjones127 so basically what you are saying is that it is up to datafusion to bundle delta-rs if they are interested?