adriangb opened this issue 4 months ago
Those issues with predicate pushdowns need to be fixed upstream though. I think option 1 is the least intrusive and allows you to make datafusion-python an optional dependency, so I think that could work.
Option 2 is a no-go. With option 3, at that point you should just create a new library that's maintained separately.
Agreed, those things need to be fixed upstream. It's unfortunate that both Polars and DataFusion are currently broken; I haven't tried DuckDB. So it's not really possible to compare Delta Lake vs. Hive or other alternatives at the moment.
I can open an issue in datafusion-python about exposing TableProvider.
Hey @adriangb! :) Do you have an example of how a Python `TableProvider` can be passed into datafusion-python? I looked into this briefly but couldn't find any interface for it in their docs/code.
Created this issue btw: https://github.com/apache/datafusion-python/issues/823
What I had suggested to @ion-elgreco in Slack was to provide a SQL interface in Python: the Python layer passes DataFusion SQL through and gets back `RecordBatch` objects it can do something else with. I have similar, fairly basic needs that this would meet. Would that be useful @adriangb?
Yeah I think a SQL layer would be a great start! That should be zero extra deps for delta-rs.
If there was some way to access the entire DataFusion APIs that would be nice. But I don't think that's possible right now with the state of PyO3 and sharing data between Rust extension modules.
@adriangb - have you tested pyarrow predicate pushdown on a recent release?
I think this PR solved the issue.
I recently discovered that both Polars and datafusion-python do not push down timestamp predicates correctly for pyarrow datasets. This is problematic: filtering on a timestamp is a very, very common use case. I suspect both libraries can implement/fix it, but for DataFusion, going from DataFusion (inside delta-rs) -> pyarrow -> back to DataFusion (in datafusion-python) seems like unnecessary overhead and, evidently, somewhat brittle. And since the failure is silent it took me weeks to discover; I only found it because I noticed that using deltalake was a lot slower than just accessing the raw Parquet data for certain queries.
Would the maintainers of this crate be opposed to offering a direct integration with DataFusion, since DataFusion is already used internally? Ideally this would be some sort of extra or plugin, but sadly that's not really possible with the current state of PyO3 extensions.
Some ideas:

- A `deltalake-datafusion` package (which you wouldn't be able to mix and match with plain `deltalake`, e.g. passing a `deltalake.DeltaTable` between them; you'd have to make a `deltalake_datafusion.DeltaTable`, although the latter could maybe extract information from the former or call it via the Python APIs).