adriangb opened this issue 4 months ago
Those issues with predicate pushdowns need to be fixed upstream though. I think option 1 is the least intrusive and allows you to make datafusion-python an optional dependency, so I think that could work.
Option 2 is a no-go. With option 3, at that point you should just create a new library that's maintained separately.
Agreed, those things need to be fixed upstream. It's unfortunate that both Polars and DataFusion are currently broken; I haven't tried DuckDB. So it's not really possible to compare Delta Lake vs. Hive or other alternatives at the moment.
I can open an issue in datafusion-python about exposing TableProvider.
Hey @adriangb! :) Do you have an example of how a Python `TableProvider` can be passed into datafusion-python? I looked into this briefly but couldn't find any interface for it in their docs/code.
Created this issue btw: https://github.com/apache/datafusion-python/issues/823
What I had suggested to @ion-elgreco in Slack was to provide a SQL interface in Python: the Python layer passes DataFusion SQL through and gets back `RecordBatch` objects it can do something else with. I have similar, fairly basic needs that this would meet. Would that be useful @adriangb?
Yeah I think a SQL layer would be a great start! That should be zero extra deps for delta-rs.
If there was some way to access the entire DataFusion APIs that would be nice. But I don't think that's possible right now with the state of PyO3 and sharing data between Rust extension modules.
@adriangb - have you tested pyarrow predicate pushdown on a recent release?
I think this PR solved the issue.
I recently discovered that both Polars and datafusion-python do not push down timestamp predicates correctly for pyarrow datasets. This is problematic: filtering on a timestamp is a very, very common use case. I suspect both libraries can implement/fix it, but for DataFusion, going from DataFusion (inside delta-rs) -> pyarrow -> back to DataFusion (in datafusion-python) seems like unnecessary overhead and, evidently, somewhat brittle. And since the failure is silent it took me weeks to discover; I only found it because I noticed that using deltalake was a lot slower than just accessing the raw Parquet data for certain queries.
Would the maintainers of this crate be opposed to offering a direct integration with DataFusion, since DataFusion is already used internally? Ideally this would be some sort of extra or plugin, but sadly that's not really possible with the current state of PyO3 extensions.
Some ideas:

- A `deltalake-datafusion` package (which you wouldn't be able to mix and match with plain `deltalake`, e.g. passing a `deltalake.DeltaTable` between them; you'd have to make a `deltalake_datafusion.DeltaTable`, although the latter could maybe extract information from the former or call it via the Python APIs).