apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

Expose API to register a foreign TableProvider #823

Open ion-elgreco opened 1 month ago

ion-elgreco commented 1 month ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. In delta-rs we have a TableProvider on DeltaTable on the Rust side, and we would like to leverage this in datafusion-python so that we can make use of the scanning capabilities of DataFusion. However, I don't see an API where we can register a table that has a TableProvider in Rust.

Describe the solution you'd like Provide a means to register, through Python, tables that implement TableProvider in Rust.

Describe alternatives you've considered Couldn't find any.

Related issues https://github.com/delta-io/delta-rs/issues/1204

timsaucer commented 1 month ago

I'm trying to think through how best to support this. From a preliminary investigation into how we handle the Arrow dataset: we create a table provider in src/dataset.rs, but a fair amount of the actual interface code is in src/dataset_exec.rs. I suspect what we need to do is implement a PyCapsule approach and do some checking to ensure the versions are compatible. I still need to investigate further, but this seems like a useful thing to do.
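To make the PyCapsule idea concrete, here is a minimal stdlib-only sketch of the compatibility check, using CPython's capsule C API through `ctypes`. The capsule name (`datafusion_table_provider_v1`) and the stand-in pointer are illustrative assumptions, not the actual design; in the real case the pointer would address a Rust struct and the name would encode the exported ABI version.

```python
import ctypes

# Declare the CPython capsule C API via ctypes (stdlib only).
PyCapsule_New = ctypes.pythonapi.PyCapsule_New
PyCapsule_New.restype = ctypes.py_object
PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]

PyCapsule_IsValid = ctypes.pythonapi.PyCapsule_IsValid
PyCapsule_IsValid.restype = ctypes.c_int
PyCapsule_IsValid.argtypes = [ctypes.py_object, ctypes.c_char_p]

# Hypothetical versioned capsule name; a consumer rejects unknown versions.
NAME = b"datafusion_table_provider_v1"

# Stand-in pointer; in the real design this would address a Rust-side struct.
capsule = PyCapsule_New(ctypes.c_void_p(0xDEAD), NAME, None)

# The capsule only validates under the exact name it was created with, so an
# incompatible producer/consumer pair fails fast instead of dereferencing
# a pointer with the wrong layout.
compatible = PyCapsule_IsValid(capsule, NAME)
incompatible = PyCapsule_IsValid(capsule, b"datafusion_table_provider_v2")
```

The name-matching behavior is what gives the version check for free: bumping the suffix on any breaking layout change makes old consumers reject the capsule cleanly.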

timsaucer commented 1 month ago

For Table Provider, what I've been investigating is how we could do something like register_table_provider. But even scratching the surface of making this stable across versions is a deep rabbit hole. Suppose I only wanted to support scan in a Table Provider: I would need to export a stable version of Session, Expr, and ExecutionPlan. But then Session and ExecutionPlan require even more.

On the one hand, I can see wanting to just implement register_delta_table and add code specific to the Python interface on delta-rs. That might be the fastest way to get us where we want to be, but it's probably not the best long-term solution.

ion-elgreco commented 1 month ago

> For Table Provider, what I've been investigating is how we could do something like register_table_provider. But even scratching the surface of making this stable across versions is a deep rabbit hole. Suppose I only wanted to support scan in a Table Provider. I need to export a stable version of Session, Expr, and ExecutionPlan. But then Session and ExecutionPlan require even more.
>
> On the one hand, I can see wanting to just implement register_delta_table and add code specific to the python interface on delta-rs. That might be the fastest way to get us where we want to be but it's not the best long-term solution probably.

A register_delta_table in datafusion-python? That might bloat your binary, since you would have to include delta-rs, and it would pin your DataFusion version.

I think pola-rs has built their plugins around a C interface; I haven't dived too deep into their internals yet.

timsaucer commented 4 weeks ago

No, sorry, I wasn't clear in what I was suggesting. I was wondering if we should expose something along the lines of register_delta_table and then provide an FFI-stable interface that narrowly supports the needs of delta-rs. As I'm diving in deeper, I'm discovering more difficulties.

I think this topic is complex. Since different versions of the Rust compiler and different versions of DataFusion would all lead to different binary layouts, it's really not as simple as exposing register_table_provider. One way around that could be to pin a compiler version for any official wheels, but then we're adding a new requirement on delta-rs, which seems fragile long term, especially if you want to update rustc versions (or we do and you want to keep them where they are).

The lowest-effort way to get this up and running would be that last idea. Maybe that's okay? But I'd hate to add that kind of build dependency. Also, anyone who is developing would have to make sure they either use the same toolchain or build both wheels locally.

My thoughts are a little jumbled.

timsaucer commented 3 weeks ago

I've done some additional testing with mixed success.

Approach 1: Direct Expose

In this approach we basically just expose a function register_table_provider that takes an Arc<dyn TableProvider>, and we put this in a simple structure we can expose via PyCapsule.

Approach 2: Create FFI Table Provider

In this approach we define a truly FFI-friendly Table Provider and expose a PyCapsule containing it.
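The "FFI-friendly" part can be illustrated with a stdlib-only `ctypes` sketch: a plain C-ABI struct of function pointers plus a version field, so both sides agree on the layout even when built with different compilers or compiler modes. The field names and the serialized-schema string are hypothetical, not DataFusion's actual FFI design.

```python
import ctypes

# Hypothetical C-ABI vtable for a minimal FFI table provider (illustration
# only): plain function pointers, no Rust trait objects crossing the boundary.
SchemaFn = ctypes.CFUNCTYPE(ctypes.c_char_p)

class FFITableProvider(ctypes.Structure):
    _fields_ = [
        ("version", ctypes.c_uint32),  # bump on any layout change
        ("schema", SchemaFn),          # returns a serialized schema
    ]

def _schema_impl() -> bytes:
    # Stand-in for a real serialized Arrow schema.
    return b"id: Int64, value: Utf8"

# Keep the callback object alive for as long as the struct references it.
_schema_cb = SchemaFn(_schema_impl)
provider = FFITableProvider(version=1, schema=_schema_cb)

schema = provider.schema()
```

Each additional TableProvider capability (scan, filters, projections) would mean another function pointer in the vtable, which is exactly the "how much would have to be exposed" question raised below.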

Evaluation

I have each of these approaches working in a minimal fashion.

For the direct expose, it is working except for an odd failure when trying to do a show() on the DataFrame. Oddly, I can execute the DataFrame and do count() and other operations. There is some odd dependency along the line that is causing a fault during projection that I'm struggling to troubleshoot.

For the FFI table provider, I've got the round trip working, where we can get the schema from the table provider through FFI, and I intentionally built the two sides in different compiler modes to ensure the internal representations differed. The part I'm stuck on here is how much would have to be exposed to get all the required functions of TableProvider working.

I'm open to thoughts and suggestions.

ion-elgreco commented 3 weeks ago

Approach 2 looks to be the most promising, since it would let many different versions of deltalake be compatible with datafusion-python.

Regarding Session, what exactly is configurable from Python? Wouldn't those config settings be easy to pass across?

timsaucer commented 3 weeks ago

I'm in favor of the second approach also, but the concern is the depth of options that have to be exposed. I suppose that instead of trying to expose Session, we could limit it to SessionConfig; that would work, especially since we can pass a set of strings and create a SessionConfig from them. The other one delta-rs seems to use is RuntimeEnv. I don't know how that's used.
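The strings-only idea above sidesteps the layout problem entirely: key/value strings cross an FFI boundary safely, struct layouts do not. A minimal sketch of the round trip, with illustrative DataFusion setting keys (the serialization helpers are hypothetical, not an existing API):

```python
# Serialize session settings as plain byte-string pairs that can cross an
# FFI boundary, then rebuild a SessionConfig-like mapping on the other side.
def serialize_config(settings: dict) -> list:
    return [(k.encode(), v.encode()) for k, v in settings.items()]

def rebuild_config(pairs: list) -> dict:
    return {k.decode(): v.decode() for k, v in pairs}

settings = {
    "datafusion.execution.batch_size": "8192",
    "datafusion.execution.target_partitions": "4",
}

# Round trip: what the consumer receives equals what the producer sent.
restored = rebuild_config(serialize_config(settings))
```

On the Rust side the restored pairs could feed `SessionConfig::from_string_hash_map` or equivalent, so neither side needs the other's struct layout.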

mesejo commented 3 weeks ago

I don't know if this is useful, but here you have a POC creating a PyTableProvider, exposing part of the TableProvider trait to Python, the same way it is done for UDAFs (Accumulator). The idea would be to wrap any Python object that returns RecordBatch(es). This goes through Python, though.

In Python the dev would have to do something like this
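The original snippet did not survive in this thread; the following is only a hypothetical sketch of the shape such a Python-side provider could take, mirroring how a UDAF Accumulator is subclassed in datafusion-python. The class, its methods, and the dict stand-ins for Schema/RecordBatch are all illustrative.

```python
# Hypothetical Python-side table provider: any object exposing a schema and
# a scan that yields record batches (dicts stand in for pyarrow objects here).
class MyTableProvider:
    def schema(self):
        # Stand-in for a pyarrow.Schema.
        return {"id": "int64", "value": "utf8"}

    def scan(self):
        # Yield stand-ins for pyarrow.RecordBatch objects.
        yield {"id": [1, 2], "value": ["a", "b"]}

provider = MyTableProvider()
batches = list(provider.scan())
```

The Rust PyTableProvider would then call these methods through the GIL, trading FFI-layout stability for a per-batch Python round trip.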

timsaucer commented 3 weeks ago

That's a really good idea. I was thinking of it entirely from the direction of exposing the API, but maybe what we should be doing is leaning on going through Python, as you suggest.