Open ion-elgreco opened 1 month ago
I'm trying to think through how best to support this. From preliminary investigation into how we're handling arrow dataset, we create a table provider in src/dataset.rs
but a fair amount of the actual interface code is in src/dataset_exec.rs
. I suspect what we need to do is implement a PyCapsule approach and do some checking to ensure the versions are compatible. I still need to investigate further, but this seems like a useful thing to do.
For Table Provider, what I've been investigating is how we could do something like register_table_provider
. But even scratching the surface of making this stable across versions is a deep rabbit hole. Suppose I only wanted to support scan
in a Table Provider. I need to export a stable version of Session
, Expr
, and ExecutionPlan
. But then Session
and ExecutionPlan
require even more.
On the one hand, I can see wanting to just implement register_delta_table
and add code specific to the python interface on delta-rs. That might be the fastest way to get us where we want to be but it's not the best long-term solution probably.
For Table Provider, what I've been investigating is how we could do something like
register_table_provider
. But even scratching the surface of making this stable across versions is a deep rabbit hole. Suppose I only wanted to supportscan
in a Table Provider. I need to export a stable version ofSession
,Expr
, andExecutionPlan
. But thenSession
andExecutionPlan
require even more.On the one hand, I can see wanting to just implement
register_delta_table
and add code specific to the python interface on delta-rs. That might be the fastest way to get us where we want to be but it's not the best long-term solution probably.
A register_delta_table
in Python Datafusion? That might bloat your binary since you include delta-rs and it will limit your Datafusion version.
I think pola-rs has built their plugins around a C interface, haven't dived to deep in their internals yet
No, sorry, I wasn't clear in what I was suggesting. I was wondering if we should expose something along the lines of register_delta_table
and then provide a FFI stable interface that narrowly supports the needs of delta-rs. As I'm diving in deeper I'm discovering more difficulties.
I think this topic is complex. Since different versions of the rust compiler and different versions of datafusion would all lead to different binary layouts, it's really not as simple as exposing register_table_provider
. One way around that could be to create a compiler dependency for any official wheels - but then we're adding a new requirement on delta-rs which seems fragile long term, especially if you want to update rustc versions (or we do and you want to keep them where they are).
Lowest level effort to get this up and running would be the last idea. Maybe that's okay? I'd hate to add that kind of build dependency. Also anyone who is developing would have to make sure they either use the same or build both wheels locally.
My thoughts are a little jumbled.
I've done some additional testing with mixed success.
In this approach we basically just expose a function register_table_provider
that takes a Arc<dyn TableProvider>
and put this in a simple structure we can expose via PyCapsule.
datafusion-python
. It's really a matter of exposing the PyCapsule and puts all of the responsibility of making it work on the consumer.datafusion-python
but they must also use the same arrow dependencies and the same compiler version to get compatible binaries. This means that every version of delta-rs would have to update in lock step with datafusion - including sub dependencies AND compiler version. From a release perspective this means users would have to be specific about which versions of datafusion and delta-rs packages they use, leading to a lot of difficult edge cases.In this approach we define a true FFI friendly Table Provider. We expose a PyCapsule with this table provider.
Session
as well, which then exposes a lot more. Now, if we could make some argument about only allowing external table providers that do not require the session that would vastly simplify the problem but I do see that delta-rs is using things like the runtime environment and the session config.I have each of these approaches working in a minimal fashion.
For the direct expose it is working except I have an odd failure when trying to do a show()
on the data frame. Oddly, I can execute the dataframe and do count()
and other operations. There is some odd dependency along the line that is causing some fault during projection that I'm struggling to troubleshoot.
For the FFI table provider, I've got the round trip working where we can get the schema from the table provider through FFI and I intentionally built them in different compiler modes to ensure the internal representations differed. The part I'm stuck on here is how much would have to be exposed to get all required functions of TableProvider
working.
I'm open to thoughts and suggestions.
Approach 2 looks to be most promising since we could have many different versions of deltalake be compatible with Datafusion-python.
Regarding session, what is exactly configurable from python? Wouldn't those config settings be easy to pass across?
I'm in favor of the second approach also, but the concern is the depth of options that have to be exposed. I suppose instead of trying to expose Session
we can limit it to SessionConfig
that would work, especially since we can pass a set of strings and create a SessionConfig
from them. The other one it seems like delta-rs uses is RuntimeEnv
. I don't know how that's used.
I don't know if this is useful, but in here you have a POC creating a PyTableProvider, exposing part of TableProvider trait to Python, the same way is done for UDAF (Accumulator). The idea would be to wrap any Python object that returns RecordBatch(es). This goes through Python, though.
In Python the dev would have to do something like this
That's a really good idea. I was thinking of it entirely from the direction of exposing the api but maybe what we should be doing is leaning on going through python like you suggest.
Is your feature request related to a problem or challenge? Please describe what you are trying to do. In delta-rs we have a TableProvider on DeltaTable on the rust side, we would like to leverage this in datafusion-python, so that we can make use the scanning capabilities of datafusion. However I don't see an API where we can register a table which as TableProvider in rust
Describe the solution you'd like Provide a means to allow registering Tables that implement TableProvider in rust through python
Describe alternatives you've considered Couldn't find any.
Related issues https://github.com/delta-io/delta-rs/issues/1204