datafusion-contrib / datafusion-python

Python binding for DataFusion
https://arrow.apache.org/datafusion/python/index.html
Apache License 2.0

Support custom TableProvider #45

Closed. jychen7 closed this issue 2 years ago

jychen7 commented 2 years ago

Background

I would like to use datafusion-python to query Bigtable. In Rust, datafusion-bigtable implements BigtableDataSource as a custom TableProvider.

Problem

I tried to add register_table in https://github.com/datafusion-contrib/datafusion-python/pull/46 and to expose a Python BigtableTable from datafusion-bigtable in https://github.com/datafusion-contrib/datafusion-bigtable/pull/3.

The problem is how to convert the Python BigtableTable into a Python Table. Put differently, how can a Rust TableProvider be serialized to, and deserialized from, some Python object?

classDiagram
    BigtableTable_Python <|-- PyBigtableTable_Rust
    Table_Python <|-- PyTable_Rust
    TableProvider_Rust <|-- BigtableDatasource_Rust
    TableProvider_Rust <|-- ListingTable_Rust
    ListingTable_Rust <|-- CSV
    ListingTable_Rust <|-- Parquet
    ListingTable_Rust <|-- JSON
    ListingTable_Rust <|-- Avro
    class BigtableTable_Python{
    }
    class PyBigtableTable_Rust{
        table: TableProvider_Rust
    }
    class Table_Python{
    }
    class PyTable_Rust{
        table: TableProvider_Rust
    }

The following example does not work, because bigtable.table() returns a Rust TableProvider, which has no corresponding Python object.

import pytest
from datafusion import ExecutionContext
from datafusion._internal import Table as DatafusionTable
from datafusion_bigtable import BigtableTable

@pytest.fixture
def df_table():
    bigtable = BigtableTable(
        project="emulator",
        xxx
    )
    # Fails: bigtable.table() is a Rust TableProvider with no Python counterpart.
    return DatafusionTable(bigtable.table())

jychen7 commented 2 years ago

I believe this can work using PyO3 class inheritance (https://pyo3.rs/v0.15.1/class.html?highlight=inheri#inheritance), so I am closing this for now.

jychen7 commented 2 years ago

I tried both with and without inheritance. Both versions compile, but pytest still reports an error.

With inheritance, ctx.register_table("weather_balloons", bigtable_table) raises TypeError: argument 'table': 'BigtableTable' object cannot be converted to 'Table' https://github.com/datafusion-contrib/datafusion-bigtable/blob/014d02f26800402d37638113948d07197fb7b201/python/src/datasource.rs#L11-L12

Without inheritance, ctx.register_table("weather_balloons", bigtable_table.to_pytable()) raises TypeError: argument 'table': 'Table' object cannot be converted to 'Table' https://github.com/datafusion-contrib/datafusion-bigtable/blob/fb2c794a33b5ee9234f7a9e24f2afebc7e17a7fb/python/src/datasource.rs#L56-L58


I also tried calling register_csv and then passing the resulting PyTable to register_table as t1, and that works. The odd part is that, in the following log, t1 and t2 appear to have the same class/type, yet register_table fails for t2.

(Pdb) ctx.register_csv("temp", "/path/to/temp.csv")

(Pdb) t1 = ctx.catalog().database("public").table("temp")
(Pdb) t1
<datafusion.Table object at 0x1055086f0>
(Pdb) ctx.register_table("t1", t1)
(Pdb) ctx.tables()
{'t1', 'temp'}

(Pdb) t2 = bigtable_table.to_pytable()
(Pdb) t2
<datafusion.Table object at 0x1055085a0>
(Pdb) ctx.register_table("t2", t2)
*** TypeError: argument 'table': 'Table' object cannot be converted to 'Table'
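
One way to check whether t1 and t2 really share a class, rather than only a class name, is to compare the type objects themselves. The following is a minimal sketch using only Python builtins. The helpers type_fingerprint and same_type_object are hypothetical names, not part of datafusion or PyO3, and the two Table classes are stand-ins built by hand for illustration.

```python
def type_fingerprint(obj):
    """Return (module, qualified name, identity) of obj's class."""
    cls = type(obj)
    return (cls.__module__, cls.__qualname__, id(cls))


def same_type_object(a, b):
    """True only if a and b are instances of the very same class object."""
    return type(a) is type(b)


# Stand-ins (assumption) for the two Table objects in the log: two
# separately created classes that share a module name and a class name.
TableA = type("Table", (), {"__module__": "datafusion"})
TableB = type("Table", (), {"__module__": "datafusion"})
t1, t2 = TableA(), TableB()

# The printable parts of the fingerprint are identical...
print(type_fingerprint(t1)[:2] == type_fingerprint(t2)[:2])
# ...but the class objects themselves are distinct.
print(same_type_object(t1, t2))
```

In the pdb session above, running `type(t1) is type(t2)` would be the equivalent check. If it returned False despite the identical repr, that would point to two distinct type objects with the same name.
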
jychen7 commented 2 years ago

@Jimexist, sorry to bother you. Do you have any idea how to resolve the type conversion error in https://github.com/datafusion-contrib/datafusion-python/issues/45#issuecomment-1087051568? I am not sure whether it is a limitation of pyo3 or whether I am missing something. It seems almost there.

jychen7 commented 2 years ago

It looks like this is not supported in pyo3. According to https://github.com/PyO3/pyo3/issues/1444, even though datafusion-bigtable uses PyTable from datafusion-python, after compilation pyo3 treats the two PyTable types as different types. The issue explains:

The key issue is that #[pyclass] stores the pyclass type object in static storage. This means that (if Rust's usual rlib linkage is used) packages A and B will have their own copies of the MyClass type object, and Python will think that they're actually different types coming from the two packages.
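
The effect described in that quote can be imitated in pure Python. This is only an analogy under stated assumptions, not how PyO3 is implemented: build_extension is a hypothetical factory that stands in for compiling one extension module, and each call creates its own copy of a Table class, much as each package would carry its own copy of the type object.

```python
def build_extension():
    """Simulate one compiled package holding its own Table type object."""

    class Table:
        # Both copies advertise the same module and name, so their
        # reprs are indistinguishable apart from the address.
        __module__ = "datafusion"
        __qualname__ = "Table"

    def register_table(name, table):
        # The check is against THIS package's copy of Table only.
        if not isinstance(table, Table):
            raise TypeError(
                f"argument 'table': '{type(table).__name__}' object "
                "cannot be converted to 'Table'"
            )
        return name

    return Table, register_table


TableA, register_a = build_extension()  # stands in for datafusion-python
TableB, _ = build_extension()           # stands in for datafusion-bigtable

register_a("t1", TableA())  # accepted: same type object
try:
    register_a("t2", TableB())  # rejected: same name, different type object
except TypeError as err:
    print(err)
```

The second call fails with a message of the same shape as the one in the pdb log, because the instance check compares type objects rather than names.
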