dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

access data after load: load as dataframes with ibis #1095

Open rudolfix opened 3 months ago

rudolfix commented 3 months ago

Background ibis https://github.com/ibis-project/ibis is a library that translates dataframe expressions into SQL statements and then executes them in the destination. They do a nice job of compiling the final SQL statement with sqlglot (so the resulting SQL is probably quite optimized).
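For example, a minimal sketch of what that translation looks like (the table and columns here are made up):

import ibis

# build a lazy dataframe expression against a made-up table
t = ibis.table({"a": "int64", "b": "string"}, name="events")
expr = t.filter(t.a > 10).group_by("b").aggregate(n=t.a.count())

# compile to a SQL statement (via sqlglot) without executing anything
print(ibis.to_sql(expr))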

We have a large overlap in destinations, and we have been looking for a decent dataframe -> SQL layer from the very start. It seems this is it: we can easily build a helper that exposes any dlt dataset as a dataframe, shares credentials, etc.

Implementation We can integrate deeply or via a helper. In the helper case, we let users get an ibis connection from a dlt destination and/or pipeline. The UX would be similar to the dbt helper.

Deep integration means that we expose the loaded data from the Pipeline, DltSource and DltResource instances, i.e.:

pipeline.run(source, destination="bigquery", dataset_name="data")

# duckdb-like interface

# get rows dbapi style
for row in pipeline.dataset.sql("SELECT * FROM table"):
    ...

# get materialized dataframes
for batch in pipeline.dataset.sql("SELECT * FROM table").arrow(batch_size=10000):
    ...

# get lazy dataframe via ibis
ibis_table = pipeline.dataset.df(table_name)
ibis_connection = pipeline.dataset.df()

# we can expose the `dataset` property of the pipeline via source (and internal
# resources as well); in that case we automatically bind to the right schema

ibis_table = resource.dataset.df()
ibis_connection = source.dataset.df()

Implementation is straightforward for SQL-like destinations. We won't support vector databases. It would be really interesting to support the filesystem destination as above, i.e. by registering the JSON and Parquet files in a temporary duckdb database and then exposing that database to the ibis and SQL access methods.
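A minimal sketch of that filesystem idea, assuming Parquet loader files on local disk (paths and table names are made up):

import duckdb

# temporary in-memory database holding only views over the files
con = duckdb.connect()
con.execute(
    "CREATE VIEW my_table AS SELECT * FROM read_parquet('data/my_table/*.parquet')"
)

# the view is now queryable via SQL and could be handed to ibis as well
print(con.execute("SELECT count(*) FROM my_table").fetchall())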

Ibis Connection sharing We are discussing a connection sharing approach with ibis here: https://github.com/ibis-project/ibis/issues/8877. As mentioned in the comments there, we could build it so that we manage the connection and ibis provides backends that accept an open connection and DO NOT need any additional dependencies.
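To illustrate the shape being discussed there (from_connection is hypothetical here, not an existing ibis API at the time of this thread):

import duckdb
import ibis

# dlt creates and manages the native connection...
native_con = duckdb.connect("pipeline.duckdb")

# ...and ibis would wrap the already-open connection instead of opening
# its own (hypothetical API, shown only to illustrate the idea)
backend = ibis.duckdb.from_connection(native_con)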

lostmygithubaccount commented 3 months ago

hi @rudolfix, I'm working on Ibis and we were just discussing dlt and a potential integration! I'm curious how you found Ibis per:

we were looking for decent dataframe -> sql thing from the very start

we'd be happy to help move this along, particularly if there are any questions we can answer. in my cursory look at dlt and its destinations last week, basically all except the dedicated vector databases are covered as Ibis backends. let us know if we can help move this forward!

zilto commented 2 months ago

Hi to you both! I recently spent a decent amount of time with dlt + Ibis and I think there's a very clean abstraction for handing off from dlt to Ibis.

dlt side

From the dlt perspective, users pass credentials to create a connection to their Destination (there are several ways to do so). The connection is attached to the Pipeline and is currently available through the .sql_client() method (user guide, SqlClientBase class). For example, the native duckdb connection is available through:

pipeline.sql_client().open_connection()
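More generally, a query can be run over that same managed connection (a short sketch; the pipeline name and table are placeholders):

import dlt

pipeline = dlt.pipeline(pipeline_name="demo", destination="duckdb")

# the client opens/closes the native connection and executes SQL on the destination
with pipeline.sql_client() as client:
    rows = client.execute_sql("SELECT * FROM my_table LIMIT 5")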

ibis side

In the upcoming Ibis major release, backends store their connection as a self.con attribute, assigned inside their .do_connect() method.

integration

To hand off the connection from dlt to Ibis, I got this working:

import ibis
import dlt

pipeline = dlt.pipeline(destination="duckdb", ...)

ibis.set_backend("duckdb")
ibis_connection = ibis.get_backend()  # will return non-connected backend

ibis_connection.con = pipeline.sql_client().open_connection()

ibis_connection.list_tables()   # will successfully read data

TODO

  1. IMO, the integration should live in the dlt repo and be accessible through an API similar to the SQL client (a sketch of such a helper follows after this list):

with pipeline.ibis_client() as client:
    client.list_tables()

  2. With dlt and Ibis both involved, we need to define explicit behaviors for opening/closing connections. For instance, how should Ibis behave if the dlt Pipeline closes its connection? I see there is a .native_connection attribute (ref), but I had trouble keeping the connection open when assigning it to ibis_connection.con.

  3. This extension will work for all dlt Destinations that are also an Ibis backend. We should gracefully handle unsupported hand-offs.

  4. [not required] On the Ibis side, we could extend the top-level connection API to allow ibis.set_backend() to receive a native db connection (duckdb, postgres, etc.) and automatically pick the proper backend, or add a new method for this.
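A hypothetical sketch of the helper from item 1, combining the snippets above (ibis_client does not exist in dlt; it assumes a duckdb destination and the .native_connection attribute from item 2):

import contextlib

import dlt
import ibis

@contextlib.contextmanager
def ibis_client(pipeline: dlt.Pipeline):
    # assumes a duckdb destination; a real helper would map the dlt
    # destination type to the matching ibis backend
    ibis.set_backend("duckdb")
    backend = ibis.get_backend()
    with pipeline.sql_client() as sql_client:
        # hand the dlt-managed native connection to the ibis backend;
        # it stays open for the duration of the `with` block
        backend.con = sql_client.native_connection
        yield backend

# usage:
# with ibis_client(pipeline) as client:
#     client.list_tables()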

rudolfix commented 2 months ago

@lostmygithubaccount @zilto integrating ibis via ibis_client() is indeed straightforward and we could start there. Initially I'd probably go with a helper function (like the dbt and airflow helpers we have) that, when provided with an instance of a pipeline or destination, returns a working ibis client.

What about reading parquet files? There's a way to register a parquet file and query it. Are we able to register parquet files with fsspec? What about parquet files that have different but backward compatible schemas (columns are appended)? Could we still register and query them?

My goal here is to use ibis as the dataframe engine :) and expose it as I imagined in the initial post. So whenever users want to interact with a dataset via dataframes, they get an ibis client; if they want to interact via SQL, they get a (more or less) dbapi connection. The interface is partially inspired by what duckdb does. What is your take on this? Maybe I go too far with hiding what is really under the hood.

rudolfix commented 2 months ago

@zilto

With dlt and Ibis involve, we need to set explicit behaviors to open/close connections. For instance, how should Ibis behave if the dlt Pipeline closes its connection? I see there is a .native_connection attribute (ref), but I had trouble keeping the connection open if assigning it to ibis_connection.con

here we could do two things:

zilto commented 2 months ago

Doing ELT

With Extract, Transform, Load (ETL) as a frame of reference, a dlt pipeline does EL, and the first valuable integration would be adding T with Ibis. A core value proposition of Ibis is executing transformations directly on the backend where you loaded the data and potentially returning results in-memory.

To make this possible:

^This is where there's immediate value; it just needs a bit of coordination.
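For example, a T step running on the backend could look like this (a sketch assuming an orders table that dlt loaded into duckdb; all names are made up):

import ibis

con = ibis.duckdb.connect("pipeline.duckdb")
orders = con.table("orders")

# the aggregation compiles to SQL and runs on the backend;
# only the result comes back in-memory as a dataframe
daily = orders.group_by("order_date").aggregate(revenue=orders.amount.sum())
df = daily.execute()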

Doing ETL

I'm now more familiar with the dlt internals (extract, normalize, load); using loader_file_format="parquet", we effectively have E -> parquet -> L. @rudolfix is this why there was mention of parquet? We could effectively load extracted data in batch or streaming via Ibis T between dlt's E and L steps.

Ibis code is primarily about building "expressions" until an "execution operation" (e.g., insert data, return as dataframe). To start defining expressions, Ibis needs a table, which can be an actual table on a backend, a memtable (ref) that you can load from a pyarrow.Table (loader files), or even just a schema with an unbound table (ref). For your question about columns, this is cool because it means we could validate Ibis expressions based on dlt schemas, without passing any data through.
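A small sketch of those two starting points (the columns are made up):

import ibis
import pyarrow as pa

# a memtable built from a pyarrow.Table, e.g. the contents of a parquet loader file
arrow_table = pa.table({"id": [1, 2], "value": ["a", "b"]})
t = ibis.memtable(arrow_table)

# an unbound table built from a schema only: expressions on it are
# validated against the schema without passing any data through
u = ibis.table({"id": "int64", "value": "string"}, name="my_table")
expr = u.filter(u.id > 1).select("value")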

The dlt schema evolution / data contracts plus the Ibis and Substrait relational algebra could provide full lineage, granular "diffs", and visibility into breaking changes.

sh-rp commented 1 week ago

I have experimented a bit with this here: https://github.com/dlt-hub/dlt/pull/1491. There is no proper way to hand over native connections to ibis backends at the moment. For now I am getting the backends and just setting the .con property, but this does not work for most destinations, so there'd have to be some work on the ibis project to make this viable.

sh-rp commented 1 week ago

@lostmygithubaccount are there any plans to allow sharing an open connection with ibis? You can see in my code that I am just setting the con property, but a lot of the setup code that would run when executing do_connect will actually not run in this case.

lostmygithubaccount commented 1 week ago

hi @sh-rp, let me try to pull in one of the more experienced engineers on the team -- some initial answers:

[ins] In [1]: import ibis

[ins] In [2]: con = ibis.get_backend()

[ins] In [3]: con
Out[3]: <ibis.backends.duckdb.Backend at 0x1081d5750>

[ins] In [4]: con.con
Out[4]: <duckdb.duckdb.DuckDBPyConnection at 0x10c8dbab0>

I don't know if using this connection directly while Ibis is using it at the same time is well-defined behavior.

then there is an open issue w/ this ask: https://github.com/ibis-project/ibis/issues/8877

sh-rp commented 1 week ago

@lostmygithubaccount Ah yes, thanks for pointing me to that issue, that is exactly what I'd need. I'll comment there.