rudolfix opened this issue 3 months ago
hi @rudolfix, I'm working on Ibis and we were just discussing dlt and a potential integration! I'm curious how you found Ibis, per:

> we were looking for a decent dataframe -> sql thing from the very start

we'd be happy to help move this along, particularly if there are any questions we can answer. in my cursory look at dlt and its destinations last week, basically all except the dedicated vector databases are covered as Ibis backends. let us know if we can help move this forward!
Hi to you both! I recently spent a decent amount of time with dlt + Ibis and I think there's a very clean abstraction to hand off dlt to Ibis.

From the dlt perspective, users pass credentials to create a connection to their Destination (there are several ways to do so). The connection is attached to the Pipeline and currently available through the `.sql_client()` method (user guide, SqlClientBase class). For example, the native duckdb connection is available through `pipeline.sql_client().open_connection()`.

In the upcoming Ibis major release, backends are assigned a `self.con` attribute inside their `.do_connect()` method, which is used for the connection.
To hand off the connection from dlt to Ibis, I got this working:

```python
import ibis
import dlt

pipeline = dlt.pipeline(destination="duckdb", ...)

ibis.set_backend("duckdb")
ibis_connection = ibis.get_backend()  # will return a non-connected backend
ibis_connection.con = pipeline.sql_client().open_connection()
ibis_connection.list_tables()  # will successfully read data
```
```python
with pipeline.ibis_client() as client:
    client.list_tables()
```
With both dlt and Ibis involved, we need to set explicit behaviors for opening/closing connections. For instance, how should Ibis behave if the dlt Pipeline closes its connection? I see there is a `.native_connection` attribute (ref), but I had trouble keeping the connection open when assigning it to `ibis_connection.con`.
This extension will work for all dlt Destinations that are also an Ibis Backend. We should gracefully handle unsupported hand-offs.
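A graceful-failure check could be as simple as a lookup table from dlt destination names to Ibis backend names. The mapping and function below are purely illustrative (this is not an existing dlt or Ibis API), just a sketch of what "handle gracefully" could mean:

```python
# Hypothetical sketch: map dlt destination names to Ibis backend names and
# fail loudly when a destination has no corresponding Ibis backend.
SUPPORTED_BACKENDS = {
    "duckdb": "duckdb",
    "postgres": "postgres",
    "bigquery": "bigquery",
    "snowflake": "snowflake",
}

def ibis_backend_for(destination_name: str) -> str:
    """Return the Ibis backend name for a dlt destination, or raise."""
    try:
        return SUPPORTED_BACKENDS[destination_name]
    except KeyError:
        raise NotImplementedError(
            f"dlt destination {destination_name!r} has no Ibis backend; "
            "vector databases, for example, are not supported"
        )
```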
[not required] On the Ibis side, we could extend the top-level connection API to allow `ibis.set_backend()` to receive a native db connection (duckdb, postgres, etc.) and automatically assign the proper backend, or add a new method?
@lostmygithubaccount @zilto integrating ibis via `ibis_client()` is indeed straightforward and we could start there. initially I'd probably go with a helper function (like we have for dbt and airflow) that, when provided with an instance of a pipeline or destination, would return a working ibis client.
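Such a helper could look roughly like the sketch below. Everything here is hypothetical: `ibis_helper` is not an existing dlt function, and the attribute access on the pipeline is illustrative; it only combines the hand-off shown earlier in the thread:

```python
# Hypothetical helper in the spirit of dlt's dbt/airflow helpers: given a
# pipeline, return a connected Ibis backend. Not an existing dlt API.
def ibis_helper(pipeline):
    import ibis  # imported lazily, like other optional dlt helper deps

    # Illustrative: assume the pipeline exposes its destination name,
    # e.g. "duckdb", and that an Ibis backend of the same name exists.
    destination_name = pipeline.destination.destination_name
    ibis.set_backend(destination_name)
    backend = ibis.get_backend()  # non-connected backend instance
    # Hand over dlt's open native connection, as in the snippet above.
    backend.con = pipeline.sql_client().open_connection()
    return backend
```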
What about reading parquet files? There's a way to register a parquet file and query it. Are we able to register parquet files with fsspec? What about parquet files that have different schemas (but backward compatible, i.e. columns are appended)? Could we still register and query them?
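On the backward-compatible schema question: DuckDB's `read_parquet` has a `union_by_name` option that aligns files by column name and fills columns missing from older files with NULLs, which matches the "columns are appended" case. A minimal sketch that builds such a query from a list of files (the file paths are illustrative; via fsspec/httpfs these could also be remote paths):

```python
def parquet_union_query(files: list[str]) -> str:
    """Build a DuckDB query over parquet files with evolving schemas.

    union_by_name=true aligns columns by name across files, so older
    files that lack newly appended columns are padded with NULLs.
    """
    file_list = ", ".join(f"'{f}'" for f in files)
    return f"SELECT * FROM read_parquet([{file_list}], union_by_name=true)"
```

The resulting string could then be run with e.g. `duckdb.sql(parquet_union_query([...]))` on the temporary database that backs the filesystem destination.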
My goal here is to use ibis as the dataframe engine :) and expose it as I imagined in the initial post. so whenever users want to interact with a dataset via dataframes, they get an ibis client; if they want to interact via sql, they get a (more or less) dbapi connection. the interface is partially inspired by what duckdb does. what is your take on this? maybe I go too far with hiding what is really under the hood.
@zilto

> With both dlt and Ibis involved, we need to set explicit behaviors for opening/closing connections. For instance, how should Ibis behave if the dlt Pipeline closes its connection? I see there is a .native_connection attribute (ref), but I had trouble keeping the connection open when assigning it to ibis_connection.con
here we could do two things: extend the `sql_client` interface to "borrow" the connection to an external object that becomes responsible for managing it, while the `sql_client` goes into a closed state. something like rust does...

With Extract, Transform, Load (ETL) as a frame of reference, a dlt pipeline does EL, and the first valuable integration would be adding T with Ibis. A core value prop of Ibis is executing transformations directly on the backend where you loaded data and potentially returning results in-memory.
To make this possible:
^This is where there's immediate value; it just needs a bit of coordination.
I'm now more familiar with the dlt internals (extract, normalize, load): using `loader_file_format="parquet"`, we effectively have E -> parquet -> L. @rudolfix is this why there was mention of parquet? We could effectively load extracted data in batch or streaming via an Ibis T step between dlt's E and L steps.
Ibis code is primarily about building "expressions" until an "execution operation" (e.g., insert data, return as dataframe). To start defining expressions, Ibis needs a table, which can be an actual table on a backend, a `memtable` (ref) that you can load from a `pyarrow.Table` (loader files), or even just a schema with `table` (ref). For your question about columns, this is cool because it means we could validate Ibis expressions based on dlt schemas, without passing any data through.
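The schema-only validation idea can be illustrated without moving any data: given a dlt-style schema (column names mapped to types), check an expression's column references before execution. With Ibis itself you would build the same check from an unbound `ibis.table(schema, name=...)`; the sketch below is a plain-Python stand-in with illustrative names:

```python
def validate_columns(schema: dict[str, str], referenced: list[str]) -> list[str]:
    """Return the referenced columns that are missing from the schema.

    An empty result means the expression only touches known columns, so
    it could be validated against a dlt schema before any data is loaded.
    """
    return [col for col in referenced if col not in schema]

# Illustrative dlt-style schema: column name -> data type
users_schema = {"id": "int64", "name": "string"}
```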
The dlt schema evolution / data contract + the Ibis and Substrait relational algebra could provide full lineage and granular "diff" and visibility over breaking changes
I have experimented a bit with this here: https://github.com/dlt-hub/dlt/pull/1491. There is no proper way to hand over native connections to ibis backends at the moment. For now I am getting the backends and just setting the `.con` property, but this does not work for most destinations, so there'd have to be some work on the ibis project side to get this working.

@lostmygithubaccount are there any plans to allow sharing of an open connection with ibis? You can see in my code that I am just setting the `con` property, but a lot of the setup code that would run when executing `do_connect` will actually not run in this case.
hi @sh-rp, let me try to pull in one of the more experienced engineers on the team -- some initial answers:
```python
[ins] In [1]: import ibis

[ins] In [2]: con = ibis.get_backend()

[ins] In [3]: con
Out[3]: <ibis.backends.duckdb.Backend at 0x1081d5750>

[ins] In [4]: con.con
Out[4]: <duckdb.duckdb.DuckDBPyConnection at 0x10c8dbab0>
```
I don't know if using this connection while using Ibis at the same time is well-defined behavior.
then there is an open issue w/ this ask: https://github.com/ibis-project/ibis/issues/8877
@lostmygithubaccount Ah yes, thanks for pointing me to that issue, that is exactly what I'd need. I'll comment there.
Background

ibis (https://github.com/ibis-project/ibis) is a library that translates dataframe expressions into SQL statements and then executes them in the destination. they do nice work of compiling the final SQL statement with sqlglot (so the resultant SQL is probably quite optimized).

We have a large overlap in destinations and we were looking for a decent dataframe -> sql thing from the very start. it seems that's it: we can easily build a helper that exposes any dlt dataset as a dataframe, share credentials etc.

Implementation

We can integrate deeply or via a helper scheme. In the case of a helper, we allow users to get an `ibis` connection from a `dlt` destination and/or pipeline. The UX will be similar to the `dbt` helper.

Deep integration means that we expose the loaded data from the `Pipeline`, `DltSource` and `DltResource` instances.

Implementation is straightforward for sql-like destinations. We won't support vector databases. It would be really interesting to support the filesystem destination as above, ie. by registering the json and parquet files in a temporary duckdb database and then exposing that database for `ibis` and `sql` access methods.

Ibis Connection sharing

We are discussing a connection sharing approach with ibis here: https://github.com/ibis-project/ibis/issues/8877. As mentioned in the comments there, we could build it in a way that we manage the connection and ibis provides backends that accept an open connection and DO NOT need any additional dependencies.