duckdb / dbt-duckdb

dbt (http://getdbt.com) adapter for DuckDB (http://duckdb.org)
Apache License 2.0

Support for delta #241

Open Yacobolo opened 10 months ago

Yacobolo commented 10 months ago

Looking forward to Delta support. This would enable us to run a poor man's data lakehouse! Do you need any help? What is the ETA? This year?

jwills commented 9 months ago

Ack, sorry for the lag here @Yacobolo, I was on the road and missed this going by. I would like to have a plugin that supports Delta akin to the one I have for Iceberg; I'm assuming it would use the deltalake python package, but I personally don't have access to a Delta lake instance and tbh don't really care enough about learning how to set up a real one to do it myself "for fun."

However, if you (or anyone else!) do have a Delta lake instance and you know how it should be configured as a dbt-duckdb plugin, I would most definitely be happy to merge it in.

milicevica23 commented 9 months ago

Hi, @jwills, I would like to try this integration. This would be my first contribution, so I would appreciate some help and guidance at the beginning.

I did a first draft of the read plugin integration here, and in parallel I am building an example project here where I showcase it.

Here is the source configuration, which loads data as a source with file and projection pruning

What workflow works best for you to give feedback?

jwills commented 9 months ago

Hey @milicevica23, thanks so much for taking a crack at this!

The code as-written makes sense to me, but I have to be honest that I don't have a great sense for how folks actually use the deltalake python module in the real world-- like, do folks really use delta tables w/o a catalog? https://delta-io.github.io/delta-rs/python/usage.html#loading-a-delta-table

milicevica23 commented 9 months ago

The nice thing is that you can, but don't have to, use a catalog to know where your table is, and I thought I would implement support for both ways, or at least try to. You can think of it as adding a new file format for external files; not everybody who is on-prem or doing simple projects has a catalog. I would be happy to hear feedback from others.

Yacobolo commented 9 months ago

Same here, the main use case is not the catalog, but more the metadata it generates together with the ACID transactions and time travel / change history🔥
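To make the "no catalog" point concrete: a Delta table is a directory of data files plus a `_delta_log` folder of numbered JSON commits, and the table state at any version comes from replaying those commits. Below is a minimal stdlib-only sketch of that replay, under loud assumptions: it handles only `add` and `remove` actions, ignores checkpoints, schema and protocol actions, and the helpers `write_commit` and `build_demo_table` are hypothetical names that fake a tiny log for the demo. It is not how the deltalake package or dbt-duckdb reads tables.

```python
import json
import tempfile
from pathlib import Path


def active_files(table_path, version=None):
    """Replay the Delta log and return the data files live at `version` (latest if None)."""
    log_dir = Path(table_path) / "_delta_log"
    # Commit files are named by zero-padded version; skip checkpoints and other files.
    commits = sorted(p for p in log_dir.glob("*.json") if p.stem.isdigit())
    files = set()
    for commit in commits:
        if version is not None and int(commit.stem) > version:
            break  # time travel: stop replaying at the requested version
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one JSON action per line
            if "add" in action:
                files.add(action["add"]["path"])
            elif "remove" in action:
                files.discard(action["remove"]["path"])
    return sorted(files)


def write_commit(table_path, version, actions):
    """Hypothetical demo helper: write one commit file with the given actions."""
    log_dir = Path(table_path) / "_delta_log"
    log_dir.mkdir(parents=True, exist_ok=True)
    commit = log_dir / f"{version:020d}.json"
    commit.write_text("\n".join(json.dumps(a) for a in actions) + "\n")


def build_demo_table(root):
    """Hypothetical demo helper: three commits, the last one replaces part-0."""
    write_commit(root, 0, [{"add": {"path": "part-0.parquet"}}])
    write_commit(root, 1, [{"add": {"path": "part-1.parquet"}}])
    write_commit(root, 2, [{"remove": {"path": "part-0.parquet"}},
                           {"add": {"path": "part-2.parquet"}}])
    return root


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        build_demo_table(root)
        print(active_files(root, version=0))  # ['part-0.parquet']
        print(active_files(root))             # ['part-1.parquet', 'part-2.parquet']
```

Everything needed to find the current files, or the files as of an older version, lives next to the data, which is why a path alone is enough to read the table.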

jwills commented 9 months ago

Alright, super cool. So @milicevica23 if you would put your change together as a PR and other folks on this thread can weigh in on any additional config options we need to support those use cases, that would be great!

milicevica23 commented 9 months ago

Sure, I will open a draft PR.

The things still to do

Feel free to add new ideas or topics.

I am not used to the PR process on GitHub, so feel free to rewrite or change things to fit your needs and best practices.

geoHeil commented 1 day ago

How would the new delta kernel (https://duckdb.org/2024/06/10/delta.html) work here to simplify and perhaps speed up access to Delta-based data?

geoHeil commented 1 day ago

Answer: simply add the extension, if the platform is supported: https://duckdb.org/docs/extensions/delta#supported-duckdb-versions-and-platforms
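For anyone who wants to try it outside dbt first, here is a rough sketch of what "adding the extension" looks like from Python. It assumes the `duckdb` package is installed and that the platform can download the `delta` extension. The function names `delta_scan_query` and `read_delta` are made up for illustration, and column names are interpolated as-is, so they are assumed to be trusted identifiers. In dbt-duckdb itself, the profile's `extensions` list is probably the place to load it.

```python
def delta_scan_query(table_path, columns=None):
    """Build a SELECT over the delta extension's delta_scan table function."""
    projection = ", ".join(columns) if columns else "*"
    escaped = table_path.replace("'", "''")  # escape single quotes in the path literal
    return f"SELECT {projection} FROM delta_scan('{escaped}')"


def read_delta(table_path, columns=None):
    """Run the query in an in-memory DuckDB and return the rows as tuples."""
    import duckdb  # imported lazily so the query builder works without DuckDB installed

    con = duckdb.connect()
    con.execute("INSTALL delta")  # downloads the extension on first use
    con.execute("LOAD delta")
    return con.execute(delta_scan_query(table_path, columns)).fetchall()
```

Calling `read_delta("./path/to/table", ["id"])` would then select only the `id` column; the path here is a placeholder, not a real table.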