gui-elastic opened this issue 8 months ago
Hi @gui-elastic, what you explained above is more or less what I already described in https://github.com/duckdb/dbt-duckdb/pull/284#issuecomment-1914351933, and it is the reason I started implementing the refactoring. The problem there is that we have to make sure we don't break the current process, and there are also some open questions I still have to figure out.
That said, in the next few days I will try to fill the gaps in the refactoring, and I am waiting for feedback on the general code flow. The Iceberg implementation afterwards should be straightforward, because the plugin's store function will receive the Arrow table / record batch directly.
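To make that concrete, here is a minimal, hypothetical sketch of what a plugin could look like once the refactoring lands. `BasePlugin` and the `initialize`/`store` hooks exist in dbt-duckdb today, but the idea that `store` receives a `pyarrow.Table` directly, the extra parameter, and the config fields used below are assumptions about the refactored API, not merged behavior.

```python
# Hypothetical sketch only: assumes the refactored API passes the
# materialized relation into store() as a pyarrow.Table alongside the
# usual target config. The extra parameter and the config fields are
# assumptions, not current dbt-duckdb behavior.
import pyarrow as pa
import pyarrow.parquet as pq

from dbt.adapters.duckdb.plugins import BasePlugin


class LakehouseWriterPlugin(BasePlugin):
    def initialize(self, plugin_config: dict):
        # e.g. pick up an object-store location from the plugin config in profiles.yml
        self.base_path = plugin_config.get("base_path", "/tmp/lakehouse")

    def store(self, target_config, arrow_table: pa.Table):
        # With the Arrow data handed over directly, the plugin can write to any
        # Arrow-speaking sink without first materializing a local file.
        destination = f"{self.base_path}/{target_config.relation.identifier}.parquet"
        pq.write_table(arrow_table, destination)
```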
Happy to hear your feedback.
Hey @milicevica23
I simply love the idea. I already think dbt-duckdb is an amazing project, but with this improvement it will be on another level: usable for Data Lakehouse architectures and for reading and writing Delta and Iceberg tables.
When this refactoring is merged, please let me know; I will be glad to test it, including by writing custom plugins if needed.
Yes, I think so too, and I believe this improvement will enable a bunch of new use cases, since anything that speaks Arrow can be integrated. You might also be interested in our blog post, where we describe how we "futuristically" imagine the possible direction:
https://georgheiler.com/2023/12/11/dagster-dbt-duckdb-as-new-local-mds/
I would encourage you to subscribe to the refactoring pull request and look into the code; I am happy to chat about and discuss it. You can find me in the dbt Slack.
Thank you so much!
Just to confirm, the refactoring PR is this one https://github.com/duckdb/dbt-duckdb/pull/332, correct?
I will take a look at the blog post right now. Thx!
Hey @milicevica23! Any update on this? I see that the PR has been stale/draft for some time. Is there any way to advance and push this forward? Thanks for the amazing work :)
Hi @MRocholl, I am not working on this feature right now because I am swamped privately. When I was doing the refactoring, I didn't have time to go over and comprehend all the options and breaking changes produced by the pull request, or to guarantee that all the existing use cases would still work as expected. I am not sure how to proceed here. From a time-management perspective I could maybe take a look again, but we have to find a proper way to ensure that no new breaking changes are introduced, and that is the real challenge. The open points and current status should already be documented in the PR, so there is not much more to add from that side. I am happy to hear suggestions and discuss how to move forward. cc @jwills
Yeah, I think the ideal here is always to rely on DuckDB + its extensions to do this reading/writing as much as possible, rather than having dbt-duckdb do it (and in the process turn into its own sort of data-catalog-type thing, which is really not what I was going for when I started down this path, but here we are). This pattern already works well for e.g. Postgres and MySQL via the ATTACH functionality, and I'm hopeful that we will have the same support in place over time for external systems like Iceberg and Delta.
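For reference, the ATTACH pattern looks roughly like this through the DuckDB Python API; the connection string and table names are placeholders, and a running Postgres with the right credentials is assumed:

```python
# Sketch of the ATTACH pattern: let DuckDB's postgres extension do the
# reading/writing itself. Connection string and names are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL postgres; LOAD postgres;")
con.execute("ATTACH 'dbname=analytics host=localhost user=dbt' AS pg (TYPE POSTGRES)")
# Writes go through DuckDB and land directly in the attached Postgres database.
con.execute("CREATE TABLE pg.public.daily_totals AS SELECT 42 AS total")
```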
Just like @milicevica23, I'm super busy with the actual job I am paid to do (which unfortunately doesn't involve all that much DuckDB).
Thank you both for the fast reply. As @jwills said, I believe a lot can already be done with the extensions that DuckDB ships itself, either by hooking a post-hook that uses COPY statements or by using the ATTACH functionality. I might take a stab at the Iceberg plugin via pyiceberg in a PR, but I will have to see if I can make the time. The easiest option would be to wait for the DuckDB team to eventually ship extensions for all of these cases. Thank you anyway, @milicevica23 @jwills, for the work you have already put into this.
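For what it's worth, the post-hook + COPY route boils down to statements like the one below, shown here through the DuckDB Python API; in a dbt-duckdb project it would normally live in a model's post_hook config. The bucket, model, and column names are placeholders, and S3 credential setup is omitted.

```python
# Illustration of the COPY approach: export a finished model to object storage
# using DuckDB's own httpfs extension. Bucket/model/column names are placeholders
# and S3 credentials are assumed to be configured separately.
import duckdb

con = duckdb.connect("dev.duckdb")
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute(
    "COPY (SELECT * FROM my_model) "
    "TO 's3://my-bucket/my_model' (FORMAT PARQUET, PARTITION_BY (event_date))"
)
```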
Hello,
Recently, pyiceberg 0.6.0 was released, which allows writing Iceberg tables without needing tools like Spark or Trino.
I was about to write a custom plugin to implement the writing feature; however, I see that when using the external materialization with a custom plugin, the output data is first stored locally and then read back and ingested into the final target. For Iceberg and Delta this does not seem like a good solution. Instead of storing the data on disk, it would be better to simply load an Arrow table and write it to the final destination (e.g., S3 in Iceberg format).
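For context, the pyiceberg >= 0.6.0 write path from an Arrow table looks roughly like this; the catalog name and table identifier are placeholders, and the catalog is assumed to be configured via pyiceberg's standard configuration (e.g. `~/.pyiceberg.yaml`):

```python
# Sketch of writing an Arrow table straight to an Iceberg table with
# pyiceberg >= 0.6.0, with no intermediate file on disk. Catalog name and
# table identifier are placeholders; the table is assumed to already exist.
import pyarrow as pa
from pyiceberg.catalog import load_catalog

arrow_table = pa.table({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

catalog = load_catalog("default")               # resolved from pyiceberg's config
iceberg_table = catalog.load_table("analytics.orders")
iceberg_table.append(arrow_table)               # write support added in 0.6.0
```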
I saw this thread: https://github.com/duckdb/dbt-duckdb/pull/332#issuecomment-1963017721, so I would like to ask whether there is any ETA for this feature. It would be amazing to have, even for production workloads with a Data Lakehouse architecture.
This comment explains well what needs to change to use the Iceberg writer in the best way possible: https://github.com/duckdb/dbt-duckdb/pull/284#issuecomment-1914351933