refactoring-external-materialization

milicevica23 commented 7 months ago

Hi @jwills, I wanted to open a pull request to ask questions and potentially propose the refactoring.

I will try in the next few days to list all the requirements and showcase all the implementation of the current external materialization here: https://github.com/l-mds/demo-dbt-duckdb-external-materialization

I would also like to go through that list and implement the missing tests before I start to refactor to ensure that the API stays the same I would also have some questions and need support there.

jwills commented 7 months ago

Ah these are all good questions, will inline some comments shortly

milicevica23 commented 7 months ago

Hi @jwills, Thank you. I will try to add my comments and further questions and, in the next few days, make a first poc of how I think this should work. Please be critical

The main goal, for now, is to showcase the code structure, then we can go into details.

To simplify a bit, I will move out from the scope for now:

glue registration -> I don't have enough experience with the glue, so I have to check it out and will come to that afterward (it should work before merging)
arrow batch streaming -> so the first version will be with the arrow Table, which is the exact materialization, rather than exporting in the batches. see https://twitter.com/milicevica23/status/1754802477575078041 (important for delta)
assume that I don't reference a model in the different run so that functionality from "register_upstream_external_models" is not needed for now (it should work before merging)

milicevica23 commented 7 months ago

So, I added the structure and will start with the simple implementation of the native plugin, which will hopefully show if this makes sense.

This is still a work in progress, but If you have some comments or questions I would be happy to answer them.

I have an general question. Is there some concept of the release candidate / nightly that people can try this before they start to use it :D i am not sure if i can get all 1-to-1 api in the first run

jwills commented 7 months ago

So, I added the structure and will start with the simple implementation of the native plugin, which will hopefully show if this makes sense.

This is still a work in progress, but If you have some comments or questions I would be happy to answer them.

I have an general question. Is there some concept of the release candidate / nightly that people can try this before they start to use it :D i am not sure if i can get all 1-to-1 api in the first run

There isn't a nightly or some such thing as of yet; if we add it, we will add it as a net-new thing that is disabled by default (which is how we've done a bunch of other things, like e.g. the fast CSV loading for dbt seed)

milicevica23 commented 7 months ago

Hi @jwills, can you take a look. I did the first draft of the native functionality and a new test which should cover some of the general cases

milicevica23 commented 7 months ago

Hi @jwills, I think I finished the first round of refactoring, so maybe you could take a look and see if this makes sense to you. What I did:

refactored the external materialization so that the creation of the external tables is handled by the plugin
the native plugin is a new plugin which does the parquet, json, csv
adapted the Excel and sqlalchemy plugins to use new code flow.
tried to fulfill and make all test run that were in place without big change
adapter upstream macro to register needed tables and data frames

The new code flow provides a nice feature that the exported file can be pointed directly from the target destination. It sounds trivial for native format but it is especially important for the further implementations of the iceberg, delta, (big query, snowflake ?) adapters, which can now be implemented within external materializations. It also provides a very nice convenient way to register df in the upstream macro

What is there still to do: missing:

[ ] Glue integration -> currently, i skipped it but will take a look and would maybe need your help for this because I never used it
[ ] python part of the external materialization -> i don't know so much about this and didn't find tests with that so I am not sure how to go about this but will take a look
[ ] buenavista integration -> what is the status there? I have to see what should be supported and adapt to that
[ ] other general stuff that I overlooked?

has to do:

[x] native plugin has to be initialized on its own as default, and I have to find a place where to do it
[x] find a way that native plugin load doesnt go over the arrow format but directly
[ ] add more tests for native, excel, sqlalchemy and others -> we have to take use cases from the community in order to cover all cases (i need here your help to motivate people :D)
[ ] look into the upstream macro because i think that with the old implementation, not all cases were done correctly because there is the problem with the rendering -> will try to explain in the future but the https://stackoverflow.com/questions/22655031/jinja2-how-to-put-a-block-in-an-if-statement shows how the rendering works and i think that we didn't use it right
[x] showcase the delta plugin and write tests

nice to have:

maybe add some tests with azurite and minio to simulate object storage

I hope this gives you a bit overview of what i did, i am happy to answer all the questions

I still fight with the mypy and formater in the pre-commit if you have some suggestions how to go about this i would appreciate it a lot

jwills commented 7 months ago

@milicevica23 ah that is very cool, thanks for the update! I have some stuff going on this week but I will dive into this on Thursday!

jwills commented 7 months ago

@milicevica23 when you get a chance could you redo the formatting according to the pre-commit run --all-files setup? All of the formatting diffs make this hard to review

duckdb / dbt-duckdb

refactoring-external-materialization #332