johnson-jay-l opened 2 weeks ago
This is a great feature request. I have been trying to figure out a way to wait until all tests have passed before a table is cloned to production, without extra cost.
What we have implemented so far to get the result we want is to add a custom schema for the most important models of a model folder. We run our tests in that custom schema during the build command, and when everything has finished successfully, we repeatedly call a macro that clones the tables from the custom schema to the production schema. This logic lives only in Airflow, not in the dbt repo, so it is difficult to remember and debug when a test fails. I have had a task in my backlog since July 2023 to refactor this macro-based flow into post-hooks. A new materialization type that clones the previous model could solve the problem, although it would create yet another model in the model folder.
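Roughly, the macro we trigger from Airflow has this shape (a simplified sketch, not our actual code; the macro name, schema arguments, and the `create or replace table ... clone` statement assume a warehouse with zero-copy clone, e.g. Snowflake):

```sql
-- macros/promote_to_prod.sql -- illustrative sketch only
{% macro promote_to_prod(tables, source_schema, target_schema) %}
  {% for table_name in tables %}
    {# Zero-copy clone each tested table from the custom schema into prod #}
    {% do run_query(
      'create or replace table ' ~ target.database ~ '.' ~ target_schema ~ '.' ~ table_name ~
      ' clone ' ~ target.database ~ '.' ~ source_schema ~ '.' ~ table_name
    ) %}
  {% endfor %}
{% endmacro %}
```

Airflow then calls something like `dbt run-operation promote_to_prod --args '{tables: [orders, customers], source_schema: staging, target_schema: prod}'` once the build succeeds.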
With our implementation we make sure we never promote untested tables into production; we prefer leaving them outdated instead.
Any thoughts?
@thomasantonakis I've thought about that too. We also use Airflow and do a lot of custom parsing of the dbt manifest to hook in other functionality that dbt does not support natively. However, I would love to get away from those patterns and use more native dbt solutions. I'm okay with the overhead of another model because it reduces the complexity of our custom Airflow logic.
I got a POC pretty close to working here, but it feels a bit hacky: the model.sql would call the `clone()` macro instead of adding the `config()` block. With some guidance, I don't think it would be a big lift to add a proper `clone` materialization for use by database backends that support cloning.
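To illustrate the hacky part, the model ends up looking something like this (hypothetical names, not the actual POC code):

```sql
-- models/orders_clone.sql -- hypothetical shape of the POC
-- Instead of declaring a materialization in a config() block, the model
-- body calls a macro that emits the shallow-clone DDL itself, which is
-- what makes it feel hacky.
{{ clone(ref('orders')) }}
```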
I am open to contributing to this some more but will need guidance. The dbt-databricks folks suggested that I open a feature request with dbt-core.
This dbt-databricks code is very close to what we need, but it only runs from the CLI command `dbt clone`. It would need changes so that it can run as a proper model materialization type.
I also think this is a good feature request. Note that if you want this materialization to be available for any adapter, you'll have to check what zero-copy clone feature each data warehouse offers.
If it's too ambitious to get this approved as a new official dbt materialization, you could instead write a dbt package that provides this materialization.
I have written a dbt package that provides an `acid_clone` materialization that does something like this for Snowflake. In our case, it's important that the clones are created only at the end of the run, and only if the whole run was successful; the point is to make the dbt run "ACID" as far as these final tables are concerned. We create all the clones in a single Snowflake transaction. The package is not public, but I'd be happy to share more information.
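Not the package itself, but the general shape is an `on-run-end` hook along these lines (a simplified sketch: the single-transaction wrapping is elided, and the macro and schema names are made up):

```sql
-- macros/clone_on_success.sql -- illustrative sketch, not the actual package
{% macro clone_on_success(results) %}
  {# dbt passes `results` to on-run-end hooks; only clone if nothing failed #}
  {% set failures = results | rejectattr('status', 'equalto', 'success') | list %}
  {% if execute and failures | length == 0 %}
    {% for res in results if res.node.resource_type == 'model' %}
      {% do run_query(
        'create or replace table final.' ~ res.node.name ~
        ' clone ' ~ res.node.schema ~ '.' ~ res.node.name
      ) %}
    {% endfor %}
  {% endif %}
{% endmacro %}
```

This gets wired up in dbt_project.yml with `on-run-end: "{{ clone_on_success(results) }}"`.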
I noticed a clone.sql in the dbt-adapters repo already, but it looks to have the same problem as the dbt-databricks version here. Why was it implemented with a dependency on the `state` file? Most other materializations do not have that dependency. Could the object type (table/view/clone) be determined at runtime instead of via the `state` file?
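For example, I'd imagine runtime detection could look something like this inside the materialization (just a sketch on my part, not the dbt-adapters code; `source_relation` stands for whatever upstream relation the materialization resolves):

```sql
{# Look up the upstream object in the warehouse at runtime instead of
   reading its type from a state artifact #}
{% set upstream = adapter.get_relation(
    database=source_relation.database,
    schema=source_relation.schema,
    identifier=source_relation.identifier) %}

{% if upstream is none %}
  {{ exceptions.raise_compiler_error('Upstream relation not found: ' ~ source_relation) }}
{% elif upstream.type != 'table' %}
  {{ exceptions.raise_compiler_error('Cannot clone a ' ~ upstream.type) }}
{% endif %}
```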
Is this your first time submitting a feature request?
Describe the feature
We currently "gate" models behind tests by adding a downstream model that does a full table copy via a `table` materialization. This worked at first but has not scaled and gets expensive on pay-as-you-go compute.

Instead of using a `table` materialization, we would like to add a `clone` materialization that does a Databricks shallow clone of the upstream table and adds it to a different schema with the same table name. This would be a very fast operation without an expensive I/O hit. It should only work with upstream models that shallow clone supports (not views, for example).

`clone` models should run just like any other model, including support for the graph operators, hooks, the custom schema macro, the custom database macro, etc. There should not be any dependencies on dbt state artifacts in order to run a model with a `clone` materialization (unlike dbt-databricks's current implementation of the `dbt clone` CLI, linked below). A rough sketch of the shape this could take follows.
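This is not a working implementation, just an illustration of the ask: the `clone_source` config key is an assumption (a real version would resolve the upstream relation from the model's `ref`), and the `shallow clone` statement is Databricks syntax that other adapters would dispatch differently.

```sql
-- macros/materializations/clone.sql -- illustrative sketch only
{% materialization clone, default %}
  {# Source to clone from, passed as a fully qualified name string for
     simplicity; resolving it from the model's ref graph would be better #}
  {%- set source = config.require('clone_source') -%}
  {%- set target_relation = this.incorporate(type='table') -%}

  {{ run_hooks(pre_hooks) }}

  {% call statement('main') -%}
    create or replace table {{ target_relation }} shallow clone {{ source }}
  {%- endcall %}

  {{ run_hooks(post_hooks) }}
  {{ return({'relations': [target_relation]}) }}
{% endmaterialization %}
```

A model would then just set `config(materialized='clone', clone_source='...')` with a placeholder body, since the materialization ignores the compiled SQL.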
Describe alternatives you've considered

`dbt clone`: not supported by models and only works from the `dbt clone` CLI command.

Who will this benefit?
Are you interested in contributing this feature?
Yes
Anything else?
The dbt-databricks `clone` materialization (does not work from a model): https://github.com/databricks/dbt-databricks/blob/main/dbt/include/databricks/macros/materializations/clone.sql