dbt-labs / dbt-core


[Feature] Support for `clone` as a model materialization type #10896

Open · johnson-jay-l opened this issue 2 weeks ago

johnson-jay-l commented 2 weeks ago


Describe the feature

We currently "gate" models behind tests by adding a downstream model that does a full table copy via the `table` materialization. This worked at first but has not scaled, and it gets expensive on pay-as-you-go compute.

Instead of using a `table` materialization, we would like to add a `clone` materialization that does a Databricks shallow clone of the upstream table and registers it in a different schema under the same table name. This would be a very fast operation with no expensive I/O.

It should only work with upstream models whose relation types shallow clone supports (not views, for example).

`clone` models should run just like any other model, including support for graph operators, hooks, the custom schema macro, the custom database macro, etc.

Running a model with a `clone` materialization should not depend on dbt state artifacts (unlike dbt-databricks' current implementation of the `dbt clone` CLI command, linked below).
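
For illustration, opting a model into the proposed materialization could look like the following. The `clone` materialization does not exist in dbt today; the model and schema names here are made up:

```sql
-- models/published/orders_published.sql: hypothetical usage of the proposed
-- materialization. The select defines the upstream dependency as usual; the
-- materialization itself would ignore the query body and emit something like:
--   CREATE OR REPLACE TABLE <target_schema>.orders_published
--   SHALLOW CLONE <upstream_schema>.orders
{{ config(materialized='clone') }}

select * from {{ ref('orders') }}
```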


Are you interested in contributing this feature?

Yes


thomasantonakis commented 2 weeks ago

This is a great feature request. I have been trying to figure out a way to wait until all tests have passed before a table is cloned to production, without extra cost.

What we have implemented so far to get the result we want is to add a custom schema for the most important models of a model folder. We run our tests against that custom schema during the build command. When everything has finished successfully, we repeatedly call a macro that clones the tables from the custom schema to the production schema. This logic lives only in Airflow, not in the dbt repo, so it is difficult to remember and debug when a test fails. I have had an item in my backlog since July 2023 to refactor this post-run macro usage into post-hooks. A new materialization type that clones the previous model could solve the problem, although it would create yet another model in the models folder.

With our implementation we make sure we never expose untested tables to production; we prefer leaving them outdated instead.
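
A rough sketch of the kind of promotion macro described above, invoked via `dbt run-operation` (the macro name and schema names are illustrative, not our exact code, and the `shallow clone` syntax is Databricks'; adjust for your warehouse):

```sql
{% macro promote_tested_tables(tables) %}
  {# Illustrative sketch: zero-copy clone each tested table from the custom
     schema into production. Invoked from Airflow, e.g.:
       dbt run-operation promote_tested_tables --args '{tables: [orders, customers]}' #}
  {% for t in tables %}
    {% do run_query('create or replace table prod.' ~ t ~ ' shallow clone custom_schema.' ~ t) %}
  {% endfor %}
{% endmacro %}
```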

Any thoughts?

johnson-jay-l commented 1 week ago

@thomasantonakis I've thought about that too. We also use Airflow and do a lot of custom parsing of the dbt manifest to hook in functionality that dbt does not support natively. However, I would love to get away from those patterns and use more native dbt solutions. I'm okay with the overhead of another model because it reduces the complexity of our custom Airflow logic.

I got a POC pretty close to working here, but it feels a bit hacky: the model.sql calls a clone() macro instead of adding the config() block. With some guidance, I don't think it would be a big lift to add a proper clone materialization for use by database backends that support cloning.
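
Roughly, the model body in that POC looks like this (`clone()` is our own macro, not a dbt-core API; the argument name is illustrative):

```sql
-- models/published/orders_published.sql: the body calls the custom macro
-- that emits the shallow clone, instead of a config() block plus a select.
{{ clone(source_relation=ref('orders')) }}
```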

I am open to contributing to this some more but will need guidance. The dbt-databricks folks suggested that I open a feature request with dbt-core.

johnson-jay-l commented 1 week ago

This dbt-databricks code is very close to what we need, but it only runs from the `dbt clone` CLI command. It would need changes so that it can run as a proper model materialization type.

https://github.com/databricks/dbt-databricks/blob/main/dbt/include/databricks/macros/materializations/clone.sql

seub commented 1 week ago

I also think this is a good feature request. Note that if you want this materialization to be available for any adapter, you'll have to check what zero-copy clone feature each data warehouse offers.

If it's too ambitious to get this approved as a new official dbt materialization, you could instead write a dbt package that provides this materialization.

I have written a dbt package that provides an "acid_clone" materialization that does something like this for Snowflake. In our case, it's important that the clones are created only at the end of the run, and only if the whole run was successful. The point is to make the dbt run "ACID" as far as these final tables are concerned. We create all the clones in a single Snowflake transaction. The package is not public, but I'd be happy to share more information.
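
A simplified sketch of the idea, as an on-run-end hook that clones only models that built successfully (schema names are placeholders, and this issues the statements one by one rather than in the single transaction our package uses):

```sql
{% macro clone_successful_models(results) %}
  {# Simplified sketch: at the end of the run, zero-copy clone every model
     that built successfully into the production schema. Wired up in
     dbt_project.yml with:
       on-run-end: "{{ clone_successful_models(results) }}" #}
  {% if execute %}
    {% for res in results %}
      {% if res.node.resource_type == 'model' and res.status == 'success' %}
        {% do run_query('create or replace table prod.' ~ res.node.name
            ~ ' clone ' ~ res.node.schema ~ '.' ~ res.node.name) %}
      {% endif %}
    {% endfor %}
  {% endif %}
{% endmacro %}
```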

johnson-jay-l commented 1 week ago

I noticed a `clone.sql` in the dbt-adapters repo already, but it looks to have the same problem as the dbt-databricks version here.

Why was it implemented with a dependency on the state file? Most other materializations do not have that dependency. Could the object type (table/view/clone) be determined at runtime instead of via the state file?
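
For example, something along these lines could replace the state lookup (a sketch only; type values and cache behavior vary by adapter):

```sql
{% macro runtime_relation_type(rel) %}
  {# Sketch: ask the adapter for the relation at runtime instead of reading
     its type from a state artifact. Returns e.g. 'table' or 'view', or
     none if the relation does not exist yet. #}
  {% set found = adapter.get_relation(
      database=rel.database,
      schema=rel.schema,
      identifier=rel.identifier) %}
  {{ return(found.type if found is not none else none) }}
{% endmacro %}
```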