dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

Support for deference to models imported from a package. #3309

Closed: randypitcherii closed this issue 2 years ago

randypitcherii commented 3 years ago

Describe the feature

Today, importing dbt packages works great for things that do not get materialized: macros, custom materializations, analyses, and the like.

Even materialized things work well when the models defined by an external package don't already exist in your target warehouse.

However, consider the use case of a large company with a data foundation team that supports embedded analytics teams in other business units. The embedded teams cannot effectively import models from the data foundation team's dbt project without also rematerializing those models in the shared target warehouse.

In other words, it's not possible today to import packages with models that are already materialized in your target warehouse without rematerializing those models.

To address this, I think it'd be cool if there were a way (similar to Slim CI) for imported packages to be associated with a run artifact that allows a dbt project to defer to those materialized models without rebuilding them. This is just one thought; I think there is a wide selection of ways to address this.

Describe alternatives you've considered

I believe there is some clever overriding you can do to the ref function to point references to imported models at pre-configured locations. I haven't tried to make this work, but even with such logic, it would be a pain to keep the warehouse locations of these models up to date if anything changed in the imported project. This is a painful coordination problem.
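For illustration, here's a minimal sketch of what that overriding might look like, shadowing dbt's builtin ref() via the builtins namespace. The model names and warehouse coordinates here are hypothetical:

    -- macros/ref.sql (a sketch, not a recommendation)
    {% macro ref(model_name) %}
      {# hand-maintained list of imported models that already live in the shared warehouse #}
      {% set foundation_models = ['dim_customers', 'fct_orders'] %}
      {% if model_name in foundation_models %}
        {# point at the pre-existing relation instead of resolving normally #}
        {% do return(api.Relation.create(database='analytics', schema='foundation', identifier=model_name)) %}
      {% else %}
        {# everything else resolves through dbt's normal ref() #}
        {% do return(builtins.ref(model_name)) %}
      {% endif %}
    {% endmacro %}

Every time the foundation project renames or relocates a model, that hard-coded list has to change in every downstream project, which is exactly the coordination problem described above.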

The most obvious alternative is to not let companies break dbt projects into separate repos unless they have entirely independent lineages. This means macros can be shared really easily but model definitions must stay isolated.

Lastly, you could redefine the materialized models as sources in the downstream project. This would break lineage documentation across the entire pipeline.
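For concreteness, that workaround would look something like this in the downstream project (all names hypothetical):

    # models/foundation_sources.yml
    version: 2
    sources:
      - name: foundation
        database: analytics
        schema: foundation
        tables:
          - name: dim_customers
          - name: fct_orders

Downstream models would then select via {{ source('foundation', 'dim_customers') }}, and the docs site would show lineage stopping at a source node rather than continuing into the foundation project's own DAG.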

Additional context

I think this is such a hard problem. Most other package management systems don't have to worry about this because they largely import functionality (like macros). dbt packages also import definitions of entities that may already exist, with the goal of creating those entities only if they don't already exist in some arbitrary location. It's really tough!

Who will this benefit?

This will benefit any organization that splits its dbt models across multiple projects and needs downstream projects to reference models that are already materialized in a shared warehouse.

In more human language: this will typically benefit larger companies.

Are you interested in contributing this feature?

I think I'm about 2 orders of magnitude too dumb to help much here hahahaha, but of course I'd love to.

jtcohen6 commented 3 years ago

@randypitcherii and I had a very cool conversation about this offline, and I want to summarize some of what we discussed here. The words here can be a bit confusing, and the possibilities are quite exciting.

Let's take as our starting point the functionality that exists today, and the use case of a large company with a Data Foundation team supporting embedded analytics teams in other business units. An analyst on an embedded team could "import" the foundational package (fnd_pkg) in a powerful two-pronged way:

  1. Add fnd_pkg to packages.yml and run dbt deps (see the sketch after this list)
  2. Grab the artifacts (manifest.json) from the last time the Foundation team ran the models in fnd_pkg
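For step 1, the packages.yml entry might look like this (the git URL and revision are hypothetical):

    # packages.yml
    packages:
      - git: "https://github.com/example-org/fnd_pkg.git"
        revision: 1.0.0

Running dbt deps then pulls fnd_pkg into the project, so dbt can parse its models alongside the embedded team's own.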

Here's the kicker: Once they've done that, they can accomplish exactly what Randy outlined above by running:

$ dbt run --exclude fnd_pkg --defer --state path/to/foundation/run/artifact

And that's it. dbt knows about the Foundation team's models because fnd_pkg is imported as a package dependency. This run will exclude all of those models; it will look to the foundation package's manifest for the in-warehouse locations of those models; and it will rewrite (defer) references from the downstream package to the upstream package so that they select from exactly where the Foundation team put them.
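To make that concrete, suppose a downstream model selects from a model that fnd_pkg owns (names hypothetical):

    -- models/orders_enriched.sql, in the embedded team's project
    select *
    from {{ ref('fct_orders') }}  -- fct_orders is defined in fnd_pkg

Without --defer, ref('fct_orders') would resolve to the embedded team's own target schema, so the model would first have to be rebuilt there. With --exclude fnd_pkg --defer --state ..., dbt resolves the unselected node from the Foundation manifest instead, compiling to something like:

    select *
    from analytics.foundation.fct_orders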

In two crucial ways, this approach is preferable to redefining the foundation package's models as sources:

  1. Lineage and documentation stay intact across the entire pipeline, rather than being cut off at a source node.
  2. Nobody has to hand-maintain the in-warehouse locations of upstream models in YAML; the Foundation team's manifest is the single source of truth for where they live.

This is all possible today. It's actually one of the use cases we imagined when originally shaping defer functionality (#2527).

Future art?

We could consider making this syntax slicker by turning defer and/or exclude into node configs. That is, rather than needing to specify --exclude fnd_pkg --defer every time, a model (or entire package of models) could be set to always exclude, or to always defer its reference to the state manifest (if available). This feels like a nice-to-have, for now.
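Purely as a sketch of that idea (none of this syntax exists today), such a config might read:

    # dbt_project.yml: hypothetical syntax, not implemented
    models:
      fnd_pkg:
        +exclude: true  # hypothetical: never build these models in this project
        +defer: true    # hypothetical: always resolve their refs from the state manifest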

It isn't currently possible to defer to / compare state against more than one manifest. If there are many foundational packages, all of which want to be imported-and-deferred-to, it would be amazing if dbt could read from multiple manifests to compare state.

In the meantime, it seems like there are two reasonable options:

Lastly, would we consider formally adding a state key in packages.yml? In this workflow, the location of the package's most recent manifest is almost as important as the location of its code (git repository). Perhaps we could, to accomplish what's suggested above (multiple state inputs, each tied to an upstream package), so long as each state is a local file path. For reasons I outlined in #3159, I'm really hesitant about adding and maintaining tons of logic within dbt that's specific to pulling and pushing files from cloud storage vendors. That's what deployment scripts are for, and Airflow, and dbt Cloud :)
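If such a key existed, it might read something like this (hypothetical syntax; the URL and paths are placeholders):

    # packages.yml: hypothetical 'state' key, not implemented
    packages:
      - git: "https://github.com/example-org/fnd_pkg.git"
        revision: 1.0.0
        state: path/to/foundation/run/artifacts  # local path to fnd_pkg's last manifest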

github-actions[bot] commented 2 years ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.