Closed: jaysobel closed this issue 2 years ago
Hi @jaysobel-drizly, what you are describing here makes me think of Calcite's lattices:
https://calcite.apache.org/docs/lattice.html
Although it's used to virtualize a star/snowflake schema, it could be applied to your use case, as a many-to-one relationship can be narrowed to one-to-one. But I may be totally wrong.
note for consideration wrt implementation:
in a world with column-level dependency tracking, this might become moot: if FK relationships are established early in a project and column-level parsing is available, lattices among downstream objects can be derived. So if that is an "intended" future feature, it would probably be good to choose an implementation that does not conflict too hard with it. My feeling is that PK/FK relationships could therefore form the backbone of any implementation.
Bonus points: defining actual key constraints (and enforcing them automatically post-build) gives the optimizer strong hints about how much memory a join will require, and can thus improve performance, so I think leaning into that is a good idea.
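As a minimal sketch of what "enforcing them automatically post-build" could mean, assuming a warehouse such as Snowflake where key constraints are accepted but informational, the post-build step (run e.g. from a dbt post_hook or on-run-end hook) could be as simple as DDL like this; the schema, table, and column names are illustrative:

```sql
-- Run after dim_user / fct_order are (re)built. On warehouses like Snowflake
-- these constraints are not enforced, but they still act as optimizer hints
-- and as machine-readable documentation of the join keys.
alter table analytics.dim_user add primary key (user_id);

alter table analytics.fct_order
    add foreign key (user_id) references analytics.dim_user (user_id);
```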
Hi All,
In my opinion, if you look at it from the fact/dim star schema perspective, why not define your PK/FK where you need it, i.e. at the end of your pipeline? I think that is an easier start than requiring inheritance.
Most stg_ (staging) and int_ (intermediate) models exist just to realize the required transformations.
These PK/FK relationships would be most useful in your end-user-facing layers, such as your dim and fact models.
For your core DWH, the goal is to make sure your data integrity is fine, and you can already use tests to check that referential integrity holds. And of course, if you could define PK/FK here, it would help reduce the manual tests you need to write.
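For reference, this is roughly the kind of manual referential-integrity test (dbt's built-in relationships test) that such PK/FK definitions could replace; the model and column names are just examples:

```yaml
# models/marts/core/schema.yml
version: 2

models:
  - name: fct_order
    columns:
      - name: user_id
        tests:
          - not_null
          # every fct_order.user_id must exist in dim_user.user_id
          - relationships:
              to: ref('dim_user')
              field: user_id
```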
To be a bit more concrete, I think these relationship definitions would live in the .yml file that sits in the folder of each layer (take again the stg/int/dim/fact structure as an example). In each of these YML files you would describe the relationships, which would serve the following functions:
- dbt Metrics could take the relationship definitions from the .yml file in order to compile and build the required joins (for example, to resolve a metric's dimensions that live on joined tables).
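Purely as an illustration of what such a definition could look like (this is not real dbt syntax; the primary_key and foreign_key keys are hypothetical):

```yaml
# models/marts/core/core.yml (hypothetical syntax)
version: 2

models:
  - name: fct_order
    columns:
      - name: order_id
        primary_key: true          # hypothetical key, not a real dbt property
      - name: user_id
        foreign_key:               # hypothetical key, not a real dbt property
          to: ref('dim_user')
          field: user_id
```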
I'm way overdue for responding to this issue. Thanks for opening, and for the great comments!
There are a lot of really neat ideas in here, and I don't have too much to add myself. I'd just add that these two things feel quite different: joins as a compilation convenience within the DAG, and joins as metadata that downstream tooling can consume.

When it comes to metrics, it does feel important that dbt gains an intrinsic understanding of how "final" models join together. This has overlap with other thinking about exposures (#3533), ERDs (https://github.com/dbt-labs/dbt-docs/issues/84), and of course the future of metrics, which don't support joins today.

We're not in a place where we'll be ready to write code for any of this very soon. I think this will be living rent-stabilized in my mind for the next year+. For now, I'm going to convert this to a real discussion, and invite more folks to join in :)
Is there an existing feature request for this?
Describe the Feature
When deriving core dim/fact models, it's often the case that multiple intermediate tables are created to derive specific feature columns of the final table. For example, stg_users may have int_user_feature_1 and int_user_feature_2, which join together with stg_user to produce dim_user.

It is very tempting to define a unifying model like dim_user earlier in the DAG, i.e. as int_user, so that further feature work can benefit from a one-stop shop of columns instead of having to bring them in "a la carte". Accessory unifying models are a bad practice, however: they create a bottleneck in the DAG, which slows run time; they increase the risk of cyclical references by creating many unnecessary dependencies; and they cannot build on themselves. All of this leads to un-unifying efforts down the line.

The best practice is for a dbt developer working on int_user_feature_3 (which references features 1 and 2) to copy-paste the joins contained in dim_user rather than referencing it, in order to keep the DAG nice and orderly.

It would be convenient if dbt could be told of the joins in dim_user, and compile them as needed to serve references earlier in the DAG without creating an actual node. I'll call this abstraction abs_user. Looker users will note that this is how Looker Explores work: a predefined pattern of joins presents a set of selectable things to a user, and SQL is compiled based on what they select to produce the minimum query required for the output. In this case I'm thinking smaller than a Looker Explore, which can have weird granularity interactions; I just mean a table like dim_user with multiple sources of additional columns at the same granularity.

This solves the two problems of a real unifying model node in the DAG. First, abs_user would not be a bottleneck, because it is not a node that must itself be run before the models that reference it; it would actually automate the known best practices for referencing pre-defined features and maximizing parallelization. And second, abs_user could build on itself: it could both be referenced in int_user_feature_3 and know that it's possible to join int_user_feature_3 to stg_user. If you tried to do such a self-reference, the compiled model code would contain a self-reference (or detectable cycle) and fail.
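To make this concrete, here is a purely hypothetical sketch (abs_user is not a real dbt node type, and the schema and column names are made up): a developer writing int_user_feature_3 might select from the abstraction, and dbt, knowing the join pattern behind abs_user and the columns actually used, could compile that reference down to the minimal subquery:

```sql
-- What the developer writes in int_user_feature_3 (hypothetical):
--   select user_id, feature_1 is not null as has_feature_1
--   from {{ ref('abs_user') }}
--
-- What dbt could compile it to, joining only what feature_1 requires:
select user_id, feature_1 is not null as has_feature_1
from (
    select
        stg_user.user_id,
        int_user_feature_1.feature_1
    from analytics.stg_user
    left join analytics.int_user_feature_1 using (user_id)
    -- int_user_feature_2 is omitted because none of its columns were selected
) as abs_user
```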
Further Thoughts

It would likely be that dim_user is the DAG node produced by SELECT * FROM {{ ref('abs_user') }}.

Someone at dbt knows much more than me about the challenge of parsing SQL queries to recognize the source join of selected columns. It sounds like this is one of the benefits of Malloy, which is said to "Produce rich metadata [including] the originating column...". While it would be cool if the whole 'macro' were written in SQL and dbt parsed it, I could see a YAML-based version more like LookML for Explores being great too.
The task of defining minimum-ingredient CTEs for a new model is nicely captured by dbt's "import" style of refs. However, that style just tells developers where to write the code rather than doing it for them, even though the task is generally very well defined.
If I knew more about other web development I could probably name an example of something that imports only the parts of a library actually used by a file.
Describe alternatives you've considered
A macro could be written today that takes a list of columns and, through basic hardcoded branching logic, determines a minimal subquery to access those columns.
dbt's "import" approach to refs defines a clear space in the file for developers to do this task, but it doesn't help them do it. The suggested feature could be used inline, and generate a subquery, or it could be used within a "import" CTE to produce more readable compiled code.
Who will this benefit?
Analytics Engineers
If joins can be considered business logic, then this feature would DRY up considerable business logic in the repo. It would also make doing right by the DAG and writing convenient SQL the same thing, which is rare!
Analysts Doing Analytics Engineering Work
The DAG can be an unintuitive concern for analysts used to writing analytical SQL. This feature lets them write SQL as they would against dim/fact models, without concerning themselves with the underlying lineage and future cycle risks.
Refactor-ers
Refactoring a dbt repo with poorly structured references can be a lot of work. Working through 10 models that use dim_user, trying to peel back which columns they actually need and whether they create cycles, could be very challenging. This feature would allow that step to be carried out once in a central non-model node and distributed as a reference.

Are you interested in contributing this feature?
No
Anything else?
stg/int/dim/fact: https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355
"Imports" Style: https://discourse.getdbt.com/t/why-the-fishtown-sql-style-guide-uses-so-many-ctes/1091