Closed: jaysobel closed this issue 2 years ago
Hi @jaysobel-drizly, what you are describing here makes me think of Calcite's lattices:
https://calcite.apache.org/docs/lattice.html
Although it's used to virtualize a star/snowflake schema, it could be applied to your use case, as a many-to-one relationship can be narrowed to one-to-one. But I may be totally wrong.
note for consideration wrt implementation:
in a world with column-level dependency tracking, this might become moot: if FK relationships are established early in a project and column-level parsing is available, lattices among downstream objects can be derived. So if that is an "intended" future feature, it would probably be good to choose an implementation that does not conflict too hard with it. My feeling is that PK/FK relationships could therefore form the backbone of any implementation.
Bonus points: defining actual key constraints (and enforcing them automatically post-build) gives the optimizer strong hints about how much memory a join will require, and can thus improve performance, so I think leaning into that is a good idea.
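As a minimal sketch of what "enforcing them automatically post-build" could mean, assuming a warehouse such as Snowflake where key constraints are accepted but informational, the post-build step (run e.g. from a dbt post_hook or on-run-end hook) could be as simple as DDL like this; the schema, table, and column names are illustrative:

```sql
-- Run after dim_user / fct_order are (re)built. On warehouses like Snowflake
-- these constraints are not enforced, but they still act as optimizer hints
-- and as machine-readable documentation of the join keys.
alter table analytics.dim_user add primary key (user_id);

alter table analytics.fct_order
    add foreign key (user_id) references analytics.dim_user (user_id);
```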
Hi All,
In my opinion, if you look at it from the fact/dim star schema perspective, why not define your PK/FK where you need it, i.e. at the end of your pipeline? I think that is an easier start than requiring inheritance.
Most stg_ (staging) and int_ (intermediate) models exist just to realize the required transformations.
These PK/FK relationships would be most useful in your end-user-facing layers, such as your dim and fact models.
For your core DWH, the goal is to make sure your data integrity is fine, and you can already use tests to check that referential integrity holds. And of course, if you could define PK/FK here, it would help reduce the manual tests you need to write.
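For reference, this is roughly the kind of manual referential-integrity test (dbt's built-in relationships test) that such PK/FK definitions could replace; the model and column names are just examples:

```yaml
# models/marts/core/schema.yml
version: 2

models:
  - name: fct_order
    columns:
      - name: user_id
        tests:
          - not_null
          # every fct_order.user_id must exist in dim_user.user_id
          - relationships:
              to: ref('dim_user')
              field: user_id
```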
To be a bit more concrete, I think these relationship definitions would live in the .yml file that sits in the folder of each layer (take again the stg/int/dim/fact structure as an example). In each of these YML files you would describe the relationships, which would serve the following functions:
- dbt Metrics could take the relationship definitions from the .yml file in order to compile and build the required joins (for example, to resolve a metric's dimensions that live on joined tables).
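Purely as an illustration of what such a definition could look like (this is not real dbt syntax; the primary_key and foreign_key keys are hypothetical):

```yaml
# models/marts/core/core.yml (hypothetical syntax)
version: 2

models:
  - name: fct_order
    columns:
      - name: order_id
        primary_key: true          # hypothetical key, not a real dbt property
      - name: user_id
        foreign_key:               # hypothetical key, not a real dbt property
          to: ref('dim_user')
          field: user_id
```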
I'm way overdue for responding to this issue. Thanks for opening, and for the great comments!
There are a lot of really neat ideas in here, and I don't have too much to add myself. I'd just add that these two things feel quite different: joins as a compilation convenience within the DAG, and joins as metadata that downstream tooling can consume.

When it comes to metrics, it does feel important that dbt gains an intrinsic understanding of how "final" models join together. This has overlap with other thinking about exposures (#3533), ERDs (https://github.com/dbt-labs/dbt-docs/issues/84), and of course the future of metrics, which don't support joins today.

We're not in a place where we'll be ready to write code for any of this very soon. I think this will be living rent-stabilized in my mind for the next year+. For now, I'm going to convert this to a real discussion, and invite more folks to join in :)
Is there an existing feature request for this?
Describe the Feature
When deriving core dim/fact models, it's often the case that multiple intermediate tables are created to derive specific feature columns of the final table. For example, stg_users may have int_user_feature_1 and int_user_feature_2, which join together with stg_user to produce dim_user.

It is very tempting to define a unifying model like dim_user earlier in the DAG, i.e. as int_user, so that further feature work can benefit from a one-stop shop of columns instead of having to bring them in "a la carte". Accessory unifying models are a bad practice, however: they create a bottleneck in the DAG, which slows run time; they increase the risk of cyclical references by creating many unnecessary dependencies; and they cannot build on themselves. All of this leads to un-unifying efforts down the line.

The best practice is for a dbt developer working on int_user_feature_3 (which references features 1 and 2) to copy-paste the joins contained in dim_user rather than referencing it, in order to keep the DAG nice and orderly.

It would be convenient if dbt could be told of the joins in dim_user, and compile them as needed to serve references earlier in the DAG without creating an actual node. I'll call this abstraction abs_user. Looker users will note that this is how Looker Explores work: a predefined pattern of joins presents a set of selectable things to a user, and SQL is compiled based on what they select to produce the minimum query required for the output. In this case I'm thinking smaller than a Looker Explore, which can have weird granularity interactions; I just mean a table like dim_user with multiple sources of additional columns at the same granularity.

This solves the two problems of a real unifying model node in the DAG. First, abs_user would not be a bottleneck, because it is not a node that must itself be run before the models that reference it; it would actually automate the known best practices for referencing pre-defined features and maximizing parallelization. And second, abs_user could build on itself: it could both be referenced in int_user_feature_3 and know that it's possible to join int_user_feature_3 to stg_user. If you tried to do such a self-reference, the compiled model code would contain a self-reference (or detectable cycle) and fail.
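To make this concrete, here is a purely hypothetical sketch (abs_user is not a real dbt node type, and the schema and column names are made up): a developer writing int_user_feature_3 might select from the abstraction, and dbt, knowing the join pattern behind abs_user and the columns actually used, could compile that reference down to the minimal subquery:

```sql
-- What the developer writes in int_user_feature_3 (hypothetical):
--   select user_id, feature_1 is not null as has_feature_1
--   from {{ ref('abs_user') }}
--
-- What dbt could compile it to, joining only what feature_1 requires:
select user_id, feature_1 is not null as has_feature_1
from (
    select
        stg_user.user_id,
        int_user_feature_1.feature_1
    from analytics.stg_user
    left join analytics.int_user_feature_1 using (user_id)
    -- int_user_feature_2 is omitted because none of its columns were selected
) as abs_user
```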
Further Thoughts

It would likely be that dim_user is the DAG node produced by SELECT * FROM {{ ref('abs_user') }}.

Someone at dbt knows much more than me about the challenge of parsing SQL queries to recognize the source join of selected columns. It sounds like this is one of the benefits of Malloy, which is said to "Produce rich metadata [including] the originating column...". While it would be cool if the whole 'macro' were written in SQL and dbt parsed it, I could see a YAML-based version more like LookML for Explores being great too.
The task of defining minimum-ingredient CTEs for a new model is nicely captured by dbt's "import" style of refs. However, that style just tells developers where to write the code rather than doing it for them, even though the task is generally very well defined.
If I knew more about other web development I could probably name an example of something that imports only the parts of a library actually used by a file.
Describe alternatives you've considered
A macro could be written today that takes a list of columns and, through basic hardcoded branching logic, determines a minimal subquery to access those columns.
dbt's "import" approach to refs defines a clear space in the file for developers to do this task, but it doesn't help them do it. The suggested feature could be used inline, and generate a subquery, or it could be used within a "import" CTE to produce more readable compiled code.
Who will this benefit?
Analytics Engineers
If joins can be considered business logic, then this feature would DRY up considerable business logic in the repo. It would also make doing right by the DAG and writing convenient SQL the same thing, which is rare!
Analysts Doing Analytics Engineering Work
The DAG can be an unintuitive concern for analysts used to writing analytical SQL. This feature lets them write SQL as they would against dim/fact models, without concerning themselves with the underlying lineage and future cycle risks.
Refactor-ers
Refactoring a dbt repo with poorly structured references can be a lot of work. Working through 10 models that use dim_user, trying to peel back which columns they actually need and whether they create cycles, could be very challenging. This feature would allow that step to be carried out once in a central non-model node and distributed as a reference.

Are you interested in contributing this feature?
No
Anything else?
stg/int/dim/fact: https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355
"Imports" Style: https://discourse.getdbt.com/t/why-the-fishtown-sql-style-guide-uses-so-many-ctes/1091