dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.92k stars 1.63k forks source link

[CT-2599] [Feature] childrens_parents depth #7703

Open 0xRobin opened 1 year ago

0xRobin commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

Currently the @ graph operator does not allow for depth control in any way and will always include all children of a model and their (the childrens) parents. I would like to be able to specify the depth of the child selection in the same way the children/parent operators allow for depth specification.

example: dbt ls -s 1@my_model would result in the selection of the models children up to depth 1 + all the parents of those children. The same should be possible with a YAML selector definition

definition:
  method: fqn
  value: my_model
  childrens_parents: true
  children_depth: 1

Describe alternatives you've considered

I tried finding a workaround with complex yaml selector definitions but since we can't "layer" definitions (take all the 1 depth children of this model + then take all the 1 depth parents of all the models in the selection) I could not get it to work as I wanted.

Who will this benefit?

For our specific use case we want to run all seeds that are used in the tests of a specified model.

Currently we can achieve this with the @ operator and a resource-type filter but this also includes any seeds that are used in tests of any of the child models of the specified model, which are not necessary for us and increases run times.

Are you interested in contributing this feature?

sure, I haven't looked at the codebase structure but this feels like a good first issue.

Anything else?

Some extra context on where/why this would help us: https://github.com/duneanalytics/spellbook/pull/3408#issuecomment-1562445109

whisperstream commented 1 year ago

We also ran into this one for exactly the same use case. In our CICD we build all parents of changed_model and one level of child models. This gives us a good indicator if anything has broken downstream without having to build all children which, when you're using snowflake and have 1500+ models can make your CICD pipeline both slow and expensive.

The workaround we'll likely employ is to introduce some code that queries graph.pickle or uses dbt list -s changed_model+1 first in order to get the child models and then using that data, run dbt like this:

i.e.

dbt build -s +changed_model+1 +child_1 +child_2 ... +child_n

So not impossible to workaround but ugly and may have problems when the command line args get very big

Only suggestion I would have is to maybe use syntax @changed_model+1 for consistency.

jtcohen6 commented 1 year ago

@0xRobin @whisperstream Thanks for the idea, and for weighing in!

I feel like getting the syntax right will be the trickiest bit by far. Should 1@changed_model mean:

To @whisperstream's point, you can do this with a series of list commands, and more ergonomically if you're wrapping dbt-core's Python entry point (new in v1.5).

Although in CI, there should be no need to build all parents of changed_model if you're using --defer or clone (which will be new in v1.6!).

0xRobin commented 1 year ago

@jtcohen6 thanks for the additional info.

From my understanding @changed_model currently works as:

Which could also be expressed as +(changed_model+) as some kind of nested graph selection. Ideally we could control both the children and the parent depth. I think it's important here to follow the convention of having parent depth before the model and child depth after the model..

usage proposal:

whisperstream commented 1 year ago

I think @0xRobin proposed syntax seems straight-forward and intuitive.

foundinblank commented 1 year ago

One thing I've never actually been sure about is the current implementation of @changed_model - "the parents of the children" - does "the parents" mean only first-order parents or all (great)(grand)parents? This could be clarified with a small change to the documentation.

jtcohen6 commented 1 year ago

I think it's important here to follow the convention of having parent depth before the model and child depth after the model..

Agree!

The order of operations for resolving the @ operator: first we select the children, and second we select their parents. So it is a bit like reading from right to left.

I'm on board with the syntax that @0xRobin is proposing. And to @foundinblank's point, there's an opportunity to freshen up our docs on @. I'll tag this one a good first issue.


Likely starting point: Add parents_depth and children_depth arguments to select_childrens_parents, and pass them in accordingly.

https://github.com/dbt-labs/dbt-core/blob/b78d23f68d6e9124b74c07e0b101bed00c4171ed/core/dbt/graph/graph.py#L55-L57

https://github.com/dbt-labs/dbt-core/blob/b78d23f68d6e9124b74c07e0b101bed00c4171ed/core/dbt/graph/selector.py#L101-L120

And add some more tests:

graciegoheen commented 1 year ago

@0xRobin Hey! Just reading this part of the issue:

Currently we can achieve this with the @ operator and a resource-type filter but this also includes any seeds that are used in tests of any of the child models of the specified model, which are not necessary for us and increases run times.

Is this by any chance related to using one of the unit testing packages? I ask because we're working on building unit testing as native functionality in dbt-core. Would love to hear your thoughts if so, here's the discussion!

0xRobin commented 1 year ago

Is this by any chance related to using one of the unit testing packages?

Hi @graciegoheen, thanks for pointing out that interesting discussion. We're not using any external unit testing package. The tests we are running are a combination of seeds and a custom generic test to perform a seed equality check. Our problem arises from our setup where the seeds aren't materialized in production and are only materialized in our slim CI when they are modified. This results in the seeds not being accessible in the slim CI when you later modify the model but not the seed.