dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
10.01k stars 1.63k forks source link

[CT-469] Combine --select/--exclude with --selector when used together? #5009

Open solomonshorser opened 2 years ago

solomonshorser commented 2 years ago

Is there an existing issue for this?

Current Behavior

I try to test my models like this:

% dbt test --selector staging --exclude stg_client

However, tests that apply to stg_client (a staging model) are still executed.

Expected Behavior

I expected the tests that relate to the model stg_client to be skipped, since I specified --exclude stg_client.

Steps To Reproduce

  1. Create a YAML selector that includes several models with tests.
  2. Try to test with the selector, but use --exclude to exclude a single model from testing.

Relevant log output

No response

Environment

- OS: Mac OS 11.6.5
- Python: 3.9.10
- dbt: 1.0.3

What database are you using dbt with?

bigquery

Additional Context

No response

jtcohen6 commented 2 years ago

Hey @solomonshorser, the short answer is that --selector takes precedence, and --exclude is ignored.

This has come up before: https://github.com/dbt-labs/docs.getdbt.com/issues/803 (originally transferred from this repo). The solution proposed then was to update our docs. It's there now, but tucked away in a note at the very bottom of https://docs.getdbt.com/reference/node-selection/syntax:

Note that when you're using --selector, most other flags (namely --select and --exclude) will be ignored.

This could definitely be clearer! Do you think the right answer looks like:

solomonshorser commented 2 years ago

@jtcohen6 I had only read the YAML Selectors page, so some clarification there would help. A warning message might be nice too, it would have saved me some frustration. Or maybe instead of a warning, an actual hard-stop error message: "Incompatible selection criteria were given..." or something like that.

Back to my original problem: I have a selector (not a simple one) that selects a number of models. I'd like to test everything covered by that selector, but there is one particular model that has some known testing issues. I'd like to be able to test all of the models except that problematic model (until other parties are able to either confirm certain changes to the test or the underlying data). Is there any way to do this besides modifying the YAML selector itself? This exclusion is not meant to be permanent, just something I want to do on an ad-hoc basis.

jtcohen6 commented 2 years ago

Makes total sense. https://github.com/dbt-labs/dbt-core/pull/4827 proposes to make possible the inheritance of one selector by another. That alone should provide for a more ergonomic approach to your situation.

A bigger question, one I'm not sure I feel ready to answer: Should passing --selector alongside --select/--exclude have the effect of inheriting the selector, and then applying the selection/exclusion logic? What about passing --select/--exclude at all when there's a selector defined with default: true? Historically, our approach has been:

solomonshorser commented 2 years ago

Should passing --selector alongside --select/--exclude have the effect of inheriting the selector, and then applying the selection/exclusion logic?

That would make sense to me, though I don't know if "inherit" is the word? I thought of it as the set of whatever was selected by --selector unioned with whatever is selected by --select then subtract whatever is excluded by --exclude.

Something like: dbt test --selector staging_models --select extra_models --exclude problematic_models I thought would have an effect like running on the set: staging_models UNION extra_models MINUS problematic_models

ghost commented 2 years ago

Just encountered a case where this would be very helpful functionality. Generally we'll want a standard selector, but being able to add an additional intersection, union, or exclude would be great so that we can tweak a selector for a particular execution dynamically.

stumelius commented 2 years ago

I also have a use case where I'd like to use selectors in both --select and --exclude arguments. Example where I run all the marketing models but exclude any ecommerce models in the marketing DAG:

dbt run --select selector:marketing --exclude selector:ecommerce
ghost commented 1 year ago

I think this would be a very nice behaviour to add indeed. Another interesting use case would be to build only models from a selector that passed tests on the previous run: dbt build --selector <my_selector> --exclude 1+result:fail --state ./target.

jtcohen6 commented 1 year ago

I've come around on this. Here's how I think it should work:

I'm going to mark this as help_wanted for any community member who'd be interested in working on it :)

acurtis-evi commented 1 year ago

I think it would be good if the following could be done

  1. Allow selectors to be part of select/exclude logic. This includes making selectors respect + for children and parents as well as other operators
  2. Allow a way to override one or more selectors in yml from command line. --select:selector_name, --exclude:selector_name
  3. Allow command line syntax to work in yml.

I believe this would be backwards compatible but also mostly future complete as in it provides intuitive ways to accomplish whatever selection is desired through a combination of predefined and command line options.

If the command line syntax worked identically in the yml, then it could also simplify the documentation and things would just work the way the end user expects them to work.

acurtis-evi commented 1 year ago

Also, happy to help on this. Curious if the select/exclude is represented in the same way internally as each selector?

acurtis-evi commented 1 year ago

https://github.com/dbt-labs/dbt-core/pull/7101

Starting to work out how this could be done in the PR above, have only partially tested

acurtis-evi commented 1 year ago

I believe that this is ready for review through #7101

acurtis-evi commented 1 year ago

I did things differently than what @jtcohen6 suggested.

Basically, --select / --exclude create a new selector named "arg_selector" which can be referenced in selectors.yml. The select / exclude syntax also supports selector:.

Realize it is different, but I think the change in the PR basically makes the change seem seemless.

Also, unified the syntax between the yaml selectors and the command line (introducing EXCEPT).

dbt --select "selector:selector_one EXCEPT some_model+"

is the same as

dbt --select selector:selector_one  --exclude some_model+

In addition, the selector yaml syntax allows selector:... and the union/intersect,EXCEPT logic.

definition: selector:selector_one EXCEPT some_model+
stumelius commented 1 year ago

@acurtis-evi This would work for the use case I described. Did I understand correctly that both

would work and are interchangeable after the change?

I also have a use case where I'd like to use selectors in both --select and --exclude arguments. Example where I run all the marketing models but exclude any ecommerce models in the marketing DAG:

dbt run --select selector:marketing --exclude selector:ecommerce
acurtis-evi commented 1 year ago

Yes, both options would work. In addition to this, the yaml selector syntax is unified with the command line syntax. The old yaml syntax is supported, but the plain description can simply be a complex string with parent, child relations, unions, intersections, and EXCEPT.

selectors:
name: some_selector
description: selector:marketing EXCEPT selector:ecommerce

and then you can do the dbt run using

dbt run --selector some_selector

I also introduced an arg_selector

dbt run --select selector:marketing --exclude selector:ecommerce

AND

dbt run --selector arg_selector --select selector:marketing --exclude selector:ecommerce

are identical and the arg_selector can be used in the yaml selectors like this

selectors:
name: some_selected_selector
description: selector:some_selector,selector:arg_selector
name: some_selector
description: selector:marketing EXCEPT selector:ecommerce

Now you can use that selector like

dbt run --selector some_selected_selector --select some_models+

and it will intersect selector:marketing EXCEPT selector:ecommerce and some_models+

The selector:some_selector allows for modifiers as well such as +1

dbt run --select selector:marketing+1 --exclude selector:ecommerce
ahrussell commented 1 year ago

@acurtis-evi thanks for the PR! This feature would definitely be helpful for a use-case that we have. Is there any new info on this issue?

Battiloni commented 8 months ago
# ENV
dbt-snowflake==1.7.1

Found this issue while looking for more info about the selector argument - here's an interesting edge case I have below:

Finally, when I'm running the following commands I don't have the same result and behaviour I would have excpected

dbt ls --select models/marts

Expected result from both above commands: returns only marts models from $my_project_name

I also think a combination of the arguments --select/--exclude with--selector should be implemented, there is a high risk users are not using this correctly and can wrongly select nodes from their project and installed packages

djbelknapaw commented 7 months ago

I'd love to see the ability to combine selectors + command-line selection. My use case is that I'm trying to define super-sets of models I want to run so I have a consistent definition - for example, "things I want to run hourly". Then I want to intersect that with a selector per job, for example, "data_product_x and its parents and children".

Defining additional selectors for each data product feels like overkill to me. Ideally I could run this with something like:

dbt ls --select selector:hourly,+data_product_x+

Any chance we might see at least the selector: method added? This issue's been around for a while as has the PR.

mroy-seedbox commented 7 months ago

@djbelknapaw: the way we're handling this right now is via YAML anchors. It's not as user friendly, but at least it works well for now.

Example:

  - name: data_product_x
    definition:
      union: &data_product_x
        - 'model1'
        - 'model2'
        - '...'
  - name: data_product_x_hourly
    definition:
      intersection:
        - union: *data_product_x
        - tag: hourly
indy-jonesy commented 7 months ago

This was a bit of surprise to find as an issue today. Would really appreciate if this could be resolved. We are using anchors as suggested, but its not clean. It adds overhead.

Further, it'd be great if these all didn't have to be in a single selectors.yml file, and could be broken out into separate files like schema.yml files can for maintainability.

mroy-seedbox commented 7 months ago

Agreed, multiple files would be great! (although that's probably a separate feature request)

Our current selector.yml is over 1000 lines....

It should be fairly simple to merge the selectors from multiple files into one single list in python (as long as all selectors are unique, otherwise raise an exception). 🤷‍♂️

mroy-seedbox commented 2 weeks ago

I think the most explicit and powerful way to blend CLI selection with YAML selectors would be the introduction of a new selector method in order to explicitly select a selector (😮‍💨).

For example, instead of dbt test --selector staging --exclude stg_client (from the OP), the syntax could be dbt test -s selector:staging --exclude stg_client, which would make the intention very explicit (i.e. select everything from the selector, and exclude stg_client).

Bonus: this would also support the entire power of CLI selection, such as unions & intersections (for example: dbt test --selector:staging,tag:critical selector:staging,tag:important, or even better: dbt test --selector:staging,selector:critical_and_important), and it would thereby address #10991 & #10596 (and probably others).

In fact, selector inheritance already exists (see also #1628, #4821, and #4827), which defines a selector method. But unfortunately, this method is not available on the command line (only in selectors.yml). 😞

Actually, this felt significant enough that I created a new feature request: #10992 (also since this one is pretty old and seems dead in the water).