dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

Review UX for --defer #2968

Closed - jtcohen6 closed this issue 2 years ago

jtcohen6 commented 3 years ago

We're making a subtle-yet-significant change to deferral in v0.19 (#2946, #2954), which is already a complex and under-appreciated feature. It's well and good to document those changes, but there are also things we can do within this codebase to make the feature more intuitive for users.

Naming

Should we still call this defer?

Here's what we're trying to get across:

What's good about defer?

What's bad about defer?

Alternative metaphors:

Any of those do anything for anyone?

Developer experience

How can we make it clear to users which models/resources have been deferred, and which haven't?

@jtcohen6: We currently have a debug log that lists the number of resources being deferred, and a sample of up to 5 (though the current wording is Merged {x} items from state). We could log that to stdout instead. We could also take it one step further, and determine which deferred resources are actually relevant to (upstream of) the models or tests being run. That would take some extra work, but it feels worthwhile:

$ dbt run -m model_a model_b
Running with dbt=0.19.0
Found 7 models, 4 tests, 1 snapshot, 0 analyses, 138 macros, 0 operations, 4 seed files, 1 source

DEFERRED 1 upstream relation: analytics.model_a

21:52:09 | Concurrency: 1 threads (target='dev')
21:52:09 |
21:52:09 | 1 of 1 START view model dbt_jcohen.model_b........................... [RUN]
21:52:09 | 1 of 1 OK created view model dbt_jcohen.model_b...................... [CREATE VIEW in 0.06s]
21:52:09 |
21:52:09 | Finished running 1 view model in 0.32s.
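
For what it's worth, here is a minimal sketch of that "relevant to the selection" check, assuming nothing more than the parent_map structure dbt already writes to manifest.json. This is an illustration of the idea, not the actual implementation.

# Minimal sketch (not dbt's actual implementation): given the manifest's
# parent_map and the set of selected node IDs, find which deferred nodes
# are actually upstream of the selection, so only those get reported.

def upstream_deferred(parent_map, selected, deferred):
    """parent_map: {node_id: [parent_node_id, ...]} as in manifest.json;
    selected/deferred: sets of node unique_ids."""
    relevant = set()
    to_visit = list(selected)
    seen = set()
    while to_visit:
        node = to_visit.pop()
        if node in seen:
            continue
        seen.add(node)
        for parent in parent_map.get(node, []):
            if parent in deferred:
                relevant.add(parent)
            to_visit.append(parent)
    return relevant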

@drewbanin: I'm almost picturing that those nodes show up in the stdout logs as though we were running them, but they have a status like DEFERRED or UPSTREAM:

$ dbt run -m model_a model_b
Running with dbt=0.19.0
Found 7 models, 4 tests, 1 snapshot, 0 analyses, 138 macros, 0 operations, 4 seed files, 1 source

21:52:09 | Concurrency: 1 threads (target='dev')
21:52:09 |
21:52:09 | 1 of 2 SKIP relation dbt_jcohen.model_a.............................. [DEFERRED]
21:52:09 | 2 of 2 START view model dbt_jcohen.model_b........................... [RUN]
21:52:09 | 2 of 2 OK created view model dbt_jcohen.model_b...................... [CREATE VIEW in 0.06s]
21:52:09 |
21:52:09 | Finished running 1 view model in 0.32s.

We do store deferred as a node attribute in the manifest. Is there value in surfacing that information in the docs site? It's already sort of there, implicit in each model's compiled SQL, since deferred nodes will have rendered their references into a different namespace.
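
For illustration, a rough sketch of pulling that information out of a compiled manifest, assuming (per the above) that each node in target/manifest.json carries a deferred attribute; exact field names may vary by version.

# Rough sketch: list which nodes in a compiled manifest were deferred.
# Assumes each node in target/manifest.json has a boolean "deferred"
# attribute, as described above; field names may differ by dbt version.
import json

with open("target/manifest.json") as f:
    manifest = json.load(f)

for unique_id, node in manifest["nodes"].items():
    if node.get("deferred"):
        print(f"{unique_id} -> {node.get('schema')}.{node.get('name')}")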

gshank commented 3 years ago

I like putting it in the standard out logs. None of the other terms proposed for it grabbed me. Most of them felt slightly more awkward and not more intuitively obvious. But I'll think about it.

jtcohen6 commented 3 years ago

UX

Naming

After some internal conversation:

Imagine, for instance (thanks @drewbanin):

dbt run --unbuilt-upstream=<MODE>, where <MODE> is one of:

Nothing final about --unbuilt-upstream as a name, though I agree that it does the right thing by optimizing for clarity.

IMO we should change the name at the same time we expand functionality. We can do that with backwards compatibility (i.e. --defer would map to --unbuilt-upstream=rewrite_refs, and DBT_DEFER_TO_STATE to DBT_UNBUILT_UPSTREAM=rewrite_refs), so it wouldn't be a breaking change.
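
To make that mapping concrete, here is a hypothetical sketch of the backwards-compatibility shim using plain argparse rather than dbt's real argument parser. --unbuilt-upstream is only the proposed name, and the mode names other than rewrite_refs are placeholders (the real mode list is still open above), so none of this is an existing flag.

# Illustration only: --unbuilt-upstream is a proposed name, not a real dbt flag,
# and "error"/"skip" are placeholder modes. Shows how a deprecated boolean flag
# and env var could map onto a new mode flag without breaking old invocations.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--defer", action="store_true", dest="legacy_defer")
parser.add_argument("--unbuilt-upstream", dest="unbuilt_upstream",
                    choices=["rewrite_refs", "error", "skip"], default=None)
args = parser.parse_args()

mode = args.unbuilt_upstream
if mode is None and (args.legacy_defer or os.getenv("DBT_DEFER_TO_STATE") == "true"):
    # Old-style flag or env var (simplified truthiness check):
    # behave exactly as --unbuilt-upstream=rewrite_refs
    mode = "rewrite_refs"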

tl;dr

All of the above is worthwhile, but none of it needs to happen before v0.19.0-rc1, so I'm going to remove this issue from the Kiyoshi Kuromiya milestone.

clausherther commented 3 years ago

Just my 2 cents on naming: I'm working on an in-house dbt deployment tool that basically wraps dbt run and dbt test with a bunch of options to make build and deployment workflows a bit easier. I'm wrapping the idea of using a different upstream source in a single --upstream-source or -u parameter: you just supply the name of the target you want to use as your upstream source, and we handle manifest generation (if needed) and the --defer and --state syntax for you.
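
For context, a hypothetical sketch of what such a wrapper can boil down to (not the actual tool): a single upstream-target argument gets translated into dbt's --defer/--state invocation, assuming the named target's manifest has already been fetched to a local artifacts/<target> directory.

# Hypothetical wrapper in the spirit described above (not the actual tool):
# translate an upstream-target argument into dbt's --defer/--state syntax,
# assuming that target's manifest already lives under artifacts/<target>/.
import subprocess
import sys

def run_with_upstream(upstream_target, select):
    state_dir = f"artifacts/{upstream_target}"  # assumed manifest location
    cmd = [
        "dbt", "run",
        "--models", *select,
        "--defer",
        "--state", state_dir,
    ]
    return subprocess.run(cmd, check=True)

if __name__ == "__main__":
    # e.g. python wrap_dbt.py prod model_a model_b
    run_with_upstream(sys.argv[1], sys.argv[2:])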

panasenco commented 3 years ago

I like "defer" just fine personally.

ucg8j commented 2 years ago

Hello - this is a great issue thread; I love the engagement with the community on naming opinions. Also, @jtcohen6, "a pleasure deferred" got a chuckle from me - and a helpful reminder that I need to read more Dostoevsky, though I keep putting that off. A pleasure deferred!

On naming

On our Go implementation of the dbt CLI, we have the following:

--upstream=prod or -u prod

We also built an in-house version of this to extend the dbt CLI before the defer feature existed, referred to in this blog. When we invoke a model run, for example, we can read from 'prod' data by using dbt monzo upstream prod -m modelA. 'prod' can be any target, so you could point it to another developer's dev area if you were collaborating on different PRs. Theo, who built this a few years back, posted more details on how it works on the dbt community page.

Here's the CLI printout from our upstream option:

👷‍♀️  These models will be run in your dev dataset:

    -  modelC

🔄  These models will be substituted with data from the target 'prod'

    -  modelA
    -  modelB

In short - my vote would be for --upstream, with a nice short -u option too.

On behaviour

I wonder whether this should be the default behaviour, or at least configurable to be the default at the target level.

When I was owning pipelines, I found myself using upstream=prod most of the time. And at a previous data platform company I worked at, the default behaviour was to read from production data, as data developers in most cases want to know what their pipeline modifications will look like in production.

ran-eh commented 2 years ago

This may be a bit of a leap, but bear with me. I keep hoping for dbt to adopt the Make metaphor, and this may be the opportunity to start down that path. I expect this was considered in the past, but let's revisit this Pandora's box.

Forget defer for a moment: imagine a dev cycle where you change the source file for a bunch of models/tests/etc, and then dbt make identifies and runs only the models/tests you changed, and possibly their downstream dependents. Rinse, repeat, until you are ready to commit your feature.

Make opens the door to a ton of useful features: you configure a model to be considered out-of-date if it has not run for 12 hours, and dbt make only runs out-of-date models and their dependents. Now, instead of having a daily job and an hourly job and a job that runs every other Wednesday at 3:22 AM, you configure the models themselves for how frequently they need to be updated, and a single dbt make job takes care of everything for you.

In fact, if you properly define out-of-dateness for sources, you may never need to run a model that is actually current, just because it happens to be an upstream dependency of the model you need.
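
To sketch how that out-of-date check might look (purely illustrative, and assuming last-run timestamps are tracked somewhere, which is exactly the state question raised below): a per-model max age plus the dependency graph's child map is enough to decide what needs rebuilding. The max_age_hours config and child_map structure here are assumptions, not existing dbt features.

# Illustrative sketch of the out-of-date idea; max_age_hours and child_map
# are assumed inputs, not existing dbt config. Timestamps are assumed to be
# timezone-aware UTC datetimes.
from datetime import datetime, timedelta, timezone

def out_of_date(last_run, max_age_hours, child_map, now=None):
    """last_run: {model: datetime}; max_age_hours: {model: float};
    child_map: {model: [downstream_model, ...]}."""
    now = now or datetime.now(timezone.utc)
    stale = {
        m for m, ts in last_run.items()
        if now - ts > timedelta(hours=max_age_hours.get(m, 24))
    }
    # A stale model also invalidates everything downstream of it.
    to_visit = list(stale)
    while to_visit:
        m = to_visit.pop()
        for child in child_map.get(m, []):
            if child not in stale:
                stale.add(child)
                to_visit.append(child)
    return stale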

Defer is a natural extension of this: you define your baseline and development profiles in profiles.yml and do

dbt make --baseline-profile=my-prod --target-profile=my-test

This obviously brings up a ton of questions about how to maintain state, but we may be at the right time to break the taboo on out-of-database state, and go into the weeds on this one. Once this is figured out you have a much more natural way to think about data dependencies.

A difficulty in naming a feature sometimes signals that a rethink of underlying conceptual frameworks is due, and this may be the case here. Thoughts? @drewbanin ?

P.S. Under the current framework, I add my vote for --upstream replacing --defer.