dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

Option to disable "skip downstream models on failure" #2142

Open ian-whitestone opened 4 years ago

ian-whitestone commented 4 years ago

Describe the feature

When a model run fails, all dependent, downstream model runs are skipped. This is a great default, but there can be some cases where you still want to run the downstream model.

For example, one case where I would actually want this default behaviour overridden is for jobs that rely on an infrequently updated dimension, something like a country dimension or currency dimension that changes very rarely. Say you have a daily summary/rollup model that uses order transaction data and a country dimension. If the country dimension starts failing, I'd still want my model to keep running off the new order transaction data, rather than having to wait until the responsible team fixes it.

Describe alternatives you've considered

Individually running models and keeping track of which ones ran successfully and which ones failed. This would be quite time consuming.

Who will this benefit?

Can't comment on how many people this will benefit, but the added flexibility is always a plus.

drewbanin commented 4 years ago

Thanks @ian-whitestone - the example use case you described is a really good one!

I think we can make this a model config. I'm picturing a config key called on_error with values like continue, skip_children, and abort.

The values map onto the following behavior:

- continue: if this model fails, continue running its downstream children anyway
- skip_children: if this model fails, skip its downstream children (the current default)
- abort: if this model fails, skip all models that have not yet run

You buy this? I don't know that we want/need to include the abort parameter, but I thought it was worth discussing :)

ian-whitestone commented 4 years ago

👍 this makes sense to me. The only alternative I can think of is having this specified at the downstream (dependent) model level, i.e. the same values but with different meanings:

- continue: run this model even if one of its parents fails
- skip: skip this model if any of its parents fail (the current default)
- abort: if any of this model's parents fail, skip all models that have not yet run

The advantage of this approach would be that it gives individual analysts/modellers the ability to specify how they want to handle errors for their specific model without dictating the response for everyone else.

bashyroger commented 4 years ago

What is the difference between skip_children and abort?

Does abort do a ROLLBACK of the full tree (so, parents too)?

drewbanin commented 4 years ago

@bashyroger I'm not picturing abort doing any sort of rollback. I was instead thinking that it would compel dbt to skip all models that have not yet run, regardless of their relationship to the failed model. It's sort of a solution in search of a problem, though, and we probably shouldn't implement it until we have a better handle on some good use cases :)

github-actions[bot] commented 2 years ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

cmdugan13 commented 2 years ago

@drewbanin I've been following this, since the use case that @ian-whitestone described would be useful to me as well. Curious if there has been any more work in this area. Thanks!

smitsrr commented 2 years ago

I would love this feature :-)

uttam17 commented 2 years ago

This would be very helpful

nick-heron-zip commented 2 years ago

+1 would be very helpful

jtcohen6 commented 2 years ago

Reopening this issue given the renewed interest!

Some of this is now possible with metadata, namely the result status selector, which can select skipped models to rerun (docs). Think of it as picking all nodes where both are true: "skipped" because of an upstream failure, and tagged to rerun_despite_upstream_failure.

dbt run --select result:skipped,rerun_despite_upstream_failure --state previous/run_results.json
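For anyone scripting that two-pass flow, here's a minimal sketch using the programmatic dbtRunner entry point (available in dbt-core 1.5+). Assumptions: rerun_despite_upstream_failure is a project-defined tag, and previous/ is a directory where the first pass's run_results.json has been copied:

```python
from dbt.cli.main import dbtRunner  # programmatic CLI entry point, dbt-core >= 1.5

dbt = dbtRunner()

# First pass: a failing parent causes its downstream models to be skipped.
dbt.invoke(["run"])

# (Copy target/run_results.json into previous/ between passes.)

# Second pass: rerun only nodes that were skipped last time AND carry the
# project-defined tag opting them in to running despite upstream failures.
dbt.invoke([
    "run",
    "--select", "result:skipped,tag:rerun_despite_upstream_failure",
    "--state", "previous/",
])
```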

I'd say I'm open to the possibility of conditionally disabling "skip downstream models on failure." It's a core construct of dbt-core's execution model, but I don't know that it's an absolutely essential component of the opinionated workflow, or of DAGs more generally.

Where to configure?

I think it feels appropriate to support this config on either the parent node ("I'm worth stopping the DAG for"), or the child node ("I don't care about upstream failures"). By picking one OR the other, it's much simpler to reason about, and much simpler to implement.

I'm hesitant to support defining this configuration for the full matrix of parent-child relationships, as I think that can get unwieldy very quickly. I think it might be possible to deploy ephemeral "middle models" that would be "passthroughs" for the upstream model's data, while allowing you to customize this behavior. Similar to the discussion about dbt build and test failures in https://github.com/dbt-labs/dbt-core/issues/4591. Unlike test failures, though, which access actual data, the ephemeral model would need some way to know that the upstream model errored. Query logs...?

How to configure?

There are two ways we could support this:

- A run-level flag (e.g. --no-skip-on-failures) that applies to every model in the invocation
- A model-level config that applies only to specific models

If we decide to support both, I'd see this being similar to how the --full-refresh flag / full_refresh model config work today (docs), where the model-level config (if set) takes precedence over the run-level flag.
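As a rough illustration, that precedence could resolve something like the sketch below. Everything here is hypothetical: the on_error values and flag name are the ones proposed in this thread, not shipped config.

```python
from typing import Optional

def resolve_on_error(model_config: Optional[str], run_flag: Optional[str]) -> str:
    """Hypothetical resolution of the proposed on_error behavior.

    Mirrors how full_refresh works today: the model-level config, if set,
    wins over the run-level flag; otherwise fall back to the flag, and
    finally to the current default of skipping children.
    """
    if model_config is not None:   # e.g. a model configured with on_error: continue
        return model_config
    if run_flag is not None:       # e.g. --no-skip-on-failures => "continue"
        return run_flag
    return "skip_children"         # today's behavior

# The flag asks to continue, but this model explicitly opts back in to skipping:
assert resolve_on_error("skip_children", "continue") == "skip_children"
```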

How to implement?

For parents, that would look like adding a check here for result.node.config.get('on_error') == 'continue', before "marking dependent errors":

https://github.com/dbt-labs/dbt-core/blob/72c17c446440afe7d4b6501ed364effe8e14c9f2/core/dbt/task/runnable.py#L337-L342
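A minimal sketch of what that parent-side check might look like. The surrounding names are paraphrased from runnable.py at that commit, not verbatim, and on_error is still a hypothetical config key:

```python
from dbt.contracts.results import NodeStatus  # status enum used by run results

# Inside the handler that processes a finished node's result:
if result.status in (NodeStatus.Error, NodeStatus.Fail):
    if result.node.config.get("on_error") == "continue":
        # Hypothetical: this parent failed but has opted to let its
        # children run anyway, so don't cascade the error downstream.
        pass
    else:
        self._mark_dependent_errors(result.node.unique_id, result, cause)
```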

For children, that would look like checking whether the "dependent node" has on_upstream_error: continue before marking it to skip:

https://github.com/dbt-labs/dbt-core/blob/72c17c446440afe7d4b6501ed364effe8e14c9f2/core/dbt/task/runnable.py#L399-L400
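And a corresponding sketch for the child side, again with method and attribute names paraphrased from runnable.py and on_upstream_error as the hypothetical key:

```python
# Where dependent nodes are marked to be skipped after a failure:
for dep_node_id in self.graph.get_dependent_nodes(node.unique_id):
    dep_node = self.manifest.nodes.get(dep_node_id)
    if dep_node and dep_node.config.get("on_upstream_error") == "continue":
        # Hypothetical: this child doesn't care about upstream failures,
        # so leave it queued rather than marking it skipped.
        continue
    self._skipped_children[dep_node_id] = cause
```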

Next steps

I think there are still some open questions here—namely, whether parents or children are the right place to put this config. It'd be helpful to get more context on specific use cases, to motivate the choice!

Personally, I lean toward parents. This feels analogous to the "warning" behavior of tests, which say: I found some failures, which is good information to know, but it's not worth stopping the DAG for.

nick-heron-zip commented 2 years ago

Our context:

In our source layer we rename and recast all fields where necessary. One of the Salesforce tables (dimension) we ingest via Fivetran has 400+ columns. Less than 10 of these columns are used downstream in business-critical models. Recently one of the ~390 unused columns was dropped from the source schema, leading to a DAG failure and delayed business data while we fixed the error.

As a temporary solution we are no longer enforcing naming conventions or typecasting columns from Salesforce tables. Long-term we would like to trigger a warning instead of a failure when issues are detected in dimension tables.

jtcohen6 commented 2 years ago

@nick-heron-zip Thanks for sharing the use case!

My instinct would be to include in your staging model only the columns that you are actually using downstream, those which are actually relevant to your "analytical universe." I realize the docs and Discourse don't say that explicitly, but that's always been, for me, a significant consideration when writing a staging model.

I also know that I've occasionally used staging model definitions as a form of documentation / quick look-up to see what's available in the source—but that's why you can document and describe columns in sources, and dbt docs generate produces catalog info (including full column list) for sources, too.

boje commented 2 years ago

Our context:

Our source is JSON, which Snowpipe loads into a table as the column type VARIANT. In some cases, loading data into a VARIANT is recommended by Snowflake, since new schema changes won't break the load of data into Snowflake. When staging this data, the JSON elements are flattened out into a table with a number of columns and data types. If a column was defined as int but is now sent as char, the stage model will break.

Another scenario: when creating a merge statement in Snowflake, the rows are required to be unique; otherwise the error "Duplicate row detected during DML action" is raised.

In both cases dataset X returns an error and the dbt Cloud job stops. Datasets Z, Y, and Q work perfectly, but because X returns an error, all models are blocked, even when there are reports that depend only on the Z and Y datasets.

Hence we are wondering how to avoid the all-or-nothing scenario we have today. One option is to split the job into smaller jobs, but that creates cross-job dependencies, which are not supported in dbt Cloud today.

jens-koster commented 2 years ago

Our use case: some sources are Google Sheets of rarely changing data, edited by humans, and notoriously brittle due to typos. This data is not worth skipping the downstream models for. I would stage this source to a table, which would occasionally fail without affecting downstream loads. On failure, the downstream loads would use the data I materialised in the previous run. That works well enough, and I would catch the problem using freshness alerts on the table of materialised data. I can achieve this today by running the staging of these tables in a separate step before the main run; having this feature would make it a little neater and less work.

I can also see myself using this setting as a quick fix to get the load running again after a failure, just until I can find the root cause and fix it properly. Excluding the model in the dbt run command means messing with the devops/build stuff, risking the entire load. Setting a config on one model is low impact, and I can do it in my normal workflow of editing models.

davidsr2r commented 2 years ago

We're getting pretty desperate for this. There are all kinds of models where we pull in data from a variety of data providers, and one of those providers having out-of-date data (due to, e.g., a schema breakage) doesn't invalidate the rest of the providers' data in the downstream models.

I'm about to have to break up our DAG purely to avoid this breakage, which nullifies a significant part of dbt's value: DAG management.

jtcohen6 commented 2 years ago

@boje @jens-koster @davidsr2r I've marked this issue as help_wanted, meaning that we're not planning to prioritize it ourselves in the next few months, but we'd be happy to have a motivated community member work on contributing it.

Do you have thoughts on the questions I raised in my comment above, around how you'd want to configure this? Would you look for an all-or-nothing runtime config (--no-skip-on-failures), or a config you can specify in project code for specific upstream/parent models?

davidsr2r commented 2 years ago

@jtcohen6

We'd be needing a per-model config, so that the rerun of skipped models is intentional and explicit, and avoids downstream runs that shouldn't happen.

Re: parents vs. children, it probably depends on your use-case preferences. If the config is on parents, you configure it in fewer places, and it is information specific to that model, which makes a lot of sense. If it's on children, however, the query that needs to be aware of the possibility of failure lives in the same file as the config option, making the dependency and the failure case more explicit for the developer. I always prefer maintainability, so, while I could be persuaded otherwise, I'd want us to use the latter option. Why not support both, or just one or the other?

Inviso-LiseBohr commented 1 year ago

@jtcohen6 We have a similar situation where we are waiting for the responsible team to fix an issue causing a stage model to fail. This model contains very rarely changing data, so we can live without having it updated for some time. In the meantime, the rest of our DAG is skipped, and none of our most important models are updated until the issue is fixed.

In our situation it would make sense to have the configuration on the parent model, since a large number of models reference the failing model.

We would appreciate the feature described earlier in this thread: an on_error key with values like skip_children and continue. Is there a good way to upvote the implementation of this?

Btw, our workaround for now is to just temporarily exclude the failing model from our dbt build; the downside is that this is a little manual, and we need to remember to remove the exclusion again when the issue is fixed.
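For reference, that temporary exclusion is just the standard --exclude flag; the model name below is made up. The models downstream of the excluded model still run against its last successfully built version:

```python
from dbt.cli.main import dbtRunner  # dbt-core >= 1.5

# Temporarily leave out the known-broken model; remember to remove the
# exclusion once the upstream issue is fixed.
dbtRunner().invoke(["build", "--exclude", "stg_rarely_changing_dim"])
```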

VDFaller commented 1 year ago

My only worry about putting the config in the child would be this use case: a chain like A -> B -> C, where C is configured to ignore upstream failures.

If A fails, C would only have data from B from the last time that A passed. I think this could lead analysts (or, more likely, the business side) to wrongly trust that the data in C is up to date (even though it will be stale for the B portion), because some loaded_date column is up to date.

pengyusong commented 1 year ago

Our use case: our BI tool uses several wide tables to show metrics. Each wide table is composed of multiple tables, so when one upstream table fails, the wide table fails too, and the impact is significant. I want to be able to ignore some upstream table failures so the wide table can still run successfully, which would give me enough time to fix the abnormal upstream model.

CitationJamespaul commented 11 months ago

We have a large project building our entire data warehouse - there are very few points of failure that would make us want to terminate our entire downstream build.

We would rather have downstream continuation by default and define our points of failure with tests.

I'd be happy with a --no-skip-on-failures option.

maxrostron commented 6 months ago

Would still be very interested in the skip_children, abort, continue proposal, and in being able to specify that at the top of specific models, especially as we use dbt as part of Dagster, so we don't always trigger dbt flows from the command line where a flag like --no-skip-on-failures would apply.

Would be a game changer for our flow. At the moment we have to do a lot of convoluted workarounds.

jtcohen6 commented 5 months ago

I've left a comment on the linked PR:

After rereading the use cases described in the comments above (thank you everyone!), I'm interested in moving forward with this as a model-level config set on "parent" models, whose failures are generally flaky ones caused by changes in unreliable upstream data sources. I don't believe we should support this on "child" models (at all), or as a global flag applied broadly (yet).

mahiki commented 4 weeks ago

I run analytics pipelines for reports, not data engineering pipelines.

If a test fails (especially anomaly-detection tests), I want the whole pipeline to run. I'll get an alert about the anomaly or other failure and deal with it.

Two bad things about current behavior:

  1. I miss downstream errors that are unrelated to the first test failure. Yes, there can be two unrelated failures, and I want to see them both after a single build or run/test. I don't want to fix one thing, rebuild, find a new error, fix, rebuild, and so on, multiple times.
  2. What I cannot have is a stack of empty reports with nothing in them for days while I work out the bug in one small part of the outputs. A test failure in my domain is something the business team would rather talk about than skip review of many blank pages of reports, or skip the business review altogether.

edit: Oh, I would just do a dbt run, and then dbt test in a different process that could handle failure in a customized way. Sorry, I just started using dbt and elementary. Carry on.