dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

[CT-1556] Smarter handling of `--vars` in partial parsing #6323

Open jtcohen6 opened 1 year ago

jtcohen6 commented 1 year ago

Just like https://github.com/dbt-labs/dbt-core/issues/3885, but for CLI --vars.

This would require us to capture, at parse time, which files depend on which --vars, via calls to the Jinja {{ var() }} function. That would also include macros that call var(), and are then called by models / other macros in turn.
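A rough sketch of that bookkeeping, under stated assumptions: the regex, the function names, and the `calls` mapping (file path to the paths of the macro files it calls) are all hypothetical, and this is not how dbt-core actually tracks Jinja calls. It only illustrates propagating `var()` usage from macros up to the files that call them.

```python
import re
from typing import Dict, Set

# Hypothetical pattern: matches var('name') or var("name", default) in raw Jinja.
# It will miss dynamically built names such as var(prefix ~ '_date').
VAR_CALL = re.compile(r"""\bvar\(\s*['"]([^'"]+)['"]""")


def direct_vars(raw_code: str) -> Set[str]:
    """Names of vars referenced directly in one file's raw contents."""
    return set(VAR_CALL.findall(raw_code))


def vars_by_file(files: Dict[str, str], calls: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each file to every var it depends on, directly or via called macros.

    files: path -> raw contents
    calls: path -> paths of the macro files it calls (assumed to be known already)
    """
    deps = {path: direct_vars(raw) for path, raw in files.items()}
    changed = True
    # Repeat until nothing new is added, so macro-calls-macro chains
    # (and cycles) are handled without recursion.
    while changed:
        changed = False
        for path, callees in calls.items():
            for callee in callees:
                missing = deps.get(callee, set()) - deps.setdefault(path, set())
                if missing:
                    deps[path] |= missing
                    changed = True
    return deps
```

A real implementation would presumably hook into Jinja rendering rather than scan text, since a regex cannot see vars whose names are computed at render time.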

For Python models, if we introduce a built-in dbt.var() function, we'd want to do the same. We're already doing something similar for configs, to power config.get() at runtime.

Whenever the --vars change, instead of triggering a full re-parse, we'd schedule just the files that depend on the var for re-parsing. Of course, if the var is used for a configuration within dbt_project.yml, that could still affect many many nodes.
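To make the scheduling step concrete, here is a minimal sketch. It assumes a `deps` mapping from file path to the set of var names that file depends on (however that gets captured), and it treats `dbt_project.yml` as a special case that falls back to re-parsing everything. All function names are hypothetical.

```python
from typing import Any, Dict, Set

_MISSING = object()  # sentinel so "var absent" differs from "var set to None"


def changed_vars(old: Dict[str, Any], new: Dict[str, Any]) -> Set[str]:
    """Names of vars that were added, removed, or given a new value."""
    return {
        name
        for name in set(old) | set(new)
        if old.get(name, _MISSING) != new.get(name, _MISSING)
    }


def files_to_reparse(
    deps: Dict[str, Set[str]],
    old_vars: Dict[str, Any],
    new_vars: Dict[str, Any],
    project_file: str = "dbt_project.yml",
) -> Set[str]:
    """Pick the files to schedule for re-parsing after --vars change."""
    changed = changed_vars(old_vars, new_vars)
    # A var used in the project file can affect many nodes, so be conservative
    # and schedule every tracked file in that case.
    if deps.get(project_file, set()) & changed:
        return set(deps)
    return {path for path, names in deps.items() if names & changed}
```

If nothing relevant changed, the result is an empty set and the saved manifest could be reused as-is.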

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

ChenyuLInx commented 1 year ago

If possible, can we separate this handling from parsing? What I am thinking is that parsing goes from everything in the project files to a representation; then, at actual runtime, we apply configuration to that representation and do the execution.

jtcohen6 commented 1 year ago

@ChenyuLInx Supportive of this line of thinking! The biggest caveat here is that vars can be used to dynamically disable/enable models, or to conditionally affect relationships between models, so it is necessary to resolve some vars during parsing in order to know the shape of the DAG, and to support node selection.

During parsing, we could store pointers to those variables, and then conditionally reevaluate them just before each execution. That feels similar to the approach described in this issue (partial parsing), though with some subtle differences in implementation.
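One way to picture the "pointer" idea, purely as an illustration: the parsed representation keeps a small placeholder object wherever a var was referenced, and the placeholder is resolved against the current `--vars` just before execution. `VarRef` and `resolve_config` are hypothetical names, not existing dbt-core classes or functions.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class VarRef:
    """Placeholder stored at parse time in place of a resolved var value."""

    name: str
    default: Any = None

    def resolve(self, cli_vars: Dict[str, Any]) -> Any:
        # Fall back to the default captured from the var() call.
        return cli_vars.get(self.name, self.default)


def resolve_config(config: Dict[str, Any], cli_vars: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a node config with every VarRef swapped for its value."""
    return {
        key: value.resolve(cli_vars) if isinstance(value, VarRef) else value
        for key, value in config.items()
    }
```

This only covers flat, top-level config values. Vars that change the shape of the DAG (for example `enabled`) would still need to be re-evaluated before node selection, which is the caveat raised above.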

gshank commented 1 year ago

Yeah, there's a difference between vars that are needed at parse time and vars that can be resolved at compilation/execution time. Maybe we need some use cases to help think through the different situations. Vars in configs have to be resolved at parse time. Vars in plain SQL could be delayed. I'm not sure how we could distinguish between them.

jtcohen6 commented 1 year ago

@gshank Do you know if the partial parsing manifest (target/partial_parse.msgpack) contains enough information (raw file contents & unrendered yaml configurations/attributions), such that we could support a re-parse when CLI --vars are supplied, without needing to go back to the actual file system?

I'm thinking:

github-actions[bot] commented 6 months ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

kp-tom-sc commented 1 month ago

We use the CLI --vars to pass in Airflow datetime variables. These change on each run, so we can't use partial parsing. Is there a better way of handling datetime variables? Could we have an ignore list of variable names, so that changes to them don't invalidate partial parsing? Or, similar to the secret env var ignore rule, some var prefix like VAR_NO_PARSE_my_datetime?
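A sketch of what such an ignore rule could look like, assuming the decision to reuse the saved manifest comes down to comparing a fingerprint of the vars between runs. The `VAR_NO_PARSE_` prefix is taken from the suggestion above and is not an existing dbt feature; `vars_fingerprint` and `ignore_names` are hypothetical too.

```python
import hashlib
import json
from typing import AbstractSet, Any, Dict

NO_PARSE_PREFIX = "VAR_NO_PARSE_"  # hypothetical prefix, not a real dbt setting


def vars_fingerprint(
    cli_vars: Dict[str, Any],
    ignore_names: AbstractSet[str] = frozenset(),
    ignore_prefix: str = NO_PARSE_PREFIX,
) -> str:
    """Hash only the vars that should count toward invalidating partial parsing."""
    relevant = {
        name: value
        for name, value in cli_vars.items()
        if name not in ignore_names and not name.startswith(ignore_prefix)
    }
    # sort_keys gives a stable hash regardless of the order vars were passed in;
    # default=str lets non-JSON values such as datetimes be serialized.
    payload = json.dumps(relevant, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

With this, two runs that differ only in `VAR_NO_PARSE_my_datetime` produce the same fingerprint. The trade-off is that an ignored var must not influence anything resolved at parse time (configs, `enabled`, refs), otherwise the reused manifest would be stale.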