dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0
9.96k stars 1.63k forks source link

[Feature] Reflect macro dependencies on model nodes in `depends_on` in `manifest.json` #9982

Closed epapineau closed 5 days ago

epapineau commented 7 months ago

Is this a new bug in dbt-core?

Current Behavior

I am not 100% sure this is a 🐛, but it doesn't meet my assumptions of what I'd find in manifest.json:

If I have a macro, defined as:

{%- macro generate_model_sql() -%}
    select
        column_name_a,
        column_name_b
    from {{ ref('some_model') }}
{%- endmacro -%}

I would expect macros.macro.project.generate_model_sql.depends_on to contain: {'node': ['model.project.generate_model_sql']}. However, AFAICT, this depends_on only contains metadata about referenced macros and not referenced nodes / models.

Expected Behavior

Ref'd models to be included in the depends_on object

Steps To Reproduce

  1. Create a macro using ref
  2. dbt parse
  3. Inspect this macro's depends_on value in manifest.json

Relevant log output

No response

Environment

- OS:
- Python:
- dbt: 1.7.5

Which database adapter are you using with dbt?

bigquery

Additional Context

No response

dbeatty10 commented 7 months ago

Heya @epapineau 👋

This looks the same to me as needing to force dependencies like this:

 -- depends_on: {{ ref('some_model') }}

 {{ generate_model_sql() }}

If you don't do the -- depends_on: {{ ref('some_model') }} bit, does your project display the expected lineage graph and run in the correct order?

epapineau commented 7 months ago

@dbeatty10 thanks for the suggestion! The issue is not lineage or ordering - that is working as expected. However, after adding the suggested --depends_on and executing dbt parse, the ref'd models are still not present in the depends_on object for the macro in manifest.json.

dbeatty10 commented 7 months ago

After touching base with @gshank, I can confirm that it's known behavior that we don't track model refs when they are in a macro. So I'm going to update this to be a feature request.

jtcohen6 commented 7 months ago

Correct!

Conceptually, this is a bit similar to "vars referenced in macros are resolved based on the scope (project) of the model calling the macro, rather than the scope (package) where the macro is defined":

epapineau commented 6 months ago

@jtcohen6 interesting. Is there an alternative recommendation to understand macro dependencies on models from manifest.json?

jtcohen6 commented 6 months ago

@epapineau Thinking aloud - could you add an ephemeral model which calls the macro, and then look at the ephemeral model's depends_on.nodes in manifest.json?

epapineau commented 6 months ago

That use case would work for one or two macros, but would quickly become unfeasible at the scale of 100+ macros 🤔

jtcohen6 commented 6 months ago

Fair! Okay, taking a step back — why not track model dependencies within macros?

Conceptually, macros are resources in your project, but they aren't actually nodes in the DAG. They're functions that return code, which can call (and depend on) other functions that return code — whereas depends_on.nodes describes actual edges within the DAG. A model is only directly depended on by another DAG node, not by the function called within that DAG node which returns code.

You're allowed to define a macro that references a model that doesn't exist:

{%- macro generate_bad_model_sql() -%}
    select
        column_name_a,
        column_name_b
    from {{ ref('does_not_exist') }}
{%- endmacro -%}

Until you call that macro within an actual model (DAG node), it doesn't pose a problem for constructing a viable DAG. As soon as you do, you'll see a Compilation Error.

It's also tricky to think about: How would we track model → macro dependencies when that macro takes an argument, to modify the reference based on some input or parse-time condition?

{%- macro generate_model_sql(model_name='some_model') -%}
    select
        column_name_a,
        column_name_b
    from {{ ref(model_name) }}
{%- endmacro -%}

I do see the value in tracking which macros depend on ref within their depends_on.macros. I don't think we support that for builtin ref right now (it's not a macro per se), but it does work for custom ref overrides. Not actually recommending this, just to illustrate how you might be able to programmatically determine which macros / how many macros directly reference ref:

{%- macro generate_model_sql() -%}
    select
        column_name_a,
        column_name_b
    from {{ my_project.ref('some_model') }}
{%- endmacro -%}

{%- macro custom_ref() -%}
    {{ return(builtins.ref(*varargs, **kwargs)) }}
{%- endmacro -%}

Then in manifest.json:

    "macro.my_project.generate_model_sql": {
      "name": "generate_model_sql",
      "resource_type": "macro",
      ...
      "depends_on": {
        "macros": [
          "macro.my_project.ref"
        ]
      },
github-actions[bot] commented 1 week ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 5 days ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.