astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
https://astronomer.github.io/astronomer-cosmos/
Apache License 2.0

[Bug] dbt_vars raised "This can happen when calling a macro that does not exist" #1060

Open rightx2 opened 4 months ago

rightx2 commented 4 months ago

Astronomer Cosmos Version

Other Astronomer Cosmos version (please specify below)

If "Other Astronomer Cosmos version" selected, which one?

1.4.3

dbt-core version

1.7.16

Versions of dbt adapters

dbt-impala==1.4.3 (but I don't think this issue is related to the adapter)

LoadMode

DBT_LS

ExecutionMode

LOCAL

InvocationMode

None

airflow version

2.9.1

Operating System

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)" NAME="Debian GNU/Linux" VERSION_ID="12" VERSION="12 (bookworm)"

If you think it's a UI issue, what browsers are you seeing the problem on?

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

What happened?

dbt_vars can raise "This can happen when calling a macro that does not exist"

Relevant log output

'data_interval_end' is undefined. This can happen when calling a macro that does not exist. Check for typos and/or install package dependencies with "dbt deps".

How to reproduce

I need to pass a variable, start_dt, to my dbt models using Airflow's data_interval_end template variable. Below is my configuration for the Cosmos DAG, and it worked like a charm.

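from pathlib import Path

from cosmos import DbtTaskGroup, ExecutionConfig, ProfileConfig, ProjectConfig, RenderConfig
from cosmos.constants import ExecutionMode, LoadMode, TestBehavior

project_config = ProjectConfig(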
    dbt_project_path="/opt/airflow/dbt/data-dbt",
    project_name="data_dbt",
    partial_parse=True,
    dbt_vars={
        "start_dt": '{{ data_interval_end }}',
    },
)
profile_config = ProfileConfig(
    profile_name="data_dbt",
    target_name="dev",
    profiles_yml_filepath=Path("/opt/airflow/dbt/data-dbt", "profiles", "profiles.yml"),
)
render_config = RenderConfig(
    load_method=LoadMode.DBT_LS,
    emit_datasets=False,
    test_behavior=TestBehavior.AFTER_EACH,
    select=[
        "models/my_model.sql",
    ],
)
execution_config = ExecutionConfig(
    execution_mode=ExecutionMode.LOCAL,
)
operator_args = {
    "install_deps": True,
}

dbt_op = DbtTaskGroup(
    group_id="dbt_group",
    project_config=project_config,
    profile_config=profile_config,
    render_config=render_config,
    execution_config=execution_config,
    operator_args=operator_args,
)

However, when I tried to call a method on the template variable, as below, it raised an error:

project_config = ProjectConfig(
    dbt_project_path="/opt/airflow/dbt/data-dbt",
    project_name="data_dbt",
    partial_parse=True,
    dbt_vars={
        "start_dt": '{{ data_interval_end.strftime("%Y-%m-%d") }}',
    },
)

dbt model:

select *
from source("my_source")
where date_time >= '{{ var("start_dt") }}'

error:

  'data_interval_end' is undefined. This can happen when calling a macro that does not exist. Check for typos and/or install package dependencies with "dbt deps".

I think this is due to how yaml.dump is used here (I think using json.dumps would work...). Is there any way I can pass a variable to dbt models when the template expression calls a method?

Anything else :)?

See above.

Are you willing to submit PR?

Contact Details

rightx2@gmail.com

dosubot[bot] commented 4 months ago

Hello @rightx2! 👋 I'm here to help you with any bugs, questions, or contributions. Let me know how I can assist you!

The issue you're encountering with the data_interval_end variable being undefined when using dbt_vars in Cosmos might be related to how the variable is being templated and passed. The yaml.dump method is used to convert the dbt_vars dictionary to a YAML string, which might not handle Jinja templating as expected.

To address this, you can try using json.dumps instead of yaml.dump to ensure that the variables are correctly passed to dbt models. Here is an example of how you can modify your configuration:

import json
from pathlib import Path

from cosmos import DbtTaskGroup, ExecutionConfig, ProfileConfig, ProjectConfig, RenderConfig
from cosmos.constants import ExecutionMode, LoadMode, TestBehavior

project_config = ProjectConfig(
    dbt_project_path="/opt/airflow/dbt/data-dbt",
    project_name="data_dbt",
    partial_parse=True,
    dbt_vars=json.dumps({
        "start_dt": '{{ data_interval_end }}',
    }),
)

profile_config = ProfileConfig(
    profile_name="data_dbt",
    target_name="dev",
    profiles_yml_filepath=Path("/opt/airflow/dbt/data-dbt", "profiles", "profiles.yml"),
)

render_config = RenderConfig(
    load_method=LoadMode.DBT_LS,
    emit_datasets=False,
    test_behavior=TestBehavior.AFTER_EACH,
    select=[
        "models/my_model.sql",
    ],
)

execution_config = ExecutionConfig(
    execution_mode=ExecutionMode.LOCAL,
)

operator_args = {
    "install_deps": True,
}

dbt_op = DbtTaskGroup(
    group_id="dbt_group",
    project_config=project_config,
    profile_config=profile_config,
    render_config=render_config,
    execution_config=execution_config,
    operator_args=operator_args,
)

This change should help ensure that the data_interval_end variable is correctly passed and templated within your dbt models. If the issue persists, it might be worth checking if there are any specific compatibility concerns between the versions of Cosmos and dbt-core you are using [1][2].


rightx2 commented 4 months ago

The docs seem to say this works... that looks strange.

dwreeves commented 4 months ago

Really important question: Does this error occur in task execution or in DAG parsing/rendering?

I.e., does the error message appear on the Airflow homepage and prevent the DAG from loading at all?

dwreeves commented 4 months ago

I'm going ahead with the assumption that this occurs in DAG rendering and not task execution. I'm pretty confident I know what the problem is: I knew back in January that this would one day bite someone in the butt, and I have advocated for changing it precisely because of the issue you just ran into.

The problem:

Right now dbt vars and env are strongly coupled across both rendering and execution, but they should be more loosely coupled, precisely because of what you are attempting to do.

In Airflow, {{ templated_variables }} are not normally resolved until after a DagRun is initiated. So when your DagRun starts and the task runs, {{ data_interval_end.strftime("%Y-%m-%d") }} becomes (for example) "2024-06-21".
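As a rough illustration of what that runtime rendering produces, here is a sketch using only the standard library. The timestamp is a made-up stand-in for Airflow's data_interval_end, not a value from this issue. Note that in strftime, %m is the month and %M is the minute, so the two are easy to confuse.

```python
from datetime import datetime, timezone

# Hypothetical timestamp standing in for Airflow's data_interval_end
data_interval_end = datetime(2024, 6, 21, 9, 30, tzinfo=timezone.utc)

# %m is the zero-padded month; %M is the zero-padded minute
as_date = data_interval_end.strftime("%Y-%m-%d")
with_minutes = data_interval_end.strftime("%Y-%M-%d")

print(as_date)       # 2024-06-21
print(with_minutes)  # 2024-30-21
```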

During rendering of the DAG, Airflow's Jinja2 templating is not applied at all. This means that the string literal '{{ data_interval_end.strftime("%Y-%m-%d") }}' is passed to dbt unrendered. Because dbt also uses Jinja, dbt attempts to render it in its own Jinja2 environment, which doesn't have the same variables as Airflow's Jinja environment.

The reason it doesn't raise an error when you do {{ data_interval_end }} is that Jinja2 by default treats a variable that is not in the namespace as undefined and renders it as an empty string. {{ asdfjkl123456789 }} (i.e. gibberish) will not raise an error in Jinja2. However, calling a method on an undefined variable is where errors occur. E.g. {{ fake_variable }} works but {{ fake_variable.fake_method() }} raises an error.
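The difference can be reproduced outside of dbt with plain Jinja2. This is a minimal sketch assuming a default jinja2 Environment. dbt configures its own environment and wraps the error in its own message, so the exact text differs from the log above.

```python
from jinja2 import Environment
from jinja2.exceptions import UndefinedError

# Default settings: names missing from the context become jinja2.Undefined
env = Environment()

# A bare undefined name renders as an empty string instead of failing
rendered = env.from_string("[{{ data_interval_end }}]").render()
print(repr(rendered))

# Accessing a method on the undefined name raises UndefinedError
try:
    env.from_string('{{ data_interval_end.strftime("%Y-%m-%d") }}').render()
    error_message = None
except UndefinedError as exc:
    error_message = str(exc)
print(error_message)
```

The first template renders silently, which is why the original configuration appeared to work, while the second fails as soon as the method lookup touches the undefined value.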

How you can fix today:

You should look into using LoadMode.DBT_MANIFEST instead of LoadMode.DBT_LS.
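A rough sketch of what that switch might look like, reusing the paths from the original configuration. The manifest location is an assumption: it presumes a manifest.json generated ahead of time (for example by dbt compile) under the project's target directory. As far as I understand, Cosmos then builds the DAG from the manifest without invoking dbt ls, so the templated var is only rendered by Airflow at task execution.

```python
from pathlib import Path

from cosmos import ProjectConfig, RenderConfig
from cosmos.constants import LoadMode, TestBehavior

DBT_ROOT = Path("/opt/airflow/dbt/data-dbt")

project_config = ProjectConfig(
    dbt_project_path=DBT_ROOT,
    # Assumed location of a pre-generated manifest; adjust to your deployment
    manifest_path=DBT_ROOT / "target" / "manifest.json",
    project_name="data_dbt",
    dbt_vars={
        # Left as a template string; Airflow renders it when the task runs
        "start_dt": '{{ data_interval_end.strftime("%Y-%m-%d") }}',
    },
)

render_config = RenderConfig(
    # Parse the DAG from manifest.json instead of running `dbt ls`
    load_method=LoadMode.DBT_MANIFEST,
    emit_datasets=False,
    test_behavior=TestBehavior.AFTER_EACH,
    select=["models/my_model.sql"],
)
```

The trade-off is that the manifest has to be kept in sync with the dbt project, typically by regenerating it as part of the deployment.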

How Cosmos can fix:

As per my comment in January, vars and the env should be allowed to be decoupled. Errors should not be raised when a user attempts to set vars.

rightx2 commented 4 months ago

Your assumption is right: it happened at rendering time. The cause I had in mind matches exactly what you described. I think I'll switch to another load method. Thanks!

dwreeves commented 4 months ago

One more note I didn't mention is that your use case is not atypical. I think injecting DagRun variables like the data interval end should be supported. It's very natural to want to do that. And it clearly is not supported right now. I think we should make this a more explicitly supported pattern. So keep doing what you're doing and don't be discouraged!

rightx2 commented 4 months ago

Of course I will :)