astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
https://astronomer.github.io/astronomer-cosmos/
Apache License 2.0
758 stars 170 forks

[Bug] Models that have the same name as the dbt project will include the entire project in their run task #1306

Open alexr-lh opened 1 week ago

alexr-lh commented 1 week ago

Astronomer Cosmos Version

1.6.0

dbt-core version

1.8.7

Versions of dbt adapters

dbt-bigquery==1.8.3

LoadMode

DBT_LS

ExecutionMode

LOCAL

InvocationMode

None

airflow version

2.7.3

Operating System

Cloud Composer

If you think it's a UI issue, what browsers are you seeing the problem on?

No response

Deployment

Google Cloud Composer

Deployment details

No response

What happened?

Models that have the same name as the dbt package will include the entire package in their run statement.

I have a model which shares a name with the project. Instead of building a single model, its run task builds the entire project. This in turn causes intermittent failures: temporary tables required by other models are deleted mid-run, because the same models are being run in two tasks.

I'd expect this to be looking specifically for single models.

Relevant log output

No response

How to reproduce

  1. Create a project named name_clash.
  2. Create a model named no_clash and set it to incremental mode.
  3. Create a model also named name_clash and set it to incremental mode (could probably be any materialization).
  4. Make sure the query in the no_clash model takes a while to complete.
  5. Trigger the DAG.

You'll see that the Cosmos DAG task for no_clash contains just one model in the output, while the task for name_clash includes both of the models.
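A possible explanation for the behaviour in these steps, as a minimal sketch rather than dbt's or Cosmos's actual code: dbt resolves a bare selector against each node's fully qualified name (project name, then subdirectories, then file name), and its documentation notes that selecting the project name runs all models in the project. The function fqn_matches and the nodes mapping below are hypothetical, wildcard-free simplifications written only for this illustration.

```python
from typing import Dict, List


def fqn_matches(fqn: List[str], selector: str) -> bool:
    """Simplified, wildcard-free model of matching a bare selector against an FQN."""
    # A selector equal to the node's own name always matches that node.
    if fqn[-1] == selector:
        return True
    parts = selector.split(".")
    # Otherwise the selector is treated as a prefix of the FQN, so a bare
    # package name matches every node that lives in that package.
    return len(parts) <= len(fqn) and fqn[: len(parts)] == parts


# Assumed FQNs for the two models in the reproduction project (package name first).
nodes: Dict[str, List[str]] = {
    "no_clash": ["name_clash", "no_clash"],
    "name_clash": ["name_clash", "name_clash"],
}


def select(selector: str) -> List[str]:
    """Return the names of all models the selector would pick up."""
    return sorted(name for name, fqn in nodes.items() if fqn_matches(fqn, selector))


print(select("no_clash"))    # only the no_clash model
print(select("name_clash"))  # both models, since the selector equals the package name
```

Under this simplified model, the task for name_clash picks up no_clash as well, which would match the overlap between the two tasks described above.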

Anything else :)?

No response

Are you willing to submit PR?

Contact Details

alex.richardson@loveholidays.com

dosubot[bot] commented 1 week ago

Hey there, @alexr-lh! I'm here to assist you with any bugs, questions, or contribution opportunities. Let's work together to tackle this issue!

The issue you're experiencing seems to be related to how Cosmos handles model selection when a model shares the same name as the dbt package. Unfortunately, there isn't specific information available in the Cosmos repository about handling model name clashes with the dbt package. This might be a limitation or an oversight in the current implementation.

Given the behavior you're observing, it seems that Cosmos is not correctly differentiating between a single model and the entire project when the model name matches the package name. This results in the entire package being included in the run statement instead of just the single model.

To address this, you might consider submitting a pull request to the Cosmos repository to handle this specific case. You could propose a change that ensures the run statement targets only the specific model, even when its name matches the package name. This would involve modifying the logic that determines which models to include in a run statement, possibly by adding a check to differentiate between a model and a package when their names clash.

If you're willing to submit a PR, as indicated, this could be a valuable contribution to the project.

To continue talking to Dosu, mention @dosu.

alexr-lh commented 1 week ago

@dosu Where in the codebase is the logic for determining what to include in the run task kept?

dosubot[bot] commented 1 week ago

The logic for determining what to include in the run task, specifically for dbt model selection, is located in the file cosmos/dbt/selector.py. This file contains classes and functions such as NodeSelector, GraphSelector, select_nodes, apply_select_filter, and apply_exclude_filter, which implement the logic for selecting dbt nodes based on various criteria like paths, tags, and configurations [1].

alexr-lh commented 1 week ago

@dosu Is that used for the DBT_LS_FILE load method? That file seems to support the select/exclude options which are not relevant for the DBT_LS_FILE method.

dosubot[bot] commented 1 week ago

The cosmos/dbt/selector.py file, which supports select and exclude options, is not directly relevant for the DBT_LS_FILE load method. The DBT_LS_FILE method does not support RenderConfig.select and RenderConfig.exclude options. Instead, it relies on the output of the dbt ls command from a file, without applying additional selection or exclusion logic [1]. The handling for DBT_LS_FILE is more likely implemented in the cosmos/dbt/graph.py file, which manages various load methods, including DBT_LS_FILE [2].

tatiana commented 6 days ago

Hi @alexr-lh, Thank you very much for reporting this issue. We were not testing this use case, and I'm glad you identified and reported it.

Based on your description, the problem lies in how the dbt command is built when running model tasks. It does not seem to be specific to LoadMode.DBT_LS: the issue does not affect how the dbt project is converted into a DAG, from a topology perspective, only the execution of the task itself.

My impression is that if you were using another execution mode, you'd still face the same problem.

The commands run by Cosmos operators are assembled from a few parts of the code through inheritance. In this particular case, two relevant parts are:

  1. https://github.com/astronomer/astronomer-cosmos/blob/a5de8b4dc5184b46ca8358025a01b5107c747ffc/cosmos/operators/local.py#L472
  2. https://github.com/astronomer/astronomer-cosmos/blob/a5de8b4dc5184b46ca8358025a01b5107c747ffc/cosmos/operators/base.py#L361

We'd love to receive a contribution. Please let us know if you'd like any support.
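For anyone picking this up, here is a rough sketch of the inheritance pattern described above and of one direction a fix could take. It is not the code behind the two links. The class names, the qualify switch, and the use of --select are assumptions made for illustration. The package-qualified form follows the my_package.some_model selector shown in the dbt node selection documentation.

```python
from typing import List


class BaseDbtOperatorSketch:
    """Hypothetical base class: owns the flags shared by every dbt command."""

    base_cmd: List[str] = []  # filled in by subclasses

    def __init__(self, package_name: str, model_name: str, qualify: bool = False) -> None:
        self.package_name = package_name
        self.model_name = model_name
        self.qualify = qualify  # hypothetical switch, not an existing Cosmos option

    def selector(self) -> str:
        # A bare model name is ambiguous when it equals the package name;
        # the package-qualified form targets a model inside that package.
        if self.qualify:
            return f"{self.package_name}.{self.model_name}"
        return self.model_name

    def build_cmd(self) -> List[str]:
        # The subclass contributes the subcommand, the base class the selection flag.
        return ["dbt", *self.base_cmd, "--select", self.selector()]


class RunOperatorSketch(BaseDbtOperatorSketch):
    """Hypothetical subclass: only contributes the dbt subcommand."""

    base_cmd = ["run"]


bare = RunOperatorSketch("name_clash", "name_clash").build_cmd()
qualified = RunOperatorSketch("name_clash", "name_clash", qualify=True).build_cmd()
print(" ".join(bare))
print(" ".join(qualified))
```

Whether qualifying the selector is the right change for Cosmos is untested here. A model stored in a subdirectory that is also named after the package could still be matched, so a path-based selector may be worth comparing.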