astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
https://astronomer.github.io/astronomer-cosmos/
Apache License 2.0

Allow users to retrieve a model/seed/snapshot Dataset URI via a function #1036

Open tatiana opened 5 months ago

tatiana commented 5 months ago

Description

Cosmos should let users retrieve the Dataset URI generated for a dbt node (e.g. a model) via a function such as get_dataset, so they can easily leverage Airflow data-aware scheduling.
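
A minimal sketch of how such a helper could be used for data-aware scheduling. The function name get_dataset comes from this issue; the commented-out import path and signature are assumptions, and until a helper exists the URI has to be hand-built:

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

# Hypothetical helper requested in this issue. The name comes from the issue
# body; the import path and signature are assumptions for illustration only:
# from cosmos import get_dataset
# orders = get_dataset(project_name="jaffle_shop", model_name="orders")

# Until such a helper exists, the URI must be hand-built to match what Cosmos
# emits at task execution time:
orders = Dataset("postgres://host:5432/database.schema.orders")

with DAG(
    dag_id="downstream_of_orders",
    schedule=[orders],  # data-aware scheduling: run when `orders` is updated
    start_date=datetime(2024, 1, 1),
):
    EmptyOperator(task_id="consume")
```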

This issue has been raised several times in the #airflow-dbt Slack channel.

Use case/motivation

Before Cosmos 1.1, the URIs in Cosmos did not uniquely identify datasets. They were created at DAG parsing time using:

```python
f"DBT://{connection_id.upper()}/{project_name.upper()}/{model_name.upper()}"
```

The Airflow connection identifier does not represent the actual data warehouse where the data is stored, its database, or its schema. Different environments (dev and prod) can use the same connection identifier to refer to different databases. The same Airflow deployment can have different connections referring to the same database. And the same table can be referenced in multiple dbt projects, but with these Dataset URIs it would look like a different dataset in each project.
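
A small illustration of the ambiguity, using hypothetical connection and project names:

```python
def legacy_uri(connection_id: str, project_name: str, model_name: str) -> str:
    # The pre-1.1 scheme quoted above.
    return f"DBT://{connection_id.upper()}/{project_name.upper()}/{model_name.upper()}"

# Two connections pointing at the same physical table yield different URIs...
assert legacy_uri("warehouse_a", "jaffle_shop", "orders") != legacy_uri(
    "warehouse_b", "jaffle_shop", "orders"
)
# ...while dev and prod sharing the connection id "warehouse" collide on one
# URI even when they point at different databases.
assert legacy_uri("warehouse", "jaffle_shop", "orders") == legacy_uri(
    "warehouse", "jaffle_shop", "orders"
)
```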

After several discussions with multiple people, we decided to use OpenLineage URIs to define Cosmos Datasets (#485, related issue #305), following the OpenLineage naming convention. One advantage of this approach is that people using OpenLineage can easily link to Airflow datasets. Another is that we could reuse an existing library responsible for keeping those URIs consistent with convention changes: the OpenLineage Integration Common package. One drawback is that the datasets must be set during task execution.

This led to Cosmos Dataset URIs that look similar to:

```
"postgres://host:5432/database.schema.table"
```

At the time we tested this implementation, we were not aware of two drawbacks:

  1. Airflow versions before 2.9 were not designed to support datasets generated during task execution (https://github.com/astronomer/astronomer-cosmos/issues/522, https://github.com/apache/airflow/issues/34206).
  2. We didn't realise that get_dbt_dataset was exposed to end users (https://github.com/astronomer/astronomer-cosmos/pull/1034); a sketch of that legacy helper follows below.
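
For reference, judging from the linked PR, the legacy helper combined the pre-1.1 URI scheme with an Airflow Dataset roughly like this (a sketch; the exact module path has varied between Cosmos versions):

```python
from airflow.datasets import Dataset

def get_dbt_dataset(connection_id: str, project_name: str, model_name: str) -> Dataset:
    # Sketch of the legacy helper: it wraps the pre-1.1 URI scheme shown
    # earlier in this issue in an Airflow Dataset.
    return Dataset(
        f"DBT://{connection_id.upper()}/{project_name.upper()}/{model_name.upper()}"
    )
```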

Related issues

Are you willing to submit a PR?

github-actions[bot] commented 1 day ago

This issue is stale because it has been open for 30 days with no activity.