Description
Cosmos should allow users to easily retrieve the Dataset URI generated for a dbt node (e.g. a model) via a function, e.g. get_dataset, so they can easily leverage Airflow data-aware scheduling.
This issue has been raised several times in the #airflow-dbt Slack channel, including:
Use case/motivation
Before Cosmos 1.1, the URIs in Cosmos did not uniquely identify datasets. They were created during DAG parsing time using:
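Roughly, that parse-time construction looked like the snippet below. This is a best-effort illustration only; the exact pre-1.1 URI format and all of the identifiers used are assumptions, not quoted from the Cosmos source:

```python
from airflow.datasets import Dataset

# Assumed illustration of the pre-1.1 scheme: the URI was built at DAG parsing time
# from the Airflow connection id and dbt identifiers, not from the warehouse itself.
connection_id = "warehouse_default"  # hypothetical Airflow connection id
project_name = "jaffle_shop"         # hypothetical dbt project name
model_name = "orders"                # hypothetical dbt model name

dataset = Dataset(f"DBT://{connection_id}/{project_name}/{model_name}")
```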
The problem with this approach:
- The Airflow connection identifier does not represent the actual data warehouse where the data is stored, the database, or its schema.
- Different environments (dev and prod) can use the same connection identifier to refer to different databases.
- The same Airflow deployment can have different connections referring to the same database.
- The same table can be referenced in multiple dbt projects, but with these Dataset URIs it would look different depending on which dbt project it was imported from.
After several discussions with multiple people, we decided to use OpenLineage URIs to define Cosmos Datasets (#485, related issue #305), following the OpenLineage naming convention. One advantage of this approach is that people using OpenLineage can easily link to Airflow datasets. Another is that we could reuse an existing library responsible for keeping those URIs consistent with convention changes: the OpenLineage Integration Common package. One drawback of this approach is that the datasets must be set during task execution.
This led to Cosmos Dataset URIs that look similar to:
"postgres://host:5432/database.schema.table"
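For context, a downstream DAG that wants to react to that dataset today has to hard-code the runtime-generated URI. A minimal sketch, assuming Airflow 2.4+ data-aware scheduling and reusing the example URI above (the DAG and task names are illustrative):

```python
import pendulum
from airflow.datasets import Dataset
from airflow.decorators import dag, task

# The URI string has to be written out by hand today, mirroring the example above.
orders_table = Dataset("postgres://host:5432/database.schema.table")

@dag(schedule=[orders_table], start_date=pendulum.datetime(2024, 1, 1), catchup=False)
def downstream_consumer():
    @task
    def read_table():
        ...  # placeholder for work that reads the table updated by the dbt model

    read_table()

downstream_consumer()
```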
At the time we tested this implementation, we were not aware of two drawbacks:
get_dbt_dataset was exposed to end-users (https://github.com/astronomer/astronomer-cosmos/pull/1034).
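A sketch of the kind of helper this issue asks for. Nothing below exists in Cosmos today; the name get_dataset, its signature, and the way the URI is assembled are assumptions used purely to illustrate the request:

```python
from airflow.datasets import Dataset


def get_dataset(scheme: str, host: str, port: int, database: str, schema: str, model_name: str) -> Dataset:
    """Return the Dataset a dbt model maps to, following the OpenLineage-style
    URI layout shown above (scheme://host:port/database.schema.table)."""
    return Dataset(f"{scheme}://{host}:{port}/{database}.{schema}.{model_name}")


# A downstream DAG could then schedule on the same URI Cosmos emits at runtime,
# without hard-coding the string:
orders = get_dataset("postgres", "host", 5432, "database", "schema", "table")
assert orders.uri == "postgres://host:5432/database.schema.table"
```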
Related issues
- #305
- #485
- #522
Are you willing to submit a PR?