astronomer / astronomer-cosmos

Run your dbt Core projects as Apache Airflow DAGs and Task Groups with a few lines of code
https://astronomer.github.io/astronomer-cosmos/
Apache License 2.0

Dataset Aliases #1119

Open pankajkoti opened 2 months ago

pankajkoti commented 2 months ago

Description co-authored by @tatiana @pankajastro

Since version 1.1, Cosmos has created Airflow inlets and outlets for every dbt model/seed/snapshot task, which allows end-users to leverage Airflow data-aware scheduling.

In the past, Cosmos identified these inlets and outlets using URIs that were not representative of the dataset being created. The one advantage of this approach was that the identifiers could be created at DAG parsing/processing time.

This changed in the 1.1 release, when we decided to adopt the OpenLineage naming convention to describe the Airflow Datasets created by Cosmos (inlets/outlets). They became something similar to: "postgres://0.0.0.0:5432/postgres.public.stg_customers". The downside of this approach is that we started using a library, openlineage-integration-common, that can only create the resource URIs after the dbt command has run, since it currently relies on dbt-core artefacts. This means we started creating inlets/outlets during task execution.
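
To make the shape of these identifiers concrete, here is a minimal sketch that assembles a URI matching the example above. The `dataset_uri` helper and its parameters are hypothetical names used only for illustration. This is not how openlineage-integration-common or Cosmos builds the URI, since in practice the values come from dbt-core artefacts that only exist after the dbt command has run.

```python
def dataset_uri(scheme: str, host: str, port: int, database: str, schema: str, table: str) -> str:
    """Build an identifier shaped like the example URI in this issue.

    Hypothetical helper for illustration only: the real values are read
    from dbt artefacts after the dbt command has run.
    """
    return f"{scheme}://{host}:{port}/{database}.{schema}.{table}"


uri = dataset_uri("postgres", "0.0.0.0", 5432, "postgres", "public", "stg_customers")
print(uri)  # postgres://0.0.0.0:5432/postgres.public.stg_customers
```

Every component except the table name depends on the connection profile and the dbt run, which is why the full identifier is hard to predict at DAG parsing time.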

Airflow <= 2.9 was not designed to support setting inlets and outlets during task execution, so one side effect of this change was this long-standing issue: https://github.com/astronomer/astronomer-cosmos/issues/522

Another side effect was that, since we started relying on task execution to determine the Airflow dataset identifier, we didn't give end-users a method for easily determining it. More context in https://github.com/astronomer/astronomer-cosmos/issues/1036.

The community raises these problems very often.

We created an issue in Airflow: https://github.com/apache/airflow/issues/34206

After several discussions with @uranusjr, he proposed introducing the concept of DatasetAliases to Airflow 2.10. @Lee-W worked on this: https://github.com/apache/airflow/pull/40478

This feature will be released as part of Airflow 2.10.

The goal of this epic is to leverage Airflow DatasetAliases in Cosmos, so that:

Initially planned tasks, more to be added as part of the PoC ticket:

tatiana commented 3 days ago

I made significant progress on this task, as can be seen in PR #1217.

Yesterday, I implemented the changes to the code itself (no tests, just a quick PoC). Today, I validated and made a minor adjustment to make it work.

The change works as expected in Astro CLI. Using Airflow standalone doesn't work so well. I connected with Wei about this and he'll further investigate.

I was able to see the Datasets and Dataset Aliases in the Airflow UI.

I was also able to see a DAG being triggered. I'll soon share more information on this.