apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Atlas Backend module improvements #8969

Open nodyke opened 4 years ago

nodyke commented 4 years ago

Description

There is a requirement to send lineage information from Apache Airflow to Apache Atlas as part of a data lineage implementation. However, the current module, introduced in the 1.10 stable release, has several problems:

  1. It creates a new Atlas operator entity for each DagRun.
  2. Creation of missing entities cannot be controlled through configuration.
  3. It fails the operator if sending lineage was not successful.
  4. The HTTP timeout can't be configured.
  5. The current Atlas type definition has only a small set of attributes.
  6. The class wrappers for Atlas types contain errors.

Use case / motivation

As part of an analytics data platform, data lineage needs to be imported automatically, and most of it should be sent by Airflow without manual steps. Our module uses the old Atlas backend module as a base, but contains fixes and improvements. What was fixed:

  1. Creation of the Atlas entity for an Airflow operator no longer uses the execution date.
  2. Added a config property for enabling/disabling creation of missing inlet and outlet entities.
  3. Added a config property for enabling/disabling operator failure when lineage sending is unsuccessful.
  4. Added a config property for the Atlas timeout.
  5. Added "template_fields" to the Airflow operator typedef and added a config property for setting any additional operator attributes.
  6. Fixed the DataSet class wrapper and added abstract types for file and JDBC sources.
  7. Added utility methods for correctly generating inlet and outlet objects.
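
To make fixes 1 to 4 concrete, here is a minimal sketch of the intended behavior. It is not the module's actual code: the names `AtlasLineageSettings`, `operator_qualified_name`, `send_lineage`, and the `post` callable are hypothetical, and the real config keys are not specified in this issue.

```python
import logging
from dataclasses import dataclass
from typing import Any, Callable, Dict

log = logging.getLogger(__name__)


@dataclass
class AtlasLineageSettings:
    """Hypothetical settings object; field names are assumptions."""
    create_missing_entities: bool = True  # fix 2: toggle creation of missing inlets/outlets
    fail_on_error: bool = False           # fix 3: should a failed send fail the operator?
    timeout_seconds: float = 10.0         # fix 4: timeout passed to the HTTP call


def operator_qualified_name(dag_id: str, task_id: str) -> str:
    """Fix 1: build the operator identifier without the execution date,
    so every DagRun of the same task maps to the same Atlas entity."""
    return f"{dag_id}_{task_id}"


def send_lineage(
    settings: AtlasLineageSettings,
    payload: Dict[str, Any],
    post: Callable[[Dict[str, Any], float], None],
) -> bool:
    """Send a payload via `post` (a stand-in for the real Atlas client call).

    Returns True on success. On failure, re-raises only when
    settings.fail_on_error is set; otherwise logs a warning and returns False.
    """
    try:
        post(payload, settings.timeout_seconds)
        return True
    except Exception:
        if settings.fail_on_error:
            raise
        log.warning("Lineage sending failed; task continues", exc_info=True)
        return False
```

With `fail_on_error` left at `False`, an unreachable Atlas server only produces a warning, so a lineage outage does not fail the task itself.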

Related Issues

AIRFLOW-5912

boring-cyborg[bot] commented 4 years ago

Thanks for opening your first issue here! Be sure to follow the issue template!

thibaultbl commented 2 years ago

Do you think this module will be implemented any time soon?

If not, it could at least be useful to add a few examples of using a custom LineageBackend.

uranusjr commented 2 years ago

Feel free to contribute to either the module or the examples.

thibaultbl commented 2 years ago

I am willing to spend some time on it.

Nevertheless, it requires some changes to the Operator class, mainly adding a "lineage_data" attribute to every Operator. Are you open to this addition?

uranusjr commented 2 years ago

It kind of depends on what that attribute would hold and how it would be populated, persisted, and used. The act of adding the attribute itself should be easy and without negative consequences (just add it to BaseOperator).