apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

[Question] best way to track history of task instances #1403

Closed jenelecc closed 4 years ago

jenelecc commented 8 years ago

Dear Airflow Maintainers,

We are migrating our ETL over to Airflow and had a question about tracking the history of task instances. We log run information in our data for data lineage purposes (audit_id is a tracking column in all of our tables).

A use case would be a task that populates table A with some records. Say a task instance for simple_dag, populate_A_task, 20160405 runs on 4/6 and populates table A with some records with an audit id of 123. On 4/7 we notice a bug, modify populate_A_task and then clear simple_dag, populate_A_task, 20160405 for a rerun. The rerun fails and does not change the data in the table. The audit id of the data is still 123 and should still refer to that first successful attempt that populated the table.

We first thought of using the task instance key for the audit_id, but this does not track which attempt of the task instance populated the data; it only tracks the latest attempt. In the use case above, the task instance key is tied to the latest attempt (the failed rerun) instead of the earlier successful attempt.

(I should also note that the documentation describes task_instance_key_str as unique, but it is not, because it truncates the execution date to the day. Two externally triggered DAG runs with user-chosen run ids can generate two different task instances with the same task_instance_key_str.
Example:
airflow trigger_dag -r test_0 basic_flow
airflow trigger_dag -r test_1 basic_flow)
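To make the collision concrete, here is a minimal sketch. It assumes the key is built as dag_id, task_id, and the execution date formatted as YYYYMMDD, joined by double underscores; the helper function and the task name some_task are hypothetical stand-ins for illustration, not Airflow code.

```python
from datetime import datetime


def task_instance_key_str(dag_id, task_id, execution_date):
    # Hypothetical helper: assumes the key is dag_id__task_id__YYYYMMDD,
    # so the time-of-day part of execution_date is dropped.
    return "{}__{}__{}".format(dag_id, task_id, execution_date.strftime("%Y%m%d"))


# Two manually triggered runs a few seconds apart on the same day.
first_run = datetime(2016, 4, 5, 10, 0, 0)
second_run = datetime(2016, 4, 5, 10, 0, 5)

key_a = task_instance_key_str("basic_flow", "some_task", first_run)
key_b = task_instance_key_str("basic_flow", "some_task", second_run)

# Different execution dates, identical keys.
keys_collide = (first_run != second_run) and (key_a == key_b)
```

Under that assumption, both runs map to the same string even though their execution dates differ, so the key cannot distinguish them.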

We then thought of using the job_id, but the job table does not include task instance information, and in the use case above there is no task instance record to tie back to the successful job, since that was not the latest attempt.

We'd like to keep a TaskInstanceHistory that tracks every attempted run of a TaskInstance, and we were wondering whether you had any thoughts on the best way to do this.
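As a rough illustration of the idea, here is a minimal sketch of an append-only history table, using an in-memory SQLite database. The table name task_instance_history, its columns, and both helper functions are assumptions made for this example; none of them are part of Airflow.

```python
import sqlite3

# Hypothetical schema: one row per attempt, never updated in place.
SCHEMA = """
CREATE TABLE task_instance_history (
    audit_id       INTEGER PRIMARY KEY AUTOINCREMENT,
    dag_id         TEXT NOT NULL,
    task_id        TEXT NOT NULL,
    execution_date TEXT NOT NULL,
    state          TEXT NOT NULL
)
"""


def record_attempt(conn, dag_id, task_id, execution_date, state):
    """Append one attempt and return its audit_id."""
    cur = conn.execute(
        "INSERT INTO task_instance_history "
        "(dag_id, task_id, execution_date, state) VALUES (?, ?, ?, ?)",
        (dag_id, task_id, execution_date, state),
    )
    conn.commit()
    return cur.lastrowid


def last_successful_audit_id(conn, dag_id, task_id, execution_date):
    """Return the audit_id of the most recent successful attempt, or None."""
    row = conn.execute(
        "SELECT MAX(audit_id) FROM task_instance_history "
        "WHERE dag_id = ? AND task_id = ? AND execution_date = ? "
        "AND state = 'success'",
        (dag_id, task_id, execution_date),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

# First run succeeds and populates table A; the later rerun fails.
first_id = record_attempt(conn, "simple_dag", "populate_A_task", "20160405", "success")
rerun_id = record_attempt(conn, "simple_dag", "populate_A_task", "20160405", "failed")

audit_id = last_successful_audit_id(conn, "simple_dag", "populate_A_task", "20160405")
```

Because rows are only ever appended, the failed rerun gets its own id and the data in table A can keep pointing at the id of the attempt that actually wrote it.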

thanks much, Jennie

bolkedebruin commented 8 years ago

@jenelecc can you create a Jira issue for this and also have a look at AIRFLOW-20? There is some lineage work in there, and we could use more opinions on it. Thanks!