apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0
36.95k stars 14.26k forks source link

Grouping Dataset Events to Trigger DAGs #42015

Open dirrao opened 1 month ago

dirrao commented 1 month ago

Description

No response

Use case/motivation

To handle multiple dataset updates efficiently and avoid triggering a DAG for every small dataset update (like a tiny partition), you can implement a "batching" mechanism where the DAG waits for a group of dataset events before triggering. This way, you avoid redundant DAG runs and ensure the DAG only executes when enough meaningful updates have occurred.

Related issues

No response

Are you willing to submit a PR?

Code of Conduct

dirrao commented 1 month ago

Hi @uranusjr, could you provide your feedback on this feature request when you have a moment?

uranusjr commented 1 month ago

Isn’t this basically the idea behind AIP-76?

dirrao commented 1 month ago

Possibly related, but I'm not sure. Does that include batching the events?

dirrao commented 1 month ago

The AIP-82 contains this functionality. https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-82+External+event+driven+scheduling+in+Airflow

uranusjr commented 1 month ago

The only mention on batching I can find in AIP-82 is under the Out of Scope section.

AIP-76 does not do batching, but works on a different level, separating individual events from triggering the actual downstream run. I do not know if it fits your use case; only you can decide.

dirrao commented 1 month ago

Currently, we're using a pull-based mechanism and triggering dataset creation events via the REST API. However, if we want to trigger events only when there are enough accumulated, we have to maintain an external state, group the events, and then send the dataset creation request. It would be great to have this feature built-in as part of the existing functionality.

dirrao commented 1 month ago

It is challenging to map data warehouse partitions to the dataset partitions mentioned in AIP-76. As a result, batching events for triggering DAG runs is not feasible in this context.