Open · dirrao opened this issue 1 month ago
Hi @uranusjr, could you provide your feedback on this feature request when you have a moment?
Isn’t this basically the idea behind AIP-76?
Possibly related, but I'm not sure. Does that include batching the events?
AIP-82 covers this functionality: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-82+External+event+driven+scheduling+in+Airflow
The only mention of batching I can find in AIP-82 is in the Out of Scope section.
AIP-76 does not do batching, but works on a different level, separating individual events from triggering the actual downstream run. I do not know if it fits your use case; only you can decide.
Currently, we're using a pull-based mechanism and emitting dataset events via the REST API. However, if we want to fire an event only once enough updates have accumulated, we have to maintain external state, group the events ourselves, and then send the dataset event request. It would be great to have this batching built in as part of the existing functionality.
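To make the workaround concrete, here is a minimal sketch of the external batching state we maintain today. The `EventBatcher` class, the threshold, and the event dicts are all illustrative names, not Airflow APIs; in our setup the `flush` callback is where the dataset event POST to the Airflow REST API would go.

```python
# Sketch of the external batching state maintained outside Airflow.
# EventBatcher and its names are hypothetical, not part of Airflow.
from typing import Callable, List


class EventBatcher:
    """Accumulate upstream events and fire a callback once enough are queued."""

    def __init__(self, threshold: int, flush: Callable[[List[dict]], None]):
        self.threshold = threshold
        self.flush = flush
        self.pending: List[dict] = []

    def add(self, event: dict) -> None:
        self.pending.append(event)
        if len(self.pending) >= self.threshold:
            # In our pipeline, flush() would send one dataset event to the
            # Airflow REST API, carrying the grouped partitions as extra payload.
            self.flush(self.pending)
            self.pending = []


# Example: emit one grouped event per 3 upstream partition updates.
emitted = []
batcher = EventBatcher(threshold=3, flush=lambda evts: emitted.append(list(evts)))
for i in range(7):
    batcher.add({"partition": i})
# 7 updates -> 2 grouped flushes of 3; one update still pending.
```

The point of the request is that this accumulator (and its persistence across runs) has to live outside Airflow today.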
It is challenging to map data warehouse partitions to the dataset partitions mentioned in AIP-76. As a result, batching events for triggering DAG runs is not feasible in this context.
Description
No response
Use case/motivation
To handle multiple dataset updates efficiently and avoid triggering a DAG for every small dataset update (such as a tiny partition), Airflow could provide a built-in "batching" mechanism where the DAG waits for a group of dataset events before triggering. This would avoid redundant DAG runs and ensure the DAG executes only once enough meaningful updates have occurred.
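One common shape for such a batching policy is "trigger on count or age": run once enough events have accumulated, or once the oldest pending event has waited too long. The sketch below is an assumption about how the condition could look, not an existing Airflow interface; `should_trigger`, `min_events`, and `max_wait_s` are hypothetical names.

```python
# Hypothetical batching policy: trigger on event count OR on the age
# of the oldest pending event, whichever comes first.
import time
from typing import Optional


def should_trigger(
    pending_count: int,
    oldest_event_ts: float,
    min_events: int = 5,
    max_wait_s: float = 3600.0,
    now: Optional[float] = None,
) -> bool:
    """Return True when a batched DAG run should fire."""
    if pending_count == 0:
        return False
    now = time.time() if now is None else now
    return pending_count >= min_events or (now - oldest_event_ts) >= max_wait_s
```

The age-based fallback matters in practice: without it, a trickle of updates that never reaches `min_events` would stall the downstream DAG indefinitely.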
Related issues
No response
Are you willing to submit a PR?