apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Add support for "H" syntax in cron scheduling #13788

Closed raj-manvar closed 2 years ago

raj-manvar commented 3 years ago

Description

It could be beneficial for Airflow to support Jenkins' "H" cron syntax in its scheduling. This would mitigate the stampede of tasks at the top of every hour (or other common interval), which can currently strain resources, depending on the application.

The "H" syntax tells the scheduler to run the DAG within a window of time, allowing it to spread out jobs based on a hash value. For instance, "H(0-15) * * * *" means to schedule at any time in the first 15 minutes of the hour, and "H * * * *" means to schedule during any minute of the hour.
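
A minimal sketch of the hash-based spreading described above, assuming the "H" tokens are resolved into an ordinary five-field cron expression before scheduling. The `resolve_h` helper, its field ranges, and the choice of SHA-256 are illustrative assumptions; this is not an Airflow or Jenkins API.

```python
import hashlib
import re

# Hypothetical helper, not part of Airflow: turns "H" / "H(a-b)" fields into
# concrete values so the result is a plain cron expression.
# Default (low, high) range per field: minute, hour, day of month, month, weekday.
# Day of month is capped at 28 here so the chosen day exists in every month.
FIELD_RANGES = [(0, 59), (0, 23), (1, 28), (1, 12), (0, 6)]
H_TOKEN = re.compile(r"^H(?:\((\d+)-(\d+)\))?$")


def resolve_h(expr: str, key: str) -> str:
    """Replace each "H" field with a stable value derived from hashing `key`."""
    fields = expr.split()
    if len(fields) != 5:
        raise ValueError("expected a five-field cron expression")
    resolved = []
    for index, (field, (low, high)) in enumerate(zip(fields, FIELD_RANGES)):
        match = H_TOKEN.match(field)
        if match is None:
            resolved.append(field)  # ordinary cron field, left untouched
            continue
        if match.group(1) is not None:
            # Explicit window such as H(0-15)
            low, high = int(match.group(1)), int(match.group(2))
        if low > high:
            raise ValueError(f"empty range in field {field!r}")
        # Hash the key together with the field position so each field gets
        # its own value, and the same key always maps to the same slot.
        digest = hashlib.sha256(f"{key}:{index}".encode()).digest()
        value = low + int.from_bytes(digest[:8], "big") % (high - low + 1)
        resolved.append(str(value))
    return " ".join(resolved)


# The DAG id is used as the hash key here (an assumption for illustration).
print(resolve_h("H(0-15) * * * *", "my_dag"))
```

Because the value depends only on the key, a given DAG keeps the same slot across scheduler restarts, while different DAGs tend to land on different minutes.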

Use case / motivation

The aim is to resolve the stampede of tasks occurring at a particular hour of the day, or at midnight on a particular day of the month. Currently we need to reserve more resources for Airflow to handle the peaks when many tasks are scheduled at the same time. The H syntax would distribute load more evenly over time and save resources.

Are you willing to submit a PR?

Yup. From some code digging, it looks like Airflow does the crontab scheduling using a Python library. If the library already supports the H syntax, it'd be simpler, but if not I'd need some more guidance / research support.

Related Issues

boring-cyborg[bot] commented 3 years ago

Thanks for opening your first issue here! Be sure to follow the issue template!

potiuk commented 3 years ago

I think this one might be difficult before switching to (or rather enabling) support for "regular" cron behaviour. Airflow does NOT work like cron even if the specification is cron-like. Airflow works on "data intervals" rather than on a cron schedule. This means that the time specified in the DAG is not the schedule, but rather an indication of which data interval Airflow should work on: it marks the "beginning" of the data interval each run should process. As a result, if you want to process Monday's data, Airflow starts the run at the midnight that ends Monday.
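
To make the data interval semantics concrete, here is a small standard-library sketch for a daily schedule. It does not call Airflow; the `daily_run` function is a hypothetical illustration of the behaviour described in this comment.

```python
from datetime import datetime, timedelta, timezone


def daily_run(logical_date: datetime):
    """Return (interval_start, interval_end, earliest_start) for a daily schedule."""
    interval_start = logical_date
    interval_end = logical_date + timedelta(days=1)
    # The run can only begin once the interval it covers has ended.
    earliest_start = interval_end
    return interval_start, interval_end, earliest_start


# 2021-01-18 is a Monday; the time in the schedule marks the interval's beginning.
monday = datetime(2021, 1, 18, tzinfo=timezone.utc)
start, end, run_after = daily_run(monday)
print("covers:", start.isoformat(), "to", end.isoformat())
print("starts at:", run_after.isoformat())
```

The run labelled with Monday's date therefore starts at the midnight between Monday and Tuesday.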

This is not intuitive, and we are discussing whether it should be allowed to run "cron jobs" regularly. And while I can imagine H() might be used there as well, it is going to be even more confusing (as the data interval should still cover the full day until midnight even if the job starts 15 minutes later). For me, having a "random delay" in the DAG definition as a separate parameter would be better.
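
One way to picture the "separate delay parameter" idea is sketched below, assuming the delay is derived from a hash of the DAG id so that it stays stable between runs. The names `spread_delay`, `plan_daily_run`, and `max_delay` are hypothetical and do not correspond to any existing Airflow parameter.

```python
import hashlib
from datetime import datetime, timedelta, timezone


def spread_delay(dag_id: str, max_delay: timedelta) -> timedelta:
    """Stable per-DAG delay between zero and max_delay, in whole seconds."""
    digest = hashlib.sha256(dag_id.encode()).digest()
    window = int(max_delay.total_seconds()) + 1
    return timedelta(seconds=int.from_bytes(digest[:8], "big") % window)


def plan_daily_run(dag_id: str, logical_date: datetime, max_delay: timedelta):
    """Keep the full-day data interval; shift only the moment the run starts."""
    interval_start = logical_date
    interval_end = logical_date + timedelta(days=1)
    run_after = interval_end + spread_delay(dag_id, max_delay)
    return interval_start, interval_end, run_after


day = datetime(2021, 1, 18, tzinfo=timezone.utc)
window = timedelta(minutes=15)
for dag_id in ("sales_report", "user_sync"):
    _, end, run_after = plan_daily_run(dag_id, day, window)
    print(dag_id, "starts", run_after - end, "after the interval ends")
```

The data interval still runs from midnight to midnight; only the start time moves within the 15-minute window.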

pgrandjean commented 3 years ago

What would be the impact of replacing "date intervals" with a CRON-like behaviour? What important features of Airflow would be lost?

eladkal commented 3 years ago

@pgrandjean you might want to take a look at AIP-39 Richer scheduler_interval

eladkal commented 2 years ago

Covered by AIP-39. You can use a Timetable to achieve similar functionality: https://airflow.apache.org/docs/apache-airflow/stable/concepts/timetable.html