Closed raj-manvar closed 2 years ago
I think this one might be difficult before switching to (or rather enabling) support for "regular" cron behaviour. Airflow does NOT work like cron, even if the specification is cron-like. Airflow works on "data intervals" rather than on a CRON schedule. It means that the time specified in the DAG is not the schedule, but rather an indication of which data interval Airflow should work on. It indicates the "beginning" of the data interval each run should work on. This means that Airflow starts at the midnight that ends Monday if you want to process Monday's data.
This is not intuitive, and we are discussing whether it should be allowed to run "cron jobs" regularly. And while I can imagine H () might be used there as well, it is going to be even more confusing (as the data interval should still cover the full day till midnight even if the job starts 15 minutes later). For me, having a "random delay" in the DAG definition as a separate parameter would be better.
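To make the data-interval point concrete, here is a minimal sketch in plain Python (no Airflow imports). The helper name `earliest_run_time` is made up for illustration and is not an Airflow API; it only assumes that a run becomes eligible once its data interval has ended.

```python
from datetime import datetime, timedelta


def earliest_run_time(data_interval_start: datetime, interval: timedelta) -> datetime:
    """Hypothetical helper: a run covering [start, start + interval)
    can only be triggered once that interval is over."""
    return data_interval_start + interval


# 2021-03-01 00:00 is the start of a Monday.
monday_start = datetime(2021, 3, 1)
earliest = earliest_run_time(monday_start, timedelta(days=1))

# The run that processes Monday's data is triggered at the midnight
# that ends Monday, i.e. at the very start of Tuesday.
print(monday_start.strftime("%A"), "->", earliest.strftime("%A %H:%M"))
```

So the date attached to the run is the start of the interval, while the wall-clock time at which the run actually starts is one full interval later.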
What would be the impact of replacing "data intervals" with CRON-like behaviour? What important features of Airflow would be lost?
@pgrandjean you might want to take a look at AIP-39 Richer scheduler_interval
Covered by AIP-39. You can use a Timetable to achieve similar functionality: https://airflow.apache.org/docs/apache-airflow/stable/concepts/timetable.html
Description
It could be beneficial for Airflow to support Jenkins' "H" cron syntax in its scheduling. The reason is to mitigate a stampede of tasks at the top of every hour (or other interval), which can currently strain resources, depending on the application.
"H" syntax specifies that the DAG should run during a window of time, allowing the scheduler to spread out jobs based on a hash value. For instance, the syntax "H(0-15)" means to schedule at any time in the first 15 minutes, and "H" means to schedule during any minute of the hour.
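As a rough illustration of the idea, a hash-based minute could be picked along these lines. This is only a sketch: the function name `hashed_minute` is hypothetical, using the DAG id as the hash input is an assumption, and it is not how Jenkins or Airflow actually implement anything.

```python
import hashlib


def hashed_minute(dag_id: str, low: int = 0, high: int = 59) -> int:
    """Map a DAG id to a stable minute in the inclusive range [low, high].

    Hypothetical helper: the same id always yields the same minute,
    while different ids tend to land on different minutes.
    """
    if not 0 <= low <= high <= 59:
        raise ValueError("expected 0 <= low <= high <= 59")
    digest = hashlib.sha256(dag_id.encode("utf-8")).digest()
    value = int.from_bytes(digest[:8], "big")
    return low + value % (high - low + 1)


# Roughly what "H(0-15)" would resolve to for a few made-up DAG ids.
for dag_id in ("daily_sales", "daily_users", "daily_events"):
    print(dag_id, "-> minute", hashed_minute(dag_id, 0, 15))
```

A cryptographic hash is used here instead of Python's built-in `hash()` because the latter is randomized per process for strings, so the chosen minute would change on every scheduler restart.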
Use case / motivation
The aim is to resolve the stampede of tasks occurring at a given hour of the day or at midnight on some day of the month. Currently we need to reserve more resources for Airflow to handle the peaks of many tasks trying to schedule at once. H syntax would help distribute load more evenly over time and save resources.
Are you willing to submit a PR?
Yup. From some code digging, it looks like Airflow does the crontab scheduling using some Python library. If the library already supports H syntax, it'd be simpler, but if not I'd need some more guidance / research support.
Related Issues