apache / airflow

Apache Airflow - A platform to programmatically author, schedule, and monitor workflows
https://airflow.apache.org/
Apache License 2.0

Add support for more than 1 cron exp per DAG #8649

Closed jeffolsi closed 2 years ago

jeffolsi commented 4 years ago

Description: Allow a DAG to accept a list of cron expressions and schedule it according to all of them, similar to how it can be done with cron jobs.

Use case / motivation: Some schedules, such as every 10 minutes between 16:30 and 18:10, cannot be expressed with a single cron expression. The idea is that a DAG could be scheduled by more than one cron expression without duplicating the DAG code or the DAG entry in the UI.

Even a simple schedule that is common for ETL, bi-weekly, cannot be expressed with a single cron expression: https://serverfault.com/questions/404398/how-to-schedule-a-biweekly-cronjob

BasPH commented 4 years ago

An immediate solution to your last sentence is to use timedelta. This is also supported: schedule_interval=timedelta(weeks=2).

jeffolsi commented 4 years ago

> An immediate solution to your last sentence is to use timedelta. This is also supported: schedule_interval=timedelta(weeks=2).

It's not the same. When you specify a cron expression, you guarantee that tasks will be fired when the time comes. If you use timedelta(weeks=2), you risk that a delay in running one task will delay the others, because it always looks for a 2-week difference from the last task.

To explain, let's use a daily schedule for simplicity: 2020-04-28 with 0 0 * * * - this will run every day:

2020-04-29 00:00:00 2020-05-01 00:00:00

Now let's say that Airflow was down and the run of 2020-04-29 00:00:00 started at 2020-04-29 04:00:00. The next run will still be at 2020-05-01 00:00:00.

On the other hand, with 2020-04-28 and timedelta(days=1), if the run of 2020-04-29 00:00:00 started at 2020-04-29 04:00:00, the next run will be at 2020-05-01 04:00:00. The whole schedule is shifted because of the delay!
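The concern above can be sketched as a small simulation. This is only a model of the two behaviours being contrasted (runs pinned to a fixed grid versus runs scheduled relative to when the previous run actually started); it does not use Airflow and makes no claim about what Airflow's scheduler really does. The function names `fixed_grid_runs` and `relative_runs` are made up for illustration.

```python
from datetime import datetime, timedelta


def fixed_grid_runs(first, interval, count):
    """Run times pinned to a fixed grid: first, first + interval, ..."""
    return [first + i * interval for i in range(count)]


def relative_runs(first, interval, count, delays):
    """Run times where each run is planned `interval` after the previous
    run actually started, so any delay carries forward."""
    runs = []
    planned = first
    for i in range(count):
        actual = planned + delays.get(i, timedelta(0))
        runs.append(actual)
        planned = actual + interval
    return runs


first = datetime(2020, 4, 29)
day = timedelta(days=1)
delays = {0: timedelta(hours=4)}  # assumption: the first run starts 4 hours late

grid = fixed_grid_runs(first, day, 3)
drifted = relative_runs(first, day, 3, delays)

for on_grid, shifted in zip(grid, drifted):
    print(on_grid, "|", shifted)
```

In the first column the delay has no effect on later runs; in the second column every later run inherits the 4-hour shift.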

BasPH commented 4 years ago

Can you provide an example (screenshot/code/whatever) where that happens? As far as I know, the next execution date is always computed with the start_date and schedule_interval, not the execution date of the last DAG run.

jeffolsi commented 4 years ago

@BasPH This is the DAG definition:

with DAG(
    dag_id=DAG_NAME,
    default_args=default_args,
    schedule_interval=timedelta(minutes=60),
    max_active_runs=1,
    catchup=False
) as dag:

This is an example of the execution times: [screenshot: delay]

As you can see, this DAG is hourly via timedelta(minutes=60), but that is not the same as specifying @hourly or 0 * * * *. You can also see the gap in times (marked in red) when Airflow was down. When it came back up, it gave a "new" timestamp to the execution_date.

I'm sure you can understand that there is no business logic behind the timestamp of XX:46:10.998426.

So, as said before, timedelta(minutes=60) is not equivalent to @hourly or a cron expression.

BasPH commented 4 years ago

Thanks for pointing this out @jeffolsi, that indeed makes no sense and seems like a fundamental error which should be fixed. What version are you running on? Let's make a separate issue for it.

Regarding the multiple cron expressions, I've seen the request multiple times and think it would be a good addition. The apscheduler library has something for combining intervals: https://apscheduler.readthedocs.io/en/stable/modules/triggers/combining.html. I think similar behaviour would be nice to integrate in Airflow too.

jeffolsi commented 4 years ago

@BasPH I'm running 1.10.3. I'm not sure what exactly to report in the new issue. I don't consider this a bug, but maybe I'm wrong. I just wanted to explain why the suggestion to use timedelta() does not solve this issue, so Airflow needs to support multiple cron expressions for a single DAG.

I think this is a very important feature for Airflow.

themantalope commented 4 years ago

@BasPH @jeffolsi

Came across a simple implementation for combining multiple cron strings and croniter objects here

mdediana commented 4 years ago

I would like to work on this.

The idea would be to allow a list of cron expressions as a schedule_interval. For example, the scheduling in the description would be defined as schedule_interval = ['30/10 16 * * *', '*/10 17 * * *', '0,10 18 * * *']. Do you think this is the way to go?
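As a quick sanity check of the list above, here is a minimal sketch that expands only the minute and hour fields of each expression and confirms that their union is "every 10 minutes from 16:30 to 18:10". The helpers `expand_field` and `daily_times` are hypothetical, not part of Airflow or croniter, and the `a/n` form is assumed to mean "from a to the field maximum, every n", which not every cron implementation accepts.

```python
def expand_field(field, low, high):
    """Expand one cron field into the set of integers it matches.

    Supports '*', 'a', 'a-b', comma lists, and '/n' steps.
    """
    values = set()
    for part in field.split(","):
        base, _, step_text = part.partition("/")
        step = int(step_text) if step_text else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            start_text, end_text = base.split("-")
            start, end = int(start_text), int(end_text)
        else:
            start = int(base)
            # Assumption: "a/n" means "from a up to the field maximum, every n".
            end = high if step_text else start
        values.update(range(start, end + 1, step))
    return values


def daily_times(cron_expressions):
    """Sorted (hour, minute) pairs fired each day by the union of the expressions."""
    times = set()
    for expr in cron_expressions:
        minute, hour, dom, month, dow = expr.split()
        if (dom, month, dow) != ("*", "*", "*"):
            raise ValueError("this sketch only handles daily schedules")
        for h in expand_field(hour, 0, 23):
            for m in expand_field(minute, 0, 59):
                times.add((h, m))
    return sorted(times)


schedule = ['30/10 16 * * *', '*/10 17 * * *', '0,10 18 * * *']
times = daily_times(schedule)
print(times)
```

The three expressions do not overlap, so the union has exactly one entry per 10-minute slot in the window.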

mik-laj commented 4 years ago

@mdediana We had long discussions about whether to support multiple scheduler intervals. Many people think that this can affect the presentation and readability of the collected data. This can also complicate the scheduler logic. Can you describe your idea on the mailing list?

themantalope commented 4 years ago

@mik-laj

I would recommend that the user be allowed to supply a list of cron strings, or cron strings with comma separation. I would then implement an object that has internal logic like this implementation of scheduling with multiple croniter objects. The object should also have a get_next() function similar to the one currently used by the DAG object (see following implementation). If just one cron string is supplied, then the DAG uses the croniter object as is currently implemented.
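A minimal sketch of that idea: a wrapper that holds several schedules and returns the earliest next fire time among them. `DailyAt` is a hypothetical stand-in for a cron iterator, and `MultiSchedule` with its `get_next(after)` signature is an assumption made for this example; it is not croniter's or Airflow's actual interface.

```python
from datetime import datetime, timedelta


class DailyAt:
    """Hypothetical stand-in for a cron iterator: fires daily at hour:minute."""

    def __init__(self, hour, minute):
        self.hour = hour
        self.minute = minute

    def get_next(self, after):
        """Return the first fire time strictly later than `after`."""
        candidate = after.replace(
            hour=self.hour, minute=self.minute, second=0, microsecond=0
        )
        if candidate <= after:
            candidate += timedelta(days=1)
        return candidate


class MultiSchedule:
    """Combine several schedules; the next run is the earliest of their next runs."""

    def __init__(self, schedules):
        self.schedules = list(schedules)
        if not self.schedules:
            raise ValueError("at least one schedule is required")

    def get_next(self, after):
        return min(s.get_next(after) for s in self.schedules)


combined = MultiSchedule([DailyAt(1, 0), DailyAt(8, 15), DailyAt(13, 30)])

# Walk forward from midnight and collect the next four run times.
runs = []
cursor = datetime(2020, 5, 1, 0, 0)
for _ in range(4):
    cursor = combined.get_next(cursor)
    runs.append(cursor)
print(runs)
```

With a single schedule in the list, `MultiSchedule` behaves exactly like that schedule, which matches the suggestion that the single-cron case stays unchanged.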

mdediana commented 4 years ago

@mik-laj Sure, I will do that, thanks.

tambulkar commented 4 years ago

Is there any update on this?

sarit-si commented 4 years ago

@mdediana

> I would like to work on this.
>
> The idea would be to allow a list of cron expressions as a schedule_interval. For example, the scheduling in the description would be defined as schedule_interval = ['30/10 16 * * *', '*/10 17 * * *', '0,10 18 * * *']. Do you think this is the way to go?

This will be of great help. Instead of creating separate DAGs for the same job (which is what I am currently doing), this would reduce it to just 1 DAG taking care of multiple schedules. One workaround right now: if the run times are not strict, one can tweak them so that the minute field is the same for all, for example "45 0,8,13 * * *", which will run at 0045, 0845 and 1345 Hrs. Unfortunately, the run times in my case are strict (0100, 0815 and 1330 Hrs), hence I have to create 3 separate DAGs. Enabling schedule interval to accept a list of crons would be very helpful :) 👍
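To illustrate why the workaround only helps when the minutes line up, here is a small sketch that groups daily run times by their minute and builds one cron expression per group. The helper `cron_expressions_for` is hypothetical, and grouping by minute is just one simple strategy; it is not guaranteed to produce the fewest possible expressions in general.

```python
from collections import defaultdict


def cron_expressions_for(times):
    """Build one daily cron expression per distinct minute in 'HH:MM' times."""
    hours_by_minute = defaultdict(list)
    for text in times:
        hour, minute = (int(part) for part in text.split(":"))
        hours_by_minute[minute].append(hour)
    return [
        f"{minute} {','.join(str(h) for h in sorted(hours))} * * *"
        for minute, hours in sorted(hours_by_minute.items())
    ]


relaxed = cron_expressions_for(["00:45", "08:45", "13:45"])
strict = cron_expressions_for(["01:00", "08:15", "13:30"])

print(relaxed)  # one expression covers all three runs
print(strict)   # three expressions, hence three DAGs today
```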

ashb commented 3 years ago

I've started a discussion thread on this on the dev mailing list to scope out what a solution to this will look like https://lists.apache.org/thread.html/rb4e004e68574e5fb77ee5b51f4fd5bfb4b3392d884c178bc767681bf%40%3Cdev.airflow.apache.org%3E

Use cases there would be ace (and feedback once we come up with a design)

eladkal commented 2 years ago

I think the request as described here (bi-weekly job) is covered fully by AIP 39 already using Timetables https://airflow.apache.org/docs/apache-airflow/stable/concepts/timetable.html

Closing as issue solved