Closed jeffolsi closed 2 years ago
An immediate solution to your last sentence is to use timedelta. This is also supported: schedule_interval=timedelta(weeks=2).
It's not the same. When you specify a cron expression, you guarantee that tasks will be fired when the time comes. If you use timedelta(weeks=2), you risk that a delay in one run will delay all the later ones, because it always looks for a 2-week difference from the last run.
To explain, let's use daily for simplicity. With start_date 2020-04-28 and schedule 0 0 * * *, this will run every day:
2020-04-29 00:00:00, 2020-04-30 00:00:00, 2020-05-01 00:00:00
Now let's say that Airflow was down and the run of 2020-04-29 00:00:00 started on 2020-04-29 04:00:00. The next run will still be on 2020-04-30 00:00:00.
On the other hand, with start_date 2020-04-28 and timedelta(days=1), if the run of 2020-04-29 00:00:00 started on 2020-04-29 04:00:00, the next run will be on 2020-04-30 04:00:00. The whole schedule is shifted because of the delay!
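The difference being claimed here can be sketched in a few lines. This is only an illustration of the two behaviours described in the comment, not Airflow's scheduler code; `next_midnight` and `next_by_delta` are hypothetical helper names.

```python
from datetime import datetime, timedelta


def next_midnight(after: datetime) -> datetime:
    """Next wall-clock midnight strictly after `after` (what `0 0 * * *` asks for)."""
    midnight = after.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight + timedelta(days=1)


def next_by_delta(last_actual_start: datetime, interval: timedelta) -> datetime:
    """Hypothetical behaviour: measure the interval from the last actual start."""
    return last_actual_start + interval


# The 2020-04-29 00:00 run actually started four hours late.
delayed_start = datetime(2020, 4, 29, 4, 0)

cron_next = next_midnight(delayed_start)
delta_next = next_by_delta(delayed_start, timedelta(days=1))

print(cron_next)   # 2020-04-30 00:00:00 (the delay is absorbed)
print(delta_next)  # 2020-04-30 04:00:00 (the delay carries over)
```

Whether Airflow really behaves like `next_by_delta` is exactly what the following replies discuss.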
Can you provide an example (screenshot/code/whatever) where that happens? As far as I know, the next execution date is always computed with the start_date and schedule_interval, not the execution date of the last DAG run.
@BasPH This is the DAG definition:
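    from datetime import timedelta

    from airflow import DAG

    # DAG_NAME and default_args are defined elsewhere in the original file.
    with DAG(
        dag_id=DAG_NAME,
        default_args=default_args,
        schedule_interval=timedelta(minutes=60),  # hourly, expressed as an interval
        max_active_runs=1,
        catchup=False,
    ) as dag:
        ...  # task definitions omitted in the original post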
This is an example for the execution times:
As you can see, this DAG is hourly via timedelta(minutes=60), but that is not the same as specifying @hourly or 0 * * * *. You can also see the gap in times (marked in red) when Airflow was down. When it came back up, it gave a "new" timestamp to the execution_date.
I'm sure you can understand that there is no business logic behind a timestamp of XX:46:10.998426.
So, as said before, timedelta(minutes=60) is not equivalent to @hourly or a cron expression.
Thanks for pointing this out @jeffolsi, that indeed makes no sense and seems like a fundamental error which should be fixed. What version are you running on? Let's make a separate issue for it.
Regarding the multiple cron expressions, I've seen the request multiple times and think it would be a good addition. The apscheduler library has something for combining intervals: https://apscheduler.readthedocs.io/en/stable/modules/triggers/combining.html. I think similar behaviour would be nice to integrate in Airflow too.
@BasPH I'm running 1.10.3.
I'm not sure what exactly to report in the new issue. I don't consider this a bug, but maybe I'm wrong. I just wanted to explain why the suggestion to use timedelta() does not solve this issue, so Airflow needs to support multiple cron expressions for a single DAG.
I think this is a very important feature for Airflow.
@BasPH @jeffolsi
Came across a simple implementation for combining multiple cron strings and croniter objects here
I would like to work on this.
The idea would be to allow a list of cron expressions as a schedule_interval. For example, the scheduling in the description would be defined as schedule_interval = ['30/10 16 * * *', '*/10 17 * * *', '0,10 18 * * *']. Do you think this is the way to go?
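To see which run times that list would denote, here is a minimal sketch that expands only the minute and hour fields with the standard library. It assumes the remaining three fields are `*` and reads `30/10` as `30-59/10`, an extension that not every cron implementation accepts. `expand_field` and `daily_times` are hypothetical helpers, not part of Airflow or croniter.

```python
from datetime import time


def expand_field(field: str, low: int, high: int) -> set:
    """Expand one cron field such as '*/10', '30/10' or '0,10' into its values."""
    values = set()
    for part in field.split(","):
        base, _, step = part.partition("/")
        step_n = int(step) if step else 1
        if base == "*":
            start, end = low, high
        elif "-" in base:
            first, last = base.split("-")
            start, end = int(first), int(last)
        else:
            start = int(base)
            # 'N/step' runs from N to the field maximum; a bare 'N' is just N.
            end = high if step else start
        values.update(range(start, end + 1, step_n))
    return values


def daily_times(expr: str) -> set:
    """Times of day matched by `expr`, assuming day, month and weekday are all '*'."""
    minute, hour, *_ = expr.split()
    return {
        time(h, m)
        for h in expand_field(hour, 0, 23)
        for m in expand_field(minute, 0, 59)
    }


schedule = ['30/10 16 * * *', '*/10 17 * * *', '0,10 18 * * *']
fire_times = sorted(set().union(*(daily_times(e) for e in schedule)))
print([t.strftime("%H:%M") for t in fire_times])
# ['16:30', '16:40', '16:50', '17:00', '17:10', '17:20',
#  '17:30', '17:40', '17:50', '18:00', '18:10']
```

The union of the three expressions yields every 10 minutes from 16:30 to 18:10, which matches the use case in the issue description.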
@mdediana We had long discussions about whether to support multiple scheduler intervals. Many people think that this can affect the presentation and readability of the collected data. This can also complicate the scheduler logic. Can you describe your idea on the mailing list?
@mik-laj
I would recommend that the user be allowed to supply a list of cron strings, or cron strings separated by commas. I would then implement an object that has internal logic like this implementation of scheduling with multiple croniter objects. The object should also have a get_next() function similar to the one currently used by the DAG object (see following implementation). If just one cron string is supplied, then the DAG uses the croniter object as is currently implemented.
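A rough sketch of such an object, under stated assumptions: each schedule is represented by a plain generator of ascending datetimes (a stand-in for one croniter iterator), and a hypothetical `MultiSchedule` class merges them behind a single get_next(). Neither `daily_at` nor `MultiSchedule` exists in Airflow or croniter.

```python
import heapq
from datetime import datetime, timedelta
from itertools import count


def daily_at(hour: int, minute: int, start: datetime):
    """Yield every daily occurrence of hour:minute strictly after `start`."""
    first = start.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if first <= start:
        first += timedelta(days=1)
    for n in count():
        yield first + timedelta(days=n)


class MultiSchedule:
    """Hypothetical wrapper exposing one get_next() over several sorted streams."""

    def __init__(self, streams):
        # heapq.merge lazily interleaves already-sorted iterables in order.
        self._merged = heapq.merge(*streams)
        self._last = None

    def get_next(self) -> datetime:
        candidate = next(self._merged)
        # Skip a time that more than one stream produced.
        while candidate == self._last:
            candidate = next(self._merged)
        self._last = candidate
        return candidate


start = datetime(2020, 4, 28, 9, 0)
combined = MultiSchedule([
    daily_at(1, 0, start),
    daily_at(8, 15, start),
    daily_at(13, 30, start),
])
first_three = [combined.get_next() for _ in range(3)]
print(first_three[0])  # 2020-04-28 13:30:00
print(first_three[1])  # 2020-04-29 01:00:00
print(first_three[2])  # 2020-04-29 08:15:00
```

With a single stream the wrapper simply passes values through, which mirrors the suggestion to keep the existing single-expression behaviour unchanged.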
@mik-laj Sure, I will do that, thanks.
Is there any update on this?
@mdediana
I would like to work on this.
The idea would be to allow a list of cron expressions as a schedule_interval. For example, the scheduling in the description would be defined as schedule_interval = ['30/10 16 * * *', '*/10 17 * * *', '0,10 18 * * *']. Do you think this is the way to go?
This will be of great help. Instead of creating separate DAGs for the same job (which is what I am currently doing), this would reduce it to just 1 DAG taking care of multiple schedules. One workaround right now, if the crons are not strict, is to tweak them so that the minutes field is the same for all. For example, "45 0,8,13 * * *" will run at 0045, 0845 and 1345 hrs respectively. Unfortunately, the crons in my case are strict (0100, 0815 and 1330 hrs), so I have to create 3 separate DAGs. Enabling schedule_interval to accept a list of crons would be very helpful :) 👍
I've started a discussion thread on this on the dev mailing list to scope out what a solution to this will look like https://lists.apache.org/thread.html/rb4e004e68574e5fb77ee5b51f4fd5bfb4b3392d884c178bc767681bf%40%3Cdev.airflow.apache.org%3E
Use cases there would be ace (and feedback once we come up with a design)
I think the request as described here (bi-weekly job) is covered fully by AIP 39 already using Timetables https://airflow.apache.org/docs/apache-airflow/stable/concepts/timetable.html
Closing as issue solved
Description
Allow a DAG to accept a list of cron expressions and schedule the DAG according to all of them, similar to how it can be done in a cron job.
Use case / motivation
Some schedules, like every 10 min between 16:30 and 18:10, cannot be obtained with a single cron expression. The idea is that a DAG can be scheduled according to more than 1 cron expression, but without duplicating the DAG code or the DAG entry in the UI.
Even simple scheduling that is common for ETL, such as bi-weekly, cannot be done with a single cron expression: https://serverfault.com/questions/404398/how-to-schedule-a-biweekly-cronjob
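For the bi-weekly case, one way to avoid drift without cron is to count whole two-week periods from a fixed anchor date rather than from the last run. The sketch below only illustrates that arithmetic; `next_biweekly` is a hypothetical helper, the anchor date is an arbitrary Monday, and this is not how Airflow or Timetables implement it.

```python
from datetime import datetime, timedelta


def next_biweekly(anchor: datetime, now: datetime) -> datetime:
    """Return the first `anchor + k * 2 weeks` that is strictly after `now`."""
    period = timedelta(weeks=2)
    if now < anchor:
        return anchor
    # Whole periods elapsed since the anchor; timedelta // timedelta gives an int.
    periods_elapsed = (now - anchor) // period
    return anchor + (periods_elapsed + 1) * period


anchor = datetime(2020, 4, 27)  # assumed anchor, a Monday at 00:00

# A run that starts late does not move the following slot.
print(next_biweekly(anchor, datetime(2020, 5, 11, 4, 0)))  # 2020-05-25 00:00:00
print(next_biweekly(anchor, datetime(2020, 5, 24, 23, 0)))  # 2020-05-25 00:00:00
```

Because every slot is derived from the anchor, a delayed run shifts nothing downstream, which is the property the cron-based request is after.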