Open charlie-costanzo opened 1 month ago
@charlie-costanzo tl;dr: I think the easiest fix would be to set the start_date
for the DAG to a static value.
The Monday runs are getting executed -- it's the Saturday runs that are getting skipped. Because of how Airflow's scheduler works, each run gets executed at the end of its interval (i.e., at the beginning of the next schedule interval).
It looks like the Saturday runs are getting scheduled (as you can see for the Aug 03 run in the following screenshot) just like the Monday through Friday runs:
But they're never actually getting executed (as you can see by the missing Aug 03 run in the following screenshot):
My hypothesis is that this is because of the start_date
definition in the DAG metadata:
https://github.com/cal-itp/data-infra/blob/796a3e178d6d214fa58bc2da67215d5ede577a49/airflow/dags/transform_warehouse/METADATA.yml#L1-L8
What I think is happening here is that the Airflow scheduler gets restarted a few times every day (which is fine in theory), but when it gets restarted, the start_date
for the DAG gets reset to 1 day ago (i.e., !days_ago 1
). Since there are two days between when a Saturday run gets scheduled and when it gets executed, the start date is always too late. For example, in this case, on 08-03 the 08-03 run gets scheduled for 08-05. On 08-05, before the run executes, the scheduler restarts and sets the start_time
of the DAG to some time on 08-04, thus invalidating the 08-03 run (it's before the start time). This is why the Airflow FAQs recommend against dynamic start_date
values: https://airflow.apache.org/docs/apache-airflow/2.0.1/faq.html#what-s-the-deal-with-start-date. It rarely makes a difference, but when it does it's perplexing.
Hmm, looks like the crontab definition ("0m 14h mon-sat") is skipping the run that would happen on Saturday (Sunday UTC). Here's the breakdown: https://crontab.guru/#0_14_*_*_1-6.
Potential resolution: set schedule_interval: "0 14 * * *"
I was checking the logs and the transform_warehouse_full_refresh_sunday) is running correctly on Sundays:
And the start_date for the full refresh was already changed: https://github.com/cal-itp/data-infra/blob/796a3e178d6d214fa58bc2da67215d5ede577a49/airflow/dags/transform_warehouse_full_refresh_sunday/METADATA.yml#L8
But there is definitely a problem with the date for the transform_warehouse_full_refresh):
The scheduling and running dates are different and it is skipping Mondays. So we can also try to change the start_date as @mjumbewu pointed.
I found more information about how dates work on a DAG and sounds like the date difference I pointed in the previous comment:
What does execution_date mean?
Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if you want to summarize data for 2016-02-19, you would do it at 2016-02-20 midnight UTC, which would be right after all data for 2016-02-19 becomes available. This interval between midnights of 2016-02-19 and 2016-02-20 is called the data interval, and since it represents data in the date of 2016-02-19, this date is also called the run’s logical date, or the date that this DAG run is executed for, thus execution date.
So on this history for transform_warehouse
, on Monday "2024-09-02" the DAG run id and execution_date shows the same date ("2024-09-02"), but it only runs on "2024-09-03"(Tuesday) as you can also see on TIMESTAMP, start_date, and end_date).
Then the same happens for the next days until Saturday that it also runs the day before, Friday.
Sunday is not on the transform_warehouse
schedule, so this would mean that we are actually missing Saturdays.
In this case, since the transform_warehouse_full_refresh
is scheduled for Sundays anyway, it would be running everything and we would not be missing anything?
What do you think @charlie-costanzo @mjumbewu? Does it sounds correct?
After testing and talking to @mjumbewu and @charlie-costanzo we found that the problem is the dynamic start date. Since it is set to One Day Ago, there is nothing running on Mondays because the data internal for Saturday it is more than one day ago.
I created a PR setting a real start_date, waiting for review to merge.
Describe the bug Airflow/Composer
transform_warehouse
DAG appears to not be running on MondaysTo Reproduce Airflow/Composer GUI schedule history shows
transform_warehouse
DAG not running on Mondays, although schedule interval is defined as0 14 * * 1-6
Expected behavior DAG should run every day, Monday-Saturday