Description
The transform_warehouse DAG is not running on Mondays; this PR fixes issue #3420.
Based on Airflow's scheduling concepts, Monday's run should process the Saturday–Sunday data interval, but because start_date was configured as "one day ago", there is nothing to run on Mondays: Sundays are not part of the schedule.
As @mjumbewu mentioned on the issue, and after working through the problem with him, we found that setting a real (fixed) start date fixes the problem.
A previous issue was fixed the same way: https://github.com/cal-itp/data-infra/pull/3323
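The fix can be sketched as follows (a stdlib-only illustration, not the actual cal-itp DAG code; the variable names and dates are hypothetical). The point is that a relative start_date is re-evaluated every time the DAG file is parsed, while a fixed date never moves:

```python
from datetime import datetime, timedelta, timezone

# Before (problematic): "one day ago" is recomputed on every DAG parse, so
# the start of the first data interval keeps sliding forward, and a Sunday
# interval never becomes eligible to produce a Monday run.
start_date_before = datetime.now(timezone.utc) - timedelta(days=1)

# After (the fix): a real, fixed calendar date in the past. This anchors the
# schedule so every interval after it, weekends included, is well defined.
start_date_after = datetime(2023, 1, 1, tzinfo=timezone.utc)  # illustrative date

assert start_date_after < start_date_before  # the fixed date never moves forward
```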
For more context, see the screenshot below: the left side lists dates from Monday to Friday, but the runs actually execute from Tuesday to Saturday, as the TIMESTAMP, Start date, and End date columns show.
Airflow scheduling works differently from regular cron-style "jobs". To clarify, here is what the documentation says:
What does execution_date mean?
Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if you want to summarize data for 2016-02-19, you would do it at 2016-02-20 midnight UTC, which would be right after all data for 2016-02-19 becomes available. This interval between midnights of 2016-02-19 and 2016-02-20 is called the data interval, and since it represents data in the date of 2016-02-19, this date is also called the run’s logical date, or the date that this DAG run is executed for, thus execution date.
Dates Concept from Airflow
All dates in Airflow are tied to the data interval concept in some way. The “logical date” (also called execution_date in Airflow versions prior to 2.2) of a DAG run, for example, denotes the start of the data interval, not when the DAG is actually executed.
Similarly, since the start_date argument for the DAG and its tasks points to the same logical date, it marks the start of the DAG’s first data interval, not when tasks in the DAG will start running. In other words, a DAG run will only be scheduled one interval after start_date.
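The quoted example can be worked through with plain datetime arithmetic (a stdlib sketch of the scheduling rule, not actual Airflow API calls):

```python
from datetime import datetime, timedelta

# For a daily schedule, the run with logical date 2016-02-19 covers the data
# interval [2016-02-19 00:00, 2016-02-20 00:00) and is only scheduled once
# that interval has fully elapsed.
logical_date = datetime(2016, 2, 19)   # start of the data interval
interval = timedelta(days=1)

data_interval_end = logical_date + interval
earliest_run_time = data_interval_end  # one full interval after the start

assert earliest_run_time == datetime(2016, 2, 20)
```

The same rule explains the start_date behavior: the first DAG run only appears one full interval after start_date, which is why a start_date that keeps moving forward never yields a run.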
Type of change
[X] Bug fix (non-breaking change which fixes an issue)
[ ] New feature
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[ ] Documentation
How has this been tested?
Tested schedules on Airflow locally.
Post-merge follow-ups
[ ] No action required
[X] Actions required (specified below)
Monitor production Airflow to ensure that the changes take effect as expected.