cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Fix transform_warehouse schedule to run on Mondays #3456

Closed erikamov closed 1 month ago

erikamov commented 2 months ago

Description

The transform_warehouse DAG is not running on Mondays. This PR is intended to fix that problem, reported in issue #3420.

Under Airflow's data interval concept, the Monday run should process the Saturday-Sunday data interval. However, because start_date is configured as only one day ago, there is nothing to run on Mondays, since Sundays are not part of the schedule.

As @mjumbewu mentioned on the issue, and after going through the problem with him, we found that setting a real start date would fix the problem.

There is also a previous issue that was fixed the same way: https://github.com/cal-itp/data-infra/pull/3323

For more information about the problem, see the screenshot below. The left side shows a list of dates from Monday to Friday, but the runs actually happen from Tuesday to Saturday, as the TIMESTAMP, Start date, and End date values show.

[Screenshot 2024-09-10 at 11:17:00 AM]

Airflow schedules work differently from regular "jobs". To clarify, here is what the documentation says:

What does execution_date mean?

Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if you want to summarize data for 2016-02-19, you would do it at 2016-02-20 midnight UTC, which would be right after all data for 2016-02-19 becomes available. This interval between midnights of 2016-02-19 and 2016-02-20 is called the data interval, and since it represents data in the date of 2016-02-19, this date is also called the run’s logical date, or the date that this DAG run is executed for, thus execution date.

Dates concept from Airflow

All dates in Airflow are tied to the data interval concept in some way. The “logical date” (also called execution_date in Airflow versions prior to 2.2) of a DAG run, for example, denotes the start of the data interval, not when the DAG is actually executed.

Similarly, since the start_date argument for the DAG and its tasks points to the same logical date, it marks the start of the DAG’s first data interval, not when tasks in the DAG will start running. In other words, a DAG run will only be scheduled one interval after start_date.
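To make the interaction between start_date and the logical date concrete, here is a minimal sketch in plain Python. It is not Airflow's scheduler: it assumes a once-a-day, Monday-to-Friday schedule (the DAG's real schedule is not shown in this PR), and the helper names `previous_weekday` and `would_run` are hypothetical. It applies only the rule quoted above, that a run is created when its logical date (the start of its data interval) is on or after start_date.

```python
from datetime import date, timedelta


def previous_weekday(day: date) -> date:
    """Return the latest Monday-Friday date strictly before `day`."""
    day -= timedelta(days=1)
    while day.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        day -= timedelta(days=1)
    return day


def would_run(run_day: date, start_date: date) -> bool:
    """Simplified model (an assumption, not Airflow code).

    The run that executes on `run_day` has a logical date equal to the
    previous scheduled day. It is created only if that logical date is
    on or after `start_date`.
    """
    if run_day.weekday() >= 5:
        return False  # the assumed schedule has no weekend runs
    logical_date = previous_weekday(run_day)
    return logical_date >= start_date


monday = date(2024, 9, 9)    # a Monday
tuesday = date(2024, 9, 10)  # the following Tuesday

# Rolling start_date: always "one day ago" relative to the run day.
rolling_monday = would_run(monday, monday - timedelta(days=1))
rolling_tuesday = would_run(tuesday, tuesday - timedelta(days=1))

# Fixed start_date well in the past (the date itself is illustrative).
fixed_monday = would_run(monday, date(2024, 1, 1))

print(rolling_monday, rolling_tuesday, fixed_monday)  # False True True
```

In this model the Monday run's logical date is the previous Friday, which falls before a start_date of Sunday, so the run is never created. With a fixed start_date in the past, the same run qualifies.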

Type of change

How has this been tested?

Tested schedules on Airflow locally.

Post-merge follow-ups

Monitor production Airflow to ensure that the changes take effect as expected.