cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
47 stars 12 forks source link

Bug: `transform_warehouse` DAG appears to not be running on Mondays #3420

Open charlie-costanzo opened 1 month ago

charlie-costanzo commented 1 month ago

Describe the bug Airflow/Composer transform_warehouse DAG appears to not be running on Mondays

To Reproduce Airflow/Composer GUI schedule history shows transform_warehouse DAG not running on Mondays, although schedule interval is defined as 0 14 * * 1-6

Expected behavior DAG should run every day, Monday-Saturday

mjumbewu commented 1 month ago

@charlie-costanzo tl;dr: I think the easiest fix would be to set the start_date for the DAG to a static value.


The Monday runs are getting executed -- it's the Saturday runs that are getting skipped. Because of how Airflow's scheduler works, each run gets executed at the end of its interval (i.e., at the beginning of the next schedule interval).

It looks like the Saturday runs are getting scheduled (as you can see for the Aug 03 run in the following screenshot) just like the Monday through Friday runs: image

But they're never actually getting executed (as you can see by the missing Aug 03 run in the following screenshot): image

My hypothesis is that this is because of the start_date definition in the DAG metadata: https://github.com/cal-itp/data-infra/blob/796a3e178d6d214fa58bc2da67215d5ede577a49/airflow/dags/transform_warehouse/METADATA.yml#L1-L8

What I think is happening here is that the Airflow scheduler gets restarted a few times every day (which is fine in theory), but when it gets restarted, the start_date for the DAG gets reset to 1 day ago (i.e., !days_ago 1). Since there are two days between when a Saturday run gets scheduled and when it gets executed, the start date is always too late. For example, in this case, on 08-03 the 08-03 run gets scheduled for 08-05. On 08-05, before the run executes, the scheduler restarts and sets the start_time of the DAG to some time on 08-04, thus invalidating the 08-03 run (it's before the start time). This is why the Airflow FAQs recommend against dynamic start_date values: https://airflow.apache.org/docs/apache-airflow/2.0.1/faq.html#what-s-the-deal-with-start-date. It rarely makes a difference, but when it does it's perplexing.

ohrite commented 4 weeks ago

Hmm, looks like the crontab definition ("0m 14h mon-sat") is skipping the run that would happen on Saturday (Sunday UTC). Here's the breakdown: https://crontab.guru/#0_14_*_*_1-6.

Potential resolution: set schedule_interval: "0 14 * * *"

erikamov commented 1 week ago

I was checking the logs and the transform_warehouse_full_refresh_sunday) is running correctly on Sundays: Image

And the start_date for the full refresh was already changed: https://github.com/cal-itp/data-infra/blob/796a3e178d6d214fa58bc2da67215d5ede577a49/airflow/dags/transform_warehouse_full_refresh_sunday/METADATA.yml#L8

But there is definitely a problem with the date for the transform_warehouse_full_refresh): Image

The scheduling and running dates are different and it is skipping Mondays. So we can also try to change the start_date as @mjumbewu pointed.

erikamov commented 1 week ago

I found more information about how dates work on a DAG and sounds like the date difference I pointed in the previous comment:

What does execution_date mean?

Airflow was developed as a solution for ETL needs. In the ETL world, you typically summarize data. So, if you want to summarize data for 2016-02-19, you would do it at 2016-02-20 midnight UTC, which would be right after all data for 2016-02-19 becomes available. This interval between midnights of 2016-02-19 and 2016-02-20 is called the data interval, and since it represents data in the date of 2016-02-19, this date is also called the run’s logical date, or the date that this DAG run is executed for, thus execution date.

Image

So on this history for transform_warehouse, on Monday "2024-09-02" the DAG run id and execution_date shows the same date ("2024-09-02"), but it only runs on "2024-09-03"(Tuesday) as you can also see on TIMESTAMP, start_date, and end_date).

Then the same happens for the next days until Saturday that it also runs the day before, Friday. Sunday is not on the transform_warehouse schedule, so this would mean that we are actually missing Saturdays. In this case, since the transform_warehouse_full_refresh is scheduled for Sundays anyway, it would be running everything and we would not be missing anything?

What do you think @charlie-costanzo @mjumbewu? Does it sounds correct?

erikamov commented 22 hours ago

After testing and talking to @mjumbewu and @charlie-costanzo we found that the problem is the dynamic start date. Since it is set to One Day Ago, there is nothing running on Mondays because the data internal for Saturday it is more than one day ago.

I created a PR setting a real start_date, waiting for review to merge.