This bug was surfaced by these issues: https://github.com/cal-itp/data-infra/issues/845, https://github.com/cal-itp/data-infra/issues/837
Here is a screencast of how I debugged it: https://www.loom.com/share/ec09a53e8e784d70b02aa65fe70cc136
Looking again, I'm realizing the issue is different. Basically, we have two loops: an outer loop over feeds and an inner loop over each feed's tables. However, the list of tables we iterate over in the inner loop is created using `zip`. Because `zip` returns a one-shot iterator, and it's created outside of both loops, it exhausts itself after the first feed, so the rest of the feeds don't do any processing. I think the remedy is the same as in my original post, but we need to wrap the `zip` call in a list.
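A minimal sketch of the failure mode (the names here are illustrative, not the actual loader code):

```python
feeds = ["feed_a", "feed_b", "feed_c"]
tables = ["agency", "routes", "stops"]
paths = ["agency.txt", "routes.txt", "stops.txt"]

# BUG: zip() returns a one-shot iterator. The first feed's inner loop
# consumes it entirely, so feed_b and feed_c iterate over nothing.
table_details = zip(tables, paths)
for feed in feeds:
    for table, path in table_details:
        print("processing", feed, table, path)  # only prints for feed_a

# FIX: materialize the pairs once, so every feed can iterate over them.
table_details = list(zip(tables, paths))
for feed in feeds:
    for table, path in table_details:
        print("processing", feed, table, path)  # prints for every feed
```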
> zip is an iterator

mind blown
So yeah, I definitely introduced this regression in 926fa255331d440a30b0031a189e0f542fd9fd3e. I saw that the `table_details` variable was seemingly being redeclared unnecessarily multiple times in a nested scope, so I naively moved it out to the scope where its inputs were defined. However, as noted, the `zip` call creates an iterator, which causes problems later on.
I've opened https://github.com/cal-itp/data-infra/pull/872, which should fix the problem I caused. Adding the asserts sounds like a good way to make sure this doesn't happen again.
Ah, thanks for the PR--it's definitely on me for not making the loader fail when feeds aren't there to load 😬.
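For reference, a minimal sketch of what such an assert could look like, assuming the loop structure described above (`load_table` and the sample data are hypothetical stand-ins):

```python
feeds = ["feed_a", "feed_b"]
table_details = list(zip(["agency", "routes"], ["agency.txt", "routes.txt"]))

def load_table(feed, table, path):  # stand-in for the real load step
    print("loading", feed, table, path)

for feed in feeds:
    n_loaded = 0
    for table, path in table_details:
        load_table(feed, table, path)
        n_loaded += 1
    # Fail loudly if a feed ends up with nothing to load, e.g. because
    # an iterator was exhausted or the feed's files are missing.
    assert n_loaded > 0, f"no tables loaded for {feed}"
```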
In our `gtfs_schedule_history_load` task we skip loading `validation.json`, since it is handled by another task (`gtfs_schedule_validation_load`). However, our use of a `continue` erroneously causes the entire feed-loading loop to continue, skipping that feed's load. Affected days do not have processed data in `gtfs-data/schedule/processed`. This means they do not get included in our `gtfs_schedule` (latest) dataset, nor in our type 2 change tables.

Here is the line:
https://github.com/cal-itp/data-infra/blob/main/airflow/dags/gtfs_loader/gtfs_schedule_history_load.py#L63-L64
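To illustrate the `continue` vs `pass` semantics behind the proposed fix (file names here are illustrative; see the linked lines for the real code): `continue` abandons the rest of the current iteration, while `pass` is a no-op and lets execution fall through.

```python
files = ["agency.txt", "validation.json", "routes.txt"]

for fname in files:
    if fname == "validation.json":
        continue  # jumps straight to the next iteration: everything
                  # below this point is skipped for this file
    print("loading", fname)

for fname in files:
    if fname == "validation.json":
        pass      # a no-op: execution falls through to the code below
    print("loading", fname)  # runs for every file
```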
To fix:

- change `continue` to `pass` (edit: see latest comment about our use of `zip`)
- ensure `data_paths` exists (this line)
- re-run `gtfs_schedule_history2.merge_updates` across all time

Query to get affected days:
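The original query isn't reproduced here; as a rough, hedged sketch of the shape it might take (the column name `calitp_extracted_at` and the filtering predicate are assumptions, not taken from the repo):

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT DISTINCT calitp_extracted_at AS affected_day
FROM `gtfs_schedule_history.calitp_feed_updates`
-- plus a predicate selecting feed updates whose processed output is missing
ORDER BY affected_day
"""
for row in client.query(sql).result():
    print(row.affected_day)
```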
We'll know we've correctly put the necessary files in `processed` when every row of the table `gtfs_schedule_history.calitp_feed_updates` has a corresponding folder in the `schedule/processed` directory of our bucket.
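A hedged sketch of that check (the folder layout and the column names are assumptions, not read from the repo):

```python
from google.cloud import bigquery, storage

bq = bigquery.Client()
bucket = storage.Client().bucket("gtfs-data")

rows = bq.query(
    "SELECT calitp_itp_id, calitp_url_number, calitp_extracted_at"
    " FROM `gtfs_schedule_history.calitp_feed_updates`"
).result()

missing = []
for row in rows:
    # Assumed folder layout under schedule/processed/; adjust to the real one.
    prefix = (
        f"schedule/processed/{row.calitp_extracted_at}"
        f"_{row.calitp_itp_id}_{row.calitp_url_number}/"
    )
    # max_results=1: we only need to know whether anything exists under the prefix.
    if not any(bucket.list_blobs(prefix=prefix, max_results=1)):
        missing.append(prefix)

print(f"{len(missing)} feed updates lack processed output")
```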