cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
47 stars 12 forks source link

Bug: exhausted generator in code causes feed updates to skip loading #848

Closed machow closed 2 years ago

machow commented 2 years ago

In our gtfs_schedule_history_load we skip loading validation.json, since it is handled in another task (gtfs_schedule_validation_load). However, our use of a continue erroneously causes the entire feed loading loop to continue, skipping that feed's load. Affected days do not have processed data in gtfs-data/schedule/processed. This means they do not get included in our gtfs_schedule (latest) dataset, and our type 2 change tables.

Here is the line:

https://github.com/cal-itp/data-infra/blob/main/airflow/dags/gtfs_loader/gtfs_schedule_history_load.py#L63-L64

To fix

Query to get affected days:

SELECT calitp_itp_id, calitp_url_number, calitp_extracted_at FROM `cal-itp-data-infra.gtfs_schedule_history.calitp_files_updates`
WHERE is_agency_changed AND name="validation.json"
ORDER BY calitp_extracted_at DESC

We'll know we've correctly put the necessary files in processed when every row of the table gtfs_schedule_history.calitp_feed_updates has a corresponding folder in the schedule/processed directory of our bucket.

machow commented 2 years ago

This bug was surfaced by these issues: https://github.com/cal-itp/data-infra/issues/845, https://github.com/cal-itp/data-infra/issues/837

Here is a screencast of how I debugged https://www.loom.com/share/ec09a53e8e784d70b02aa65fe70cc136

machow commented 2 years ago

Looking again, I'm realizing the issue is different. Basically, we have two loops

However, the list of tables we iterate over in the inner loop is created using zip. Because zip is an iterator, and used outside of all the loops, it exhausts itself after the first feed, and so the rest of feeds don't do any processing. I think the remedy is the same as my original post, but we need to wrap zip in a list.

evansiroky commented 2 years ago

zip is an iterator

mind blown

evansiroky commented 2 years ago

So yeah, I definitely introduced this regression in 926fa255331d440a30b0031a189e0f542fd9fd3e. I saw that the table_details variable was seemingly being unnecessarily redeclared multiple times in a nested scope, so I naively thought I ought to move it to the block scope where its variables were defined. However, as noted, the zip command creates an iterator which creates problems later on.

I have gone ahead and made https://github.com/cal-itp/data-infra/pull/872 that should fix the problem I caused. Doing the asserts sounds like a good addition to make sure this doesn't happen again.

machow commented 2 years ago

Ah, thanks for the PR--it's definitely on me for not making the loader fail when feeds aren't there to load 😬.