Bug: exhausted generator in code causes feed updates to skip loading

machow commented 2 years ago

In our gtfs_schedule_history_load we skip loading validation.json, since it is handled in another task (gtfs_schedule_validation_load). However, our use of a continue erroneously causes the entire feed loading loop to continue, skipping that feed's load. Affected days do not have processed data in gtfs-data/schedule/processed. This means they do not get included in our gtfs_schedule (latest) dataset, and our type 2 change tables.

Here is the line:

https://github.com/cal-itp/data-infra/blob/main/airflow/dags/gtfs_loader/gtfs_schedule_history_load.py#L63-L64

To fix

~~change continue to pass~~: edit: see latest comment about our use of zip.
run an assert in gtfs_schedule_table_load using fs.exists to check that each entry in data_paths exists (this line)
re-run gtfs_schedule_history_load for any affected days. (see query below)
delete our type2 gtfs_schedule tables and reset gtfs_schedule_history2.merge_updates across all time

Query to get affected days:

SELECT calitp_itp_id, calitp_url_number, calitp_extracted_at FROM `cal-itp-data-infra.gtfs_schedule_history.calitp_files_updates`
WHERE is_agency_changed AND name="validation.json"
ORDER BY calitp_extracted_at DESC

We'll know we've correctly put the necessary files in processed when every row of the table gtfs_schedule_history.calitp_feed_updates has a corresponding folder in the schedule/processed directory of our bucket.

machow commented 2 years ago

This bug was surfaced by these issues: https://github.com/cal-itp/data-infra/issues/845, https://github.com/cal-itp/data-infra/issues/837

Here is a screencast of how I debugged https://www.loom.com/share/ec09a53e8e784d70b02aa65fe70cc136

machow commented 2 years ago

Looking again, I'm realizing the issue is different. Basically, we have two loops

outer: over all changed feeds
inner: over tables we'll process (for changed feeds)

However, the list of tables we iterate over in the inner loop is created using zip. Because zip is an iterator, and used outside of all the loops, it exhausts itself after the first feed, and so the rest of feeds don't do any processing. I think the remedy is the same as my original post, but we need to wrap zip in a list.

evansiroky commented 2 years ago

zip is an iterator

mind blown

evansiroky commented 2 years ago

So yeah, I definitely introduced this regression in 926fa255331d440a30b0031a189e0f542fd9fd3e. I saw that the table_details variable was seemingly being unnecessarily redeclared multiple times in a nested scope, so I naively thought I ought to move it to the block scope where its variables were defined. However, as noted, the zip command creates an iterator which creates problems later on.

I have gone ahead and made https://github.com/cal-itp/data-infra/pull/872 that should fix the problem I caused. Doing the asserts sounds like a good addition to make sure this doesn't happen again.

machow commented 2 years ago

Ah, thanks for the PR--it's definitely on me for not making the loader fail when feeds aren't there to load 😬.

cal-itp / data-infra

Bug: exhausted generator in code causes feed updates to skip loading #848

To fix