cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Bug: the data pipeline should be able to parse incomplete files #1148

Closed evansiroky closed 2 years ago

evansiroky commented 2 years ago

Describe the bug

Amtrak's new (to us) GTFS file broke the data pipeline. It appears that the stops.txt file is incomplete:

[Screenshot taken 2022-02-28 at 5:30:25 PM]

For some reason, the CSV parsing in gtfs_loader.gtfs_schedule_history_load did not flag the row with too few columns as an error, so the file was marked as OK. As a result, gtfs_loader.gtfs_schedule_tables_load and gtfs_schedule_history2.merge_updates both failed on the value of the very last item written to the stops.txt file.

To Reproduce

See error logs for gtfs_loader.gtfs_schedule_tables_load and gtfs_schedule_history2.merge_updates.

Expected behavior

The data pipeline should not fail fatally when a GTFS feed includes one or more incomplete files; it should flag such problems instead.
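
To illustrate the "flag, don't fail" behavior, here is a minimal sketch using only the Python standard library. It is not the code in gtfs_loader; `find_incomplete_rows` and the sample data are hypothetical, and it assumes a row counts as incomplete when its field count differs from the header's.

```python
import csv
import io


def find_incomplete_rows(text):
    """Split CSV text into well-formed rows and a list of flagged problems.

    Hypothetical helper: a row is flagged (not raised as an error) when its
    field count differs from the header's field count.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return [], [], [(1, "file is empty")]

    good_rows, problems = [], []
    for row in reader:
        if not row:
            continue  # csv.reader yields [] for blank lines; skip them
        if len(row) != len(header):
            problems.append(
                (reader.line_num, f"expected {len(header)} fields, got {len(row)}")
            )
        else:
            good_rows.append(row)
    return header, good_rows, problems


# Made-up stops.txt content whose last line was cut off mid-write.
sample = (
    "stop_id,stop_name,stop_lat,stop_lon\n"
    "ABC,Example Station,35.0,-120.0\n"
    "XYZ,Truncated Sta"
)
header, good_rows, problems = find_incomplete_rows(sample)
print(problems)  # [(3, 'expected 4 fields, got 2')]
```

The caller can then decide whether to record the problems alongside the file's status or to reject the file outright. Note that a truncated line which happens to have the right number of fields would not be caught by a count check alone.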

Additional context

We may end up needing to modify gtfs_loader.gtfs_schedule_history_load. A similar problem was reported in #691 and #693, partially resolved with #716, and finally fixed with #872. Once a fix has been made and deployed to Airflow, we should clear the affected days to get data flowing again.

lauriemerrell commented 2 years ago

Also related: #1015

lauriemerrell commented 2 years ago

Thinking out loud --

Question: What do we want to happen with these? Should the faulty data just be converted to null (i.e., use SAFE_CAST as part of GTFS views staging)? Or should the row be dropped?

cc @atvaccaro
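
For comparison, the two options in the question above can be sketched in plain Python. This is only an illustration of the trade-off, not the pipeline's SQL: `safe_float`, `null_bad_values`, and `drop_bad_rows` are hypothetical names, and `safe_float` merely mimics the way BigQuery's SAFE_CAST returns NULL instead of raising an error.

```python
def safe_float(value):
    """Return float(value), or None when the value cannot be parsed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def null_bad_values(rows, numeric_fields):
    """Option 1: keep every row, replacing unparseable numeric values with None."""
    return [
        {k: (safe_float(v) if k in numeric_fields else v) for k, v in row.items()}
        for row in rows
    ]


def drop_bad_rows(rows, numeric_fields):
    """Option 2: keep only rows whose numeric fields all parse.

    Caveat: this also drops rows where a numeric field is empty or missing.
    """
    return [
        row
        for row in null_bad_values(rows, numeric_fields)
        if all(row.get(f) is not None for f in numeric_fields)
    ]


# Made-up rows; the second one has a stop_lon that was cut off mid-value.
rows = [
    {"stop_id": "ABC", "stop_lat": "35.0", "stop_lon": "-120.0"},
    {"stop_id": "XYZ", "stop_lat": "34.5", "stop_lon": "-11x"},
]
numeric = {"stop_lat", "stop_lon"}

nulled = null_bad_values(rows, numeric)
dropped = drop_bad_rows(rows, numeric)
print(len(nulled), len(dropped))  # 2 1
```

Option 1 keeps the row count stable but leaves nulls for downstream consumers to handle; option 2 removes the damaged row entirely, which may be preferable when the line itself is known to be malformed.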

lauriemerrell commented 2 years ago

Slight amendment on the above -- nothing checks the typing on an external table until you actually access that table. So I think that changing just gtfs_schedule_history (and then enforcing the typing later) should work. I don't think that gtfs_schedule actually has data types defined anywhere; I think they are just derived from gtfs_schedule_history.

atvaccaro commented 2 years ago

I would probably discard the row in this case, since the CSV line itself is malformed.

Does pandas CSV parsing throw an error on incomplete rows?

lauriemerrell commented 2 years ago

Talked to @atvaccaro offline. We determined: