Bug: the data pipeline should be able to parse incomplete files

evansiroky commented 2 years ago

Describe the bug

Amtrak's new (to us) GTFS file broke the data pipeline. It appears that the stops.txt file is incomplete:

For some reason, the CSV parsing in gtfs_loader.gtfs_schedule_history_load did not flag the row that lacks enough columns as an error, so the file was marked as OK. As a result, gtfs_loader.gtfs_schedule_tables_load and gtfs_schedule_history2.merge_updates both failed on the unfortunate value of the very last item written to the stops.txt file.

To Reproduce

See error logs for gtfs_loader.gtfs_schedule_tables_load and gtfs_schedule_history2.merge_updates.

Expected behavior

The data pipeline should not fail fatally when a GTFS feed includes an incomplete file; it should flag such problems instead.

Additional context

We may end up needing to modify gtfs_loader.gtfs_schedule_history_load. A similar issue was reported in #691 and #693, sort of resolved with #716, and finally fixed with #872. Once a fix has been made and deployed to Airflow, we should clear the affected days to get data flowing again.

lauriemerrell commented 2 years ago

Also related #1015

lauriemerrell commented 2 years ago

Thinking out loud --

- The column in question is not "missing" from pandas' perspective because it's "-" (i.e., there is a string value present).
- As far as gtfs_schedule_history failing goes, this is actually a side effect of #1139: we enforce a data type on gtfs_schedule before it has actually been established by the pipeline. So the gtfs_loader.gtfs_schedule_tables_load portion of this issue will be addressed by in-progress work on #1137.
  - The error that is causing this task to fail is that gtfs_schedule.stops (the external table) has a data type expectation for the stop_lon field that is violated when this data is inserted.
- Similar story for merge_updates: we assert that stop_lon is a float, but nothing actually enforces that until the merge.
- I can hotfix this by just removing that assertion, because in general we are not asserting any data types at that stage (even integer columns like those in calendar are just treated as strings).
- I think the correct place to handle all data typing concerns is in the clean tables in gtfs_views_staging, as part of #1137.

Question: What do we want to happen with these? Should the faulty data be just converted to null? (i.e., use SAFE_CAST as part of GTFS views staging?) Or should the row be dropped?
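
To make the two options concrete, here is a minimal Python sketch of "convert to null" versus "drop the row". It only illustrates the behavior; the real change would presumably be SQL in the staging views. The safe_cast_float helper and the sample rows are hypothetical and not taken from the pipeline.

```python
from typing import Optional


def safe_cast_float(value: Optional[str]) -> Optional[float]:
    """Return value as a float, or None if it cannot be parsed (SAFE_CAST-like)."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


# Hypothetical rows; the second one mimics the lone "-" value described above.
rows = [
    {"stop_id": "A", "stop_lon": "-122.27"},
    {"stop_id": "B", "stop_lon": "-"},
]

# Option 1: keep every row, converting unparseable values to null (None).
nulled = [{**row, "stop_lon": safe_cast_float(row["stop_lon"])} for row in rows]

# Option 2: drop any row whose stop_lon could not be parsed.
dropped = [row for row in nulled if row["stop_lon"] is not None]
```

Option 1 keeps the stop visible downstream with a null longitude, while Option 2 removes it entirely, so the choice depends on whether a partially populated stop is more useful than none.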

cc @atvaccaro

lauriemerrell commented 2 years ago

Slight amendment on the above -- nothing checks the typing on an external table until you actually access that table. So I think that changing just gtfs_schedule_history (and then enforcing the typing later) should work. I don't think that gtfs_schedule actually has data types defined anywhere, I think they are just derived from gtfs_schedule_history.

atvaccaro commented 2 years ago

I would probably discard in this case, since the CSV line itself is malformatted.

Does pandas CSV parsing throw an error on incomplete rows?

lauriemerrell commented 2 years ago

Talked to @atvaccaro offline. We determined:

- pandas can throw an error for a line with too many commas (cc #1015) but not for one with too few. I personally think that too few commas is at least kind of supported as a "lazy" CSV, with the implication that all subsequent fields are blank.
- We are going to go ahead with the hotfix for now, with the understanding that the corrupt row will make it into views at least temporarily, with stop_lon null. Longer term questions moved to #1152.
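
If we ever do want to flag short rows explicitly, one possible approach is a separate pass that compares each row's field count to the header. The sketch below uses Python's standard csv module rather than pandas. The find_ragged_rows name and the sample content are hypothetical, and this is not how gtfs_loader currently works.

```python
import csv
import io


def find_ragged_rows(text):
    """Return (line_number, field_count) for rows whose width differs from the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    return [
        (reader.line_num, len(row))
        for row in reader
        if row and len(row) != len(header)  # skip blank lines
    ]


# Hypothetical stops.txt content whose final row is cut off mid-record.
sample = (
    "stop_id,stop_name,stop_lat,stop_lon\n"
    "A,First,37.80,-122.27\n"
    "B,Second,37.90\n"
)

ragged = find_ragged_rows(sample)  # rows with too few or too many fields
```

A check like this would catch a row with too few fields, but not a row that has the right number of fields and a truncated final value (such as a lone "-"), so type validation later on is still needed.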

cal-itp / data-infra

Bug: the data pipeline should be able to parse incomplete files #1148