evansiroky closed 2 years ago
Also related #1015
Thinking out loud --
-
(i.e., there is a string value present).
gtfs_schedule_history failing goes, this is actually a side effect of #1139 -- we enforce a data type on gtfs_schedule before that has actually been established by the pipeline. So the gtfs_loader.gtfs_schedule_tables_load portion of this issue will be addressed by in-progress work on #1137.
gtfs_schedule.stops (the external table) has a data type expectation for the stop_lon field that is violated when this data is inserted.
merge_updates -- we assert that stop_lon is a float but nothing actually enforces that until the merge.
calendar are just treated as strings.)
clean tables in gtfs_views_staging, as part of #1137.
Question: What do we want to happen with these? Should the faulty data be just converted to null? (i.e., use SAFE_CAST as part of GTFS views staging?) Or should the row be dropped?
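To make the two options concrete, here is a rough Python sketch of each policy. It is only an illustration: `safe_float`, `null_bad_values`, and `drop_bad_rows` are hypothetical helpers, the rows are made up, and the real change would presumably live in the staging SQL, where SAFE_CAST returns NULL instead of raising.

```python
def safe_float(value):
    """Return value as a float, or None if it cannot be parsed (similar in spirit to SAFE_CAST)."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def null_bad_values(rows, field):
    """Option 1: keep every row, replacing unparseable values with None."""
    return [{**row, field: safe_float(row.get(field))} for row in rows]


def drop_bad_rows(rows, field):
    """Option 2: keep only the rows whose value parses as a float."""
    return [
        {**row, field: safe_float(row.get(field))}
        for row in rows
        if safe_float(row.get(field)) is not None
    ]


# Made-up rows; the second has a string where stop_lon should be.
rows = [
    {"stop_id": "A", "stop_lon": "-122.0"},
    {"stop_id": "B", "stop_lon": "Beta"},
]
nulled = null_bad_values(rows, "stop_lon")
kept = drop_bad_rows(rows, "stop_lon")
```

Nulling keeps the stop visible downstream (without a usable longitude), while dropping hides the fact that the feed ever contained the row.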
cc @atvaccaro
Slight amendment on the above -- nothing checks the typing on an external table until you actually access that table. So I think that changing just gtfs_schedule_history (and then enforcing the typing later) should work. I don't think that gtfs_schedule actually has data types defined anywhere; I think they are just derived from gtfs_schedule_history.
I would probably discard in this case, since the CSV line itself is malformatted.
Does pandas CSV parsing throw an error on incomplete rows?
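As far as I know, pandas.read_csv does not raise on rows with too few fields by default; it pads the missing trailing fields with NaN, and only rows with too many fields raise a parser error. That would be consistent with this file being marked OK. A minimal sketch of an explicit check using the standard csv module follows; `find_short_rows` and the sample data are hypothetical, not part of gtfs_loader.

```python
import csv
import io


def find_short_rows(text):
    """Return (line_number, field_count) for rows with fewer fields than the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    short = []
    for row in reader:
        # Skip blank lines; reader.line_num is the 1-based line just consumed.
        if row and len(row) < len(header):
            short.append((reader.line_num, len(row)))
    return short


# Made-up stops.txt whose final row was cut off mid-write.
sample = (
    "stop_id,stop_name,stop_lat,stop_lon\n"
    "A,Alpha,37.0,-122.0\n"
    "B,Beta,38\n"
)
short_rows = find_short_rows(sample)
print(short_rows)
```

A check like this could flag the file as problematic without making the whole run fail.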
Talked to @atvaccaro offline. We determined:
views at least temporarily, with stop_lon null. Longer term questions moved to #1152.
Describe the bug
Amtrak's new (to us) GTFS file broke the data pipeline. It appears that the stops.txt file is incomplete:
For some reason, the CSV parsing in gtfs_loader.gtfs_schedule_history_load did not flag the row that did not have enough columns, so this file was marked as being OK. This resulted in gtfs_loader.gtfs_schedule_tables_load and gtfs_schedule_history2.merge_updates failing because of an unfortunate value in the very last item written to the stops.txt file.
To Reproduce
See error logs for gtfs_loader.gtfs_schedule_tables_load and gtfs_schedule_history2.merge_updates.
Expected behavior
The data pipeline should not have a fatal error upon any incomplete file(s) that are included in a GTFS feed, but should flag such problems accordingly.
Additional context
We may end up needing to modify gtfs_loader.gtfs_schedule_history_load. A similar effort was reported in #691 and #693, sort of resolved with #716, and finally fixed with #872. Once a fix has been made and deployed to Airflow, we should clear the affected days to get data flowing again.