Right now, we don't check data types in a structured way in the GTFS schedule pipeline. In #1148, we caused issues by setting data types in the gtfs_schedule_history external table definitions that weren't actually checked or handled, so any violations caused a total pipeline failure.
The proper place to check and enforce data types is gtfs_views_staging where the _clean tables are created.
To do this, we should decide how to handle data type violations like the one in #1148. As an interim fix there, we just used SAFE_CAST so that corrupt values reached the views as nulls instead of breaking the pipeline.
Should rows with illegal data types be dropped? Or should we pass the nulls through?
If we drop rows, how do we want to handle the resulting foreign key violations?
Are there other fields we want to CAST? Right now it's just stop lat/long.
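As a sketch of the two options under discussion, here is roughly what each would look like in the _clean table queries. This is illustrative only: the source table name (`gtfs_schedule_history.stops`) and column names are assumptions, not the pipeline's actual identifiers.

```sql
-- Option A: pass nulls through (the interim behavior from #1148).
-- BigQuery's SAFE_CAST returns NULL instead of erroring when a value
-- can't be cast to the target type.
SELECT
  stop_id,
  SAFE_CAST(stop_lat AS FLOAT64) AS stop_lat,
  SAFE_CAST(stop_lon AS FLOAT64) AS stop_lon
FROM gtfs_schedule_history.stops;  -- assumed table name

-- Option B: drop rows whose values fail the cast.
-- This keeps the _clean table fully typed, but can orphan references
-- in other tables (e.g. stop_times rows pointing at a dropped stop_id),
-- which is the foreign key question above.
SELECT
  stop_id,
  SAFE_CAST(stop_lat AS FLOAT64) AS stop_lat,
  SAFE_CAST(stop_lon AS FLOAT64) AS stop_lon
FROM gtfs_schedule_history.stops  -- assumed table name
WHERE SAFE_CAST(stop_lat AS FLOAT64) IS NOT NULL
  AND SAFE_CAST(stop_lon AS FLOAT64) IS NOT NULL;
```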
Inspired by #1148.