Closed lauriemerrell closed 1 year ago
We (@atvaccaro and I) think this is moderately low priority at the time of writing; the two feeds doing this aren't prohibitively large.
I just realized today that we may need to do this for the reconstructed zipfiles as part of the Schedule backfill. Right now our file creation times reflect when the original job wrote the file to GCS.
We have observed that there are a few feeds that seem to have the timestamps on their GTFS schedule files update every day even if the file contents did not update (presumably something to do with how the files are automatically produced/served to the website).
This results in more versions than we want in `dim_schedule_feeds` (i.e., one version per day, even if no data changed). We would like to set up more sophisticated versioning that checks each individual file in the feed, since an individual file won't show as changed if only a timestamp changed, unlike the md5 hash of the whole zip.

Basically, instead of versioning based on the md5 hash of the zipfile, we'd create a hash of each file, combine those, and then check that as the feed-level hash. We think this could be handled relatively simply by making a few changes here: https://github.com/cal-itp/data-infra/blob/main/airflow/dags/unzip_and_validate_gtfs_schedule/unzip_gtfs_schedule.py#L139-L141 in how the `zipfile_md5_hash` value is generated (i.e., keep the overall architecture the same, just change how that hashed value is generated).