cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
47 stars 12 forks

Make more robust zipfile hashes in v2 pipeline #1931

Closed lauriemerrell closed 1 year ago

lauriemerrell commented 1 year ago

We have observed that a few feeds have timestamps on their GTFS schedule files that update every day even when the file contents do not change (presumably something to do with how the files are automatically produced/served on the website).

This results in more versions than we want in dim_schedule_feeds (i.e., one version per day, even if no data changed). We would like to attempt to set up more sophisticated versioning that checks each individual file in the feed (which won't show as changed if only a timestamp changed, unlike the md5 hash of the whole zip).

Basically, instead of versioning based on the md5 hash of the zipfile, we'd create a hash of each file within the zip, combine those, and check the combined value as the feed-level hash. We think this could be handled relatively simply by making a few changes here: https://github.com/cal-itp/data-infra/blob/main/airflow/dags/unzip_and_validate_gtfs_schedule/unzip_gtfs_schedule.py#L139-L141 in how the zipfile_md5_hash value is generated (i.e., keep the overall architecture the same and just change how that hashed value is generated).
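A minimal sketch of the idea, assuming nothing about the actual pipeline code: hash each member's decompressed contents (which don't include zip metadata like modification timestamps), then fold the per-file hashes into one feed-level hash. The function name and structure here are hypothetical, not the code at the linked location.

```python
import hashlib
import zipfile


def content_based_zip_hash(zip_path: str) -> str:
    """Combine per-member content hashes into one feed-level hash.

    Unlike an md5 of the raw zip bytes, this is insensitive to
    timestamp-only changes in the zip's metadata.
    """
    combined = hashlib.md5()
    with zipfile.ZipFile(zip_path) as zf:
        # Sort member names so the result doesn't depend on archive order.
        for name in sorted(zf.namelist()):
            member_md5 = hashlib.md5(zf.read(name)).hexdigest()
            # Include the filename so a rename also produces a new version.
            combined.update(name.encode("utf-8"))
            combined.update(member_md5.encode("utf-8"))
    return combined.hexdigest()
```

Two zips containing identical files but different stored timestamps would produce the same value from this function, while their whole-file md5 hashes would differ.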

lauriemerrell commented 1 year ago

We (@atvaccaro and I) think this is moderately low priority at time of writing; the two feeds doing this aren't prohibitively large.

atvaccaro commented 1 year ago

I just realized today that we may need to do this for the reconstructed zipfiles as part of the Schedule backfill. Right now the file creation times reflect when the original job wrote the file to GCS.