cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Check on handling of `MACOSX` hidden files in v2 pipeline #2104

Closed lauriemerrell closed 1 year ago

lauriemerrell commented 1 year ago

The query:

SELECT DISTINCT feed_key, original_filepath, original_filename, t1.base64_url, string_url, gtfs_dataset_name, download_success, t2.unzip_success
FROM `cal-itp-data-infra.mart_gtfs_quality.fct_schedule_feed_files` AS t1
LEFT JOIN `cal-itp-data-infra.mart_gtfs.dim_schedule_feeds` AS t2
ON t1.feed_key = t2.key
WHERE original_filepath LIKE '%MACOSX%'
ORDER BY gtfs_dataset_name, original_filepath

This returns a number of hidden files that were in __MACOSX directories but have unzip_success=true. In some sense this is desirable behavior (I don't think we want to fail parsing because of hidden files), but I thought the unzip job was supposed to fail whenever any directories are found. The problem is that the __MACOSX directory doesn't seem to register as a directory (I believe the unzip outcomes file shows no zipfile directories found for these feeds).

The purpose of this ticket is to define the task of looking into this and figuring out:

atvaccaro commented 1 year ago

Investigated this further today as we actually violated our 99% success threshold; it turns out that Python's ZipFile does NOT treat __MACOSX as a directory, so we were only alerting when there was another directory in addition to the __MACOSX "directory". See https://gtfs.calitp.org/production/GuadalupeFlyerParatransitFlex.zip as an example of success and https://gtfs.calitp.org/production/HumboldtTransitAuthorityDialARideFlex.zip as an example of failure. I'm going to add special handling for __MACOSX so we exclude it from our definition of validity, as well as start reporting all unzip errors to Sentry.
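
For anyone reproducing this, here is a minimal sketch of the behavior described above. It is not the pipeline's actual code. It assumes the archive stores `__MACOSX/._stops.txt` as a file entry with no explicit `__MACOSX/` directory entry, which is one way `ZipInfo.is_dir()` can report no directories. The helper names (`build_example_zip`, `explicit_directories`, `is_macosx_entry`, `nested_entries`) are hypothetical and exist only for illustration.

```python
import io
import zipfile
from pathlib import PurePosixPath


def build_example_zip() -> bytes:
    """Build an in-memory zip with one real file and one __MACOSX hidden file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("stops.txt", "stop_id\n")
        # Stored as a plain file entry; no "__MACOSX/" directory entry is written.
        zf.writestr("__MACOSX/._stops.txt", "resource fork")
    return buf.getvalue()


def explicit_directories(zf: zipfile.ZipFile) -> list:
    """Entries that ZipFile itself reports as directories (names ending in '/')."""
    return [info.filename for info in zf.infolist() if info.is_dir()]


def is_macosx_entry(name: str) -> bool:
    """True if any path component of the entry is __MACOSX."""
    return "__MACOSX" in PurePosixPath(name).parts


def nested_entries(zf: zipfile.ZipFile, ignore_macosx: bool = True) -> list:
    """Entries that are directories or live inside one, optionally skipping __MACOSX."""
    nested = []
    for info in zf.infolist():
        if ignore_macosx and is_macosx_entry(info.filename):
            continue
        if info.is_dir() or len(PurePosixPath(info.filename).parts) > 1:
            nested.append(info.filename)
    return nested


with zipfile.ZipFile(io.BytesIO(build_example_zip())) as zf:
    print(explicit_directories(zf))                 # is_dir() finds nothing here
    print(nested_entries(zf, ignore_macosx=False))  # path-based check sees __MACOSX
    print(nested_entries(zf))                       # __MACOSX excluded from validity
```

Checking path components rather than relying only on `is_dir()` catches nested files even when the archive has no directory entries, and the `ignore_macosx` flag mirrors the idea of excluding `__MACOSX` from the definition of validity.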