cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
47 stars 12 forks source link

gtfs_views_staging.calitp_feeds does not handle feed removal + gap + reinstatement correctly #1030

Closed lauriemerrell closed 2 years ago

lauriemerrell commented 2 years ago

There is an issue in gtfs_views_staging.calitp_feeds that means that in any case where we have an ID/URL combination in agencies.yaml, that combination is removed for some period of time but the actual data that was pulled under the wrong label is not removed from/relabeled in the warehouse, and then the ID/URL combination is added back again to agencies.yaml, the data pulled under the re-added ID/URL combo will not appear in views until the ITP ID's feed information (URL or name, etc.) changes a second time post re-add.

Example / details

The calitp_itp_id 4 (AC Transit) / calitp_url_number 1 combination has had a weird history:

However, while the pair was assigned to the Capitol Corridor URL no data about it flowed into views.

The Capitol Corridor data is present in all the relevant gtfs_schedule_history data. It falls out of the pipeline at the creation of gtfs_views_staging.calitp_feeds.

Checking each of the subqueries:

So, the problem is in hash_check. Basically, the logic here didn't account for the case where an ID/URL combo is present, then is removed for some period of time, and then is reinstated.

Specifically, previous_hash is null on 2021-12-29 and is_first_extraction is also false, so is_changed is not populated for this row. Weirdly calitp_hash!=prev_calitp_hash does not trigger as true when one is null and the other is populated?

I think we could add a hacky check for OR ((calitp_hash IS NULL) <> (prev_calitp_hash IS NULL)) or something here...

The question is whether we think we actually need to support this case: Is there a legitimate situation where data would be removed and then reinstated? If we remove the underlying raw data from the wrong location, I think this would clear up on its own. But if we have cases like this again where something was done in error in the past, that error was fixed, and then new correct data is added, the correct data will not be picked up immediately. (Like, in this case it's just a fluke that the re-added data was also wrong.)

Originally posted by @lauriemerrell in https://github.com/cal-itp/data-infra/issues/1018#issuecomment-1029484292

lauriemerrell commented 2 years ago

Realized the fix here might be better as changing calitp_hash!=prev_calitp_hash OR is_first_extraction to calitp_hash!=COALESCE(prev_calitp_hash,"") OR is_first_extraction. Did some testing:

However, if we do actually fix #1018 before #521, this would lead to the new, correct AC Transit data not appearing in views. (If we fix both #1018 and #521, I think this would resolve itself.)

Double however, if we push a change here before #1018 is fixed, it would just result in the incorrect Capitol Corridor data appearing in views, which we don't actually want either.

So, not sure what to do here. Fully correct answer seems to be just deleting the bad data from #1018 and #521. Not sure if we'd want to make a fix here to guard against future cases of this...?