Closed lauriemerrell closed 2 years ago
Realized the fix here might be better as changing calitp_hash!=prev_calitp_hash OR is_first_extraction
to calitp_hash!=COALESCE(prev_calitp_hash,"") OR is_first_extraction
. Did some testing:
2021-12-29
, so there literally aren't any other instances of this situation where the ID/URL combo appeared, disappeared, reappearedHowever, if we do actually fix #1018 before #521, this would lead to the new, correct AC Transit data not appearing in views
. (If we fix both #1018 and #521, I think this would resolve itself.)
Double however, if we push a change here before #1018 is fixed, it would just result in the incorrect Capitol Corridor data appearing in views
, which we don't actually want either.
So, not sure what to do here. Fully correct answer seems to be just deleting the bad data from #1018 and #521. Not sure if we'd want to make a fix here to guard against future cases of this...?
There is an issue in
gtfs_views_staging.calitp_feeds
that means that in any case where we have an ID/URL combination inagencies.yaml
, that combination is removed for some period of time but the actual data that was pulled under the wrong label is not removed from/relabeled in the warehouse, and then the ID/URL combination is added back again toagencies.yaml
, the data pulled under the re-added ID/URL combo will not appear inviews
until the ITP ID's feed information (URL or name, etc.) changes a second time post re-add.Example / details
The
calitp_itp_id
4 (AC Transit) /calitp_url_number
1 combination has had a weird history:agencies.yaml
However, while the pair was assigned to the Capitol Corridor URL no data about it flowed into
views
.The Capitol Corridor data is present in all the relevant
gtfs_schedule_history
data. It falls out of the pipeline at the creation ofgtfs_views_staging.calitp_feeds
.Checking each of the subqueries:
gtfs_schedule_feed_snapshot
- CC data is presentfeed_id_cross_date
- CC data is presentlag_md5_hash
- CC data is present -- correctly shows data deleted in May when Sage Stage issue was first fixed, data missing May - Dec, then data resumes in Dechash_check
- CC data is present but theis_changed
flag isnull
on2021-12-29
(the date that the CC data starts being collected under this ID) and false subsequently, until it istrue
on2022-02-02
(the date that the assignment was fixed)feed_snapshot_with_removed
is where the CC data falls out -- theis_removed
flag staystrue
up until2022-12-28
, correctly, but then there is nothing until2022-02-02
.So, the problem is in
hash_check
. Basically, the logic here didn't account for the case where an ID/URL combo is present, then is removed for some period of time, and then is reinstated.Specifically,
previous_hash
isnull
on2021-12-29
andis_first_extraction
is alsofalse
, sois_changed
is not populated for this row. Weirdlycalitp_hash!=prev_calitp_hash
does not trigger astrue
when one is null and the other is populated?I think we could add a hacky check for
OR ((calitp_hash IS NULL) <> (prev_calitp_hash IS NULL))
or something here...The question is whether we think we actually need to support this case: Is there a legitimate situation where data would be removed and then reinstated? If we remove the underlying raw data from the wrong location, I think this would clear up on its own. But if we have cases like this again where something was done in error in the past, that error was fixed, and then new correct data is added, the correct data will not be picked up immediately. (Like, in this case it's just a fluke that the re-added data was also wrong.)
Originally posted by @lauriemerrell in https://github.com/cal-itp/data-infra/issues/1018#issuecomment-1029484292