Closed lauriemerrell closed 2 years ago
Two causes identified:
LEAD
calitp_extracted_at
, we are not PARTITION BY
calitp_url_number
. So calitp_extracted
at is being comingled across URLs. The rows affected by this are those belonging to ITP IDs that have multiple URLs (the two examples that we were seeing yesterday, 10
and 4
, are both in that category). This is easy to fix. gtfs_loader.gtfs_validation_history_load
, we are loading validations every day (even if the actual GTFS data did not change) even though we say (in a comment) that we are only getting updated data. This one is probably more of a design decision on whether we think there's a valid reason to do this, so I will make a new ticket.
While preparing the March reports, @Nkdiaz and I noticed some odd behavior with
gtfs_schedule_type2.validation_notices
. There are a large number (9,564,658 out of the total 70,346,207 = ~14%) of rows wherecalitp_extracted_at
=calitp_deleted_at
.This should not happen; there should always be at least one full day between extraction and deletion.
Furthermore, this causes problems in the reports because we look for rows that satisfy this condition:
(lhs.calitp_extracted_at <= rhs.date) & (func.coalesce(lhs.calitp_deleted_at, "2099-01-01") > rhs.date)
. For rows wherecalitp_extracted_at = calitp_deleted_at
, it is impossibly to satisfy the condition (because of the strict inequality on the second condition).AC for this ticket:
calitp_extracted_at
andcalitp_deleted_at
ingtfs_schedule_type2.validation_notices