cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0

Rows in gtfs_schedule_type2.validation_notices have calitp_extracted_at = calitp_deleted_at #1334

Closed lauriemerrell closed 2 years ago

lauriemerrell commented 2 years ago

While preparing the March reports, @Nkdiaz and I noticed some odd behavior with gtfs_schedule_type2.validation_notices. There are a large number (9,564,658 out of the total 70,346,207 = ~14%) of rows where calitp_extracted_at = calitp_deleted_at.

This should not happen; there should always be at least one full day between extraction and deletion.

Furthermore, this causes problems in the reports because we look for rows that satisfy this condition: (lhs.calitp_extracted_at <= rhs.date) & (func.coalesce(lhs.calitp_deleted_at, "2099-01-01") > rhs.date). For rows where calitp_extracted_at = calitp_deleted_at, it is impossible to satisfy the condition: the second comparison is a strict inequality, so no rhs.date can be both on or after calitp_extracted_at and strictly before calitp_deleted_at when the two are equal.
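To make the failure mode concrete, here is a minimal plain-Python sketch of the same predicate. It is not the report code itself: the function names, the FAR_FUTURE constant, and the 2022 dates are illustrative assumptions, and only the comparison logic is taken from the condition quoted above.

```python
from datetime import date, timedelta
from typing import List, Optional

# Stand-in for the "2099-01-01" default used in the coalesce() call.
FAR_FUTURE = date(2099, 1, 1)


def is_active_on(extracted_at: date, deleted_at: Optional[date], report_date: date) -> bool:
    """Mirror the report condition: extracted_at <= date AND coalesce(deleted_at) > date."""
    effective_deleted_at = deleted_at if deleted_at is not None else FAR_FUTURE
    return extracted_at <= report_date and effective_deleted_at > report_date


def active_dates(extracted_at: date, deleted_at: Optional[date], candidates: List[date]) -> List[date]:
    """Return every candidate date on which the row would be picked up by the report."""
    return [d for d in candidates if is_active_on(extracted_at, deleted_at, d)]


# Hypothetical dates, chosen only for illustration.
march = [date(2022, 3, 1) + timedelta(days=i) for i in range(31)]
same_day = date(2022, 3, 10)

# Row with calitp_extracted_at == calitp_deleted_at: never matches any date.
zero_length = active_dates(same_day, same_day, march)

# Row deleted one day after extraction: matches exactly the extraction date.
one_day = active_dates(same_day, same_day + timedelta(days=1), march)

print(zero_length)
print(one_day)
```

A row with equal timestamps describes an empty half-open interval [extracted_at, deleted_at), which is why it is silently dropped from every report date rather than raising an error.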

AC for this ticket:

lauriemerrell commented 2 years ago

Two causes identified: