cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
47 stars 12 forks source link

Bug: Long Beach Transit (170) extracted at/deleted at fields empty in gtfs_schedule_fact_daily_trips #940

Closed edasmalchi closed 2 years ago

edasmalchi commented 2 years ago

Describe the bug When filtering for Long Beach Transit (itp id 170), calitp_extracted_at and calitp_deleted_at are empty in gtfs_schedule_fact_daily_trips

To Reproduce Steps to reproduce the behavior:

  1. Open gtfs_schedule_fact_daily_trips in Metabase and filter for itp id 170, view blank columns for calitp_extracted_at and calitp_deleted_at
  2. Or using siuba: tbl.views.gtfs_schedule_fact_daily_trips() >> filter(_.calitp_itp_id == 170)

Expected behavior calitp_extracted_at and calitp_deleted_at data present, which is helpful for analysts to run faster/smaller queries on a time period of interest.

Additional context itp_ids 16, 48, 208, and 372 may have a similar issue.

Nkdiaz commented 2 years ago

Research Journey:

For the purposes of this ticket I then started to research calitp_id 16, 48, 208, 372:

So it seems for calitp_id 170, 16, and 48 we are recieving calendar_dates.txt rather than calendar.txt which populates calitp_id_extracted and calitp_id_deleted at as null in gtfs_schedule_fact_daily_trips

Nkdiaz commented 2 years ago

After discussing at the standup, per the gtfs guidelines, agencies are only required to send either the calendar.txt or calendar_dates.txt. https://developers.google.com/transit/gtfs/reference#calendartxt so it is something we should handle on our end. So looking deeper into the select statements in the date_includion from the calendar_dates, we do no select calitp_extracted_at/deleted_at as part of the data to include so it is not present on the right side of the join which means that it does not get populated in the full join lower down. So by changing adding those to the above subtable then 170 gets populated