cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
47 stars 12 forks source link

Decide handling for trip with two route_id values on same date #2703

Open lauriemerrell opened 1 year ago

lauriemerrell commented 1 year ago

In #2489, I add a test that a given trip only has one route_id on a given service date. At time of writing, there is one failure for Santa Cruz. This query shows where the same trip ID transitions from one route to another:

SELECT _extract_ts, DATETIME(_extract_ts, "America/Los_Angeles") AS extract_pacific, header_timestamp, DATETIME(header_timestamp, "America/Los_Angeles") AS header_pacific, trip_update_timestamp, DATETIME(trip_update_timestamp, "America/Los_Angeles") AS tu_pacific, _header_message_age, _trip_update_message_age, trip_id, id, vehicle_id, trip_route_id, stop_sequence, stop_id, arrival_time, departure_time
FROM `cal-itp-data-infra.mart_gtfs.fct_stop_time_updates`
WHERE base64_url = 'aHR0cHM6Ly9ydC5zY21ldHJvLm9yZy9ndGZzcnQvdHJpcHM=' AND dt = '2023-06-08' AND trip_id = '543080' AND _extract_ts BETWEEN '2023-06-08 07:56:00 UTC' AND '2023-06-08 17:52:40 UTC'
ORDER BY _extract_ts, stop_sequence

I am not sure how we would want to handle cases like this.

The root cause is that the route 18 activity should be associated with 6/7 (Pacific) and the route 69A activity should presumably be associated with 6/8 (Pacific), but the route 18 activity crosses the midnight boundary Pacific and doesn't have trip_start_date or trip_start_time populated to disambiguate it. The route 69A behavior also does not actually appear in the schedule, so that may also be a case of erroneously-assigned trip ID.

lauriemerrell commented 1 year ago

This is a valid bug, leaving open.