cal-itp / data-infra

Cal-ITP data infrastructure
https://docs.calitp.org/data-infra
GNU Affero General Public License v3.0
48 stars 13 forks source link

Feature: Calculate each trip's likely arrival at each stop #2684

Open lauriemerrell opened 1 year ago

lauriemerrell commented 1 year ago

As part of #2489, I ran into a barrier on how best to summarize the trip updates / stop times updates arrival_time and departure_time fields at the trip level. I think that the actual best way to do this would be something like what was done to create the ad-hoc table cal-itp-data-infra.mart_ad_hoc.fct_final_trip_updates_arrival_departure_times_by_trip_stop_20230315_to_20230321 (SQL below) -- basically, get the final update for each stop, and assume that's the "actual" arrival/departure time. (Perhaps this should be filtered to where trip_update_timestamp <= arrival_time.)

The best way to summarize arrival/departure times at the trip level would probably be the "actual" arrival and departure times for the first & last stops, rather than any kind of global min/max. To summarize that though we'd probably first want a dedicated model trying to get the actual arrival/departures from the full stop times updates model, so that is the intent of this ticket.

SQL from the RT accuracy project to create cal-itp-data-infra.mart_ad_hoc.fct_final_trip_updates_arrival_departure_times_by_trip_stop_20230315_to_20230321 :

    SELECT DISTINCT
        gtfs_dataset_key,
        base64_url,
        service_date,
        trip_id,
        stop_id,
        trip_start_date,
        trip_start_time,
        trip_direction_id,
        trip_route_id,
        trip_schedule_relationship,
        DATETIME(TIMESTAMP_SECONDS(LAST_VALUE(arrival_time IGNORE NULLS) OVER(PARTITION BY base64_url, service_date, trip_id, trip_start_date, trip_start_time, stop_id ORDER BY COALESCE(trip_update_timestamp, header_timestamp) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)), "America/Los_Angeles") AS last_trip_updates_arrival_time,
        DATETIME(TIMESTAMP_SECONDS(LAST_VALUE(departure_time IGNORE NULLS) OVER(PARTITION BY base64_url, service_date, trip_id, trip_start_date, trip_start_time, stop_id ORDER BY COALESCE(trip_update_timestamp, header_timestamp) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)), "America/Los_Angeles") AS last_trip_updates_departure_time,
          LAST_VALUE(schedule_relationship IGNORE NULLS) OVER(PARTITION BY base64_url, service_date, trip_id, trip_start_date, trip_start_time, stop_id ORDER BY COALESCE(trip_update_timestamp, header_timestamp) ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS stop_schedule_relationship
      FROM `cal-itp-data-infra.mart_ad_hoc.fct_stop_time_updates_20230315_to_20230321`
lauriemerrell commented 1 year ago

I think this is relevant to @owades, leaving open because I think this feature is still needed.