TIDES-transit / TIDES

Transit ITS Data Exchange Specification for historical transit operations data
https://tides-transit.github.io/TIDES
Apache License 2.0

📄🚀 – specify schedule data set when more than one is used on a single service date #151

Open jlstpaul opened 1 year ago

jlstpaul commented 1 year ago

Describe the feature you want and how it meets your needs or solves a problem

As a creator of TIDES data, I want to link my TIDES data to different schedule data sources, even on the same service date. For example, I might have both a GTFS data set and an ODS data set. Or I may have two different GTFS data sets that don't combine well (i.e., they have conflicting identifiers).

Describe the solution you'd like

When referring to a trip_id, stop_id, or stop_sequence from the reference schedule, do we need a new field to indicate which reference schedule to use if more than one is in effect on the same service date?

Describe alternatives you've considered

One alternative to a new field is if the schedule data files themselves are coordinated or combined in some way. The key challenge is that if I have a trip_id, stop_id, or stop_sequence, I need to know which data set to look them up in.

Additional context and sample data

One specific context of this issue is the need to link TIDES data to both GTFS data and ODS data. Is there another way to address this?

tsherlockcraig commented 1 year ago

do we need a new field to indicate which reference schedule to use if there are more than one in effect on the same service date.

We discussed that putting a new field on every trip would be excessive. I think adding a table related to the entire TIDES data set (e.g., data_sources.csv or linked_datasets.csv), with an entry for each data source if there are multiple, might be a simpler approach.
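As a rough illustration of the table idea, here is a minimal Python sketch of reading such a file and resolving a data source by its identifier. The column names (`source_id`, `source_type`, `location`) and the file contents are hypothetical placeholders, not fields defined by TIDES or agreed on in this thread.

```python
import csv
import io

# Hypothetical data_sources.csv contents. The column names and values are
# illustrative assumptions, not part of the TIDES specification.
DATA_SOURCES_CSV = """source_id,source_type,location
gtfs_main,GTFS,gtfs_main.zip
ods_main,ODS,ods_main.zip
"""


def load_data_sources(text):
    """Return a dict mapping source_id to its row in the data sources table."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["source_id"]: row for row in reader}


def resolve_source(sources, source_id):
    """Return the row for the data set an identifier should be looked up in."""
    try:
        return sources[source_id]
    except KeyError:
        raise KeyError(f"unknown schedule data source: {source_id!r}") from None


sources = load_data_sources(DATA_SOURCES_CSV)
print(resolve_source(sources, "ods_main")["source_type"])  # prints "ODS"
```

With one row per data source, only tables that actually mix sources would need a reference back to this file, rather than a new field on every trip.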

jlstpaul commented 1 year ago

I like the idea of a data_sources.csv file. Possible fields would include

e-lo commented 1 year ago

In the TIDES architecture, my understanding was that event tables should generally reflect data from a single source, and that a summary process would combine them into summary tables. It would be the summarizing process that needs to identify which event tables may be referring to the same entity (e.g., a trip).

In short: right now the datapackage.json can describe the source (or multiple sources) of each data resource (table), which I think satisfies the basic need outlined here.
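A minimal Python sketch of what that could look like, assuming each resource in datapackage.json carries a Frictionless-style `sources` list of objects with `title` and `path`. The package name, source titles, and paths below are made up for illustration.

```python
import json

# Illustrative descriptor only: the package name, source titles, and paths
# are assumptions, not taken from a real TIDES data package.
DESCRIPTOR = json.loads("""
{
  "name": "example-tides-package",
  "resources": [
    {
      "name": "vehicle_locations",
      "path": "vehicle_locations.csv",
      "sources": [{"title": "GTFS schedule", "path": "gtfs_main.zip"}]
    },
    {
      "name": "trips_performed",
      "path": "trips_performed.csv",
      "sources": [
        {"title": "GTFS schedule", "path": "gtfs_main.zip"},
        {"title": "ODS schedule", "path": "ods_main.zip"}
      ]
    }
  ]
}
""")


def sources_by_resource(descriptor):
    """Map each resource (table) name to the titles of its declared sources."""
    return {
        resource["name"]: [src["title"] for src in resource.get("sources", [])]
        for resource in descriptor.get("resources", [])
    }


for table, titles in sources_by_resource(DESCRIPTOR).items():
    print(table, titles)
```

This tells a reader which sources fed each table, but not which source a given row's trip_id came from when a table lists more than one.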

Within the context of a single service that uses GTFS, ODS, and GTFS Realtime to define all trip movements, all trips should be completely described by the following three sources:

- ODS.deadheads.deadhead_id (not guaranteed to be unique from trip_id, but perhaps this is something we should add as a requirement for ODS?)
- GTFS.trips.trip_id
- GTFS.TripUpdates.trip_id (for newly added trips)

graph LR;
    scheduler(((scheduler))) --> gtfs.trips[/gtfs.trips.trip_id/]
    scheduler(((scheduler))) --> ods.deadheads[/ODS.deadheads.deadhead_id/]
    replan(((CAD/replan))) -.- scheduler
    replan ---> TripUpdate[/GTFS.TripUpdate.trip_id/]

Of course in practice there are:

Having a crosswalk between all of the basic files above as a default seems desirable, leaving room for more if necessary. Something super explicit like source.source_table.variable:

gtfs.trips.service_id
gtfs.trips.trip_id
ods.deadheads.deadhead_id
hastus.whatever.trip_id
realtime.tripupdate.trip_id
gmv.gtfs.trips.trip_id # because GMV often creates new GTFS which aligns with realtime

Between all of these, the agency can decide how to assign a value for tides.trip_id, but they have a crosswalk between them at the outset.
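To make the crosswalk idea concrete, here is a small Python sketch under stated assumptions: the rows, the identifier values, and the `tides.trip_id` column name are all hypothetical, and the source-qualified column names simply follow the source.source_table.variable pattern above.

```python
# Hypothetical crosswalk rows. Every identifier value is made up; note that
# the same raw value "1001" appears in both GTFS and ODS.
CROSSWALK = [
    {
        "tides.trip_id": "T1",
        "gtfs.trips.trip_id": "1001",
        "realtime.tripupdate.trip_id": "1001",
    },
    {
        "tides.trip_id": "T2",
        "ods.deadheads.deadhead_id": "1001",
    },
]


def build_index(rows):
    """Index (source column, source value) pairs to the agency's tides.trip_id."""
    index = {}
    for row in rows:
        tides_id = row["tides.trip_id"]
        for column, value in row.items():
            if column == "tides.trip_id":
                continue
            key = (column, value)
            if key in index and index[key] != tides_id:
                raise ValueError(f"{column}={value!r} maps to two TIDES trips")
            index[key] = tides_id
    return index


def to_tides_trip_id(index, column, value):
    """Return the TIDES trip_id for a source-qualified identifier, or None."""
    return index.get((column, value))


index = build_index(CROSSWALK)
print(to_tides_trip_id(index, "gtfs.trips.trip_id", "1001"))         # prints "T1"
print(to_tides_trip_id(index, "ods.deadheads.deadhead_id", "1001"))  # prints "T2"
```

Because the lookup key includes the source-qualified column name, a deadhead_id that collides with a GTFS trip_id still resolves to a distinct TIDES trip.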