Open jlstpaul opened 1 year ago
Do we need a new field to indicate which reference schedule to use if more than one is in effect on the same service date?
We discussed that putting a new field on every trip would be excessive. I think adding a table related to the entire TIDES data set (e.g., data_sources.csv or linked_datasets.csv), with an entry for each data source if there are multiple, might be a simpler approach.
I like the idea of a data_sources.csv file. Possible fields would include:
- data source identifier (the name of the data source)
- start_date
- end_date
- source_type (SCHEDULE or EVENT)
- in_service (Boolean, to match the proposed in_service field on vehicle_locations and trips_performed records)

In the TIDES architecture, my understanding was that event tables should generally reflect data from a single source and that a summary process would be used to combine those together into summary tables. It would be the summarizing process that would need to identify which event tables may be referring to the same entity (i.e., a trip).
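As a rough illustration of the data_sources.csv idea above, the sketch below lists the schedule sources in effect on a given service date. The column names follow the fields suggested in this comment, but the exact header names, the rows, and the `sources_in_effect` helper are assumptions made for the example, not part of the TIDES spec.

```python
import csv
import io
from datetime import date

# Hypothetical contents of data_sources.csv. The columns mirror the fields
# suggested above; the rows are invented for illustration only.
DATA_SOURCES_CSV = """\
data_source_id,start_date,end_date,source_type,in_service
gtfs_2024_spring,2024-03-01,2024-06-30,SCHEDULE,true
ods_2024_spring,2024-03-01,2024-06-30,SCHEDULE,false
avl_events,2024-01-01,2024-12-31,EVENT,true
"""


def sources_in_effect(csv_text, service_date, source_type="SCHEDULE"):
    """Return the ids of data sources of one type that cover a service date."""
    matches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        start = date.fromisoformat(row["start_date"])
        end = date.fromisoformat(row["end_date"])
        if row["source_type"] == source_type and start <= service_date <= end:
            matches.append(row["data_source_id"])
    return matches


# Two schedule sources overlap on this date, which is exactly the
# ambiguity the issue describes.
schedules = sources_in_effect(DATA_SOURCES_CSV, date(2024, 4, 15))
print(schedules)
```

A table like this says which sources exist and when, but on its own it does not say which one a particular trip_id belongs to when two schedule sources overlap.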
In short: right now the datapackage.json has the ability to describe the source (or multiple sources) of each data resource (table), which I think satisfies the basic need outlined here.
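For reference, a minimal sketch of what per-resource sources in a datapackage.json could look like, written as a Python dict so it can be inspected. The `sources` property with `title` and `path` comes from the Frictionless Data Package specification; the package name, file paths, source titles, and the `sources_by_resource` helper are made up for this example.

```python
import json

# Hypothetical descriptor: resource names match TIDES tables mentioned in
# this thread, but the paths and source titles are invented.
datapackage = {
    "name": "example-tides-package",
    "resources": [
        {
            "name": "trips_performed",
            "path": "trips_performed.csv",
            "sources": [
                {"title": "GTFS schedule feed", "path": "gtfs.zip"},
                {"title": "ODS deadheads", "path": "ods.zip"},
            ],
        },
        {
            "name": "vehicle_locations",
            "path": "vehicle_locations.csv",
            "sources": [{"title": "CAD/AVL export"}],
        },
    ],
}


def sources_by_resource(package):
    """Map each resource (table) name to the titles of its declared sources."""
    return {
        resource["name"]: [s["title"] for s in resource.get("sources", [])]
        for resource in package["resources"]
    }


print(json.dumps(sources_by_resource(datapackage), indent=2))
```

Note that this describes sources per table, not per record, so it identifies which schedules a table draws on without saying which one a given row refers to.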
Within the context of a single service which is using GTFS, ODS, and GTFS Realtime to define all trip movements, all the trips should be completely described by the following three sources:
- ODS.deadheads.deadhead_id (not guaranteed unique from trip_id, but perhaps this is something we should add as a requirement for ODS?)
- GTFS.trips.trip_id
- GTFS.TripUpdates.trip_id (for newly added trips)
graph LR;
scheduler(((scheduler))) --> gtfs.trips[/gtfs.trips.trip_id/]
scheduler(((scheduler))) --> ods.deadheads[/ODS.deadheads.deadhead_id/]
replan(((CAD/replan))) -.- scheduler
replan ---> TripUpdate[/GTFS.TripUpdate.trip_id/]
Of course, in practice there are:
- trip_ids which conflict across service_id in GTFS
- service_ids in GTFS which aren't guaranteed to have run during the time window indicated (and thus can't be identified from a calendar/time combo)

...so we can't assume that these aren't conflicting.

Having a crosswalk between all of these basic files above as a default seems desirable, leaving room for even more if necessary. Something super explicit like source.source_table.variable:
gtfs.trips.service_id
gtfs.trips.trip_id
ods.deadheads.deadhead_id
hastus.whatever.trip_id
realtime.tripupdate.trip_id
gmv.gtfs.trips.trip_id  # because GMV often creates new GTFS which aligns with realtime
Between all of these, the agency can decide how to assign a value for tides.trip_id, but they have a crosswalk between them at the outset.
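A minimal sketch of such a crosswalk, assuming one row per TIDES trip and one column per source.source_table.variable name. The rows, the `tides_trip_id` column name, and the `build_index` helper are hypothetical; the point is only that a lookup keyed on the fully qualified columns stays unambiguous even when raw identifiers collide.

```python
# Hypothetical crosswalk rows. Each row maps source identifiers onto one
# agency-assigned TIDES trip id; None means the source has no such trip.
crosswalk = [
    {
        "tides_trip_id": "T1",
        "gtfs.trips.service_id": "weekday",
        "gtfs.trips.trip_id": "1001",
        "ods.deadheads.deadhead_id": None,
    },
    {
        "tides_trip_id": "T2",
        "gtfs.trips.service_id": "saturday",
        "gtfs.trips.trip_id": "1001",  # same trip_id, different service_id
        "ods.deadheads.deadhead_id": None,
    },
    {
        "tides_trip_id": "T3",
        "gtfs.trips.service_id": None,
        "gtfs.trips.trip_id": None,
        "ods.deadheads.deadhead_id": "1001",  # collides with a GTFS trip_id
    },
]


def build_index(rows, columns):
    """Map the values of `columns` (a tuple of column names) to tides_trip_id.

    Rows with a missing value in any key column are skipped; a repeated key
    raises ValueError because the chosen columns cannot identify a trip.
    """
    index = {}
    for row in rows:
        key = tuple(row[column] for column in columns)
        if any(value is None for value in key):
            continue
        if key in index:
            raise ValueError(f"ambiguous key {key} for columns {columns}")
        index[key] = row["tides_trip_id"]
    return index


gtfs_index = build_index(
    crosswalk, ("gtfs.trips.service_id", "gtfs.trips.trip_id")
)
deadhead_index = build_index(crosswalk, ("ods.deadheads.deadhead_id",))

print(gtfs_index[("saturday", "1001")])
print(deadhead_index[("1001",)])
```

In this example, indexing on gtfs.trips.trip_id alone would raise an error, since "1001" appears under two service_ids.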
Describe the feature you want and how it meets your needs or solves a problem

As a creator of TIDES data, I want to link my TIDES data to different schedule data sources, even on the same service date. For example, I might have both a GTFS data set and an ODS data set. Or I may have two different GTFS data sets that don't combine well (i.e., they have conflicting identifiers).
Describe the solution you'd like

When referring to a trip_id, stop_id, or stop_sequence from the reference schedule, do we need a new field to indicate which reference schedule to use if more than one is in effect on the same service date?

Describe alternatives you've considered

One alternative to a new field is if the schedule data files themselves are coordinated or combined in some way. The key challenge is that if I have a trip_id, stop_id, or stop_sequence, I need to know which data set to look them up in.

Additional context and sample data

One specific context of this issue is the need to link TIDES data to both GTFS data and ODS data. Is there another way to address this?
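To make the lookup problem concrete, here is a small sketch of what a per-record field could look like. The field name `schedule_source_id`, the two in-memory schedules, and the `scheduled_trip` helper are all hypothetical and not part of the TIDES spec; this only illustrates the option being asked about, not a recommended design.

```python
# Two hypothetical schedule data sets that share a trip_id.
schedules = {
    "gtfs_a": {"1001": {"route_id": "10", "stop_ids": ["A", "B", "C"]}},
    "gtfs_b": {"1001": {"route_id": "22", "stop_ids": ["X", "Y"]}},
}

# Hypothetical TIDES-like records on the same service date. The
# schedule_source_id field is the illustrative new field.
records = [
    {"service_date": "2024-04-15", "trip_id": "1001", "schedule_source_id": "gtfs_a"},
    {"service_date": "2024-04-15", "trip_id": "1001", "schedule_source_id": "gtfs_b"},
]


def scheduled_trip(record):
    """Look up a record's trip in the schedule data set it points to."""
    return schedules[record["schedule_source_id"]][record["trip_id"]]


# Without schedule_source_id, both records would resolve to the same key.
routes = [scheduled_trip(record)["route_id"] for record in records]
print(routes)
```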