🐛📄 – trips_performed primary key {date, trip_id} is insufficient to identify trips on frequency based schedules

Describe the problem

date and trip_id_performed are necessary but not sufficient to uniquely identify a trip performed, which may require a trip start_time if the GTFS feed contains frequencies.txt.

For example:

GTFS trips.txt

route_id	service_id	trip_id	direction_id	shape_id
r1	weekday	trip1	0	r1p1

GTFS frequencies.txt

trip_id	start_time	end_time	headway_secs	exact_times
trip1	8:00	16:00	600	1

trip1 is any of the trips on route r1, direction 0, service_id weekday that start between 8:00 and 16:00 (48 trips at a 600-second headway).
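As a rough check on how many departures share the one trip_id, the frequency window can be expanded into individual start times. This is only a sketch: it assumes the window is half-open (a trip departs at start_time and every headway_secs after it, but not at end_time itself), and the function name is made up for illustration.

```python
def expand_frequency(start_time, end_time, headway_secs):
    """Return trip start times (seconds after midnight) for one frequencies.txt row.

    Assumes end_time is exclusive, i.e. no trip departs exactly at end_time.
    """
    def to_secs(hhmm):
        hours, minutes = hhmm.split(":")
        return int(hours) * 3600 + int(minutes) * 60

    return list(range(to_secs(start_time), to_secs(end_time), headway_secs))


# The frequencies.txt row from the example above.
starts = expand_frequency("8:00", "16:00", 600)
print(len(starts))  # number of departures that all carry trip_id "trip1"
```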

Possible Solutions

Recommended: add start_time to the fields and primaryKey of trips_performed, making it required.
or, make start_time conditionally required if GTFS contains frequencies.txt. No change to the primary key.
or, require that when exact_times is true, frequencies.txt trips are expanded into the stop_times table. But this doesn't work when exact_times is false.
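To make the difference between the first two options concrete, here is a small sketch that checks a candidate primary key for collisions. The rows and the helper name are hypothetical; they only illustrate two departures of the same frequency-based trip on one date.

```python
from collections import Counter


def duplicate_keys(rows, key_fields):
    """Return the key tuples that occur more than once in rows."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return [key for key, n in counts.items() if n > 1]


# Hypothetical trips_performed rows: two departures of the same frequency-based trip.
rows = [
    {"date": "2022-04-19", "trip_id": "trip1", "start_time": "08:00:00"},
    {"date": "2022-04-19", "trip_id": "trip1", "start_time": "08:10:00"},
]

print(duplicate_keys(rows, ["date", "trip_id"]))                # key collides
print(duplicate_keys(rows, ["date", "trip_id", "start_time"]))  # no collisions
```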

Update: 2022-04-19 The consensus is to continue to use the trip_id_performed and service_date, but to update the description of trip_id_performed to clarify that it doesn't need to be the same as the scheduled trip_id, and that it needs to be unique within a service_date. E.g., two or more vehicles performing the same scheduled trip would all have different values of trip_id_performed.

trip_id_performed is meant to be a unique ID generated by the AVL system for the trip performed, not the scheduled trip. This unique ID should be unique across all trips of all routes and all dates for the transit agency. In general, trip_id_performed should not equal the scheduled trip ID. The trips_performed table has a trip_id_scheduled for that purpose. If for some reason the AVL system does not generate trip IDs, then the converter to TIDES format should create a surrogate ID. In the example of a scheduled trip specified in GTFS frequencies.txt, many operated trips, each with its own unique operated ID, would all point to the same scheduled ID. And even in the simpler case of a typical scheduled trip at a specific time, there will be many operated trips pointing to the same scheduled trip... once for each day that scheduled trip is operated.
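A minimal sketch of what such a converter step might look like, assuming a simple serial number is acceptable as the surrogate. The function and the actual_start field are illustrative names, not part of TIDES, and a real converter would need to persist the counter so that IDs stay unique across dates and runs.

```python
import itertools


def assign_surrogate_ids(operated_trips, start=1):
    """Attach a serial trip_id_performed to each operated trip record."""
    serial = itertools.count(start)
    return [{**trip, "trip_id_performed": str(next(serial))} for trip in operated_trips]


# Hypothetical operated trips: one frequency-based scheduled trip run three times.
operated = [
    {"service_date": "2022-04-19", "trip_id_scheduled": "trip1", "actual_start": "08:01:10"},
    {"service_date": "2022-04-19", "trip_id_scheduled": "trip1", "actual_start": "08:11:45"},
    {"service_date": "2022-04-20", "trip_id_scheduled": "trip1", "actual_start": "08:00:30"},
]
trips_performed = assign_surrogate_ids(operated)
for trip in trips_performed:
    print(trip["trip_id_performed"], trip["trip_id_scheduled"])
```

Each operated trip gets its own trip_id_performed while all of them keep pointing at the same trip_id_scheduled.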

Thanks for these insights. The description of the field in the report isn't super clear about how this field needs to be unique, although it is implied by the primary key. I'm concerned that the way it was intended to work is a little too rigid, and that there may be an easier and more consistent way.

For one, trips_performed is supposed to be a summary table. To me, summary means that a table can be summarized from event tables and the GTFS schedule. From figure 4 (page 22) I see that trips_performed is the only summary table that doesn't have that property, and is not a pure summary table. It also looks like trip_id_scheduled is the only such field that prevents trips_performed from being a true summary table. One solution is to add trip_id_scheduled to the event tables. This would make trips_performed a true summary table, because it wouldn't depend on information unavailable in the event tables or the GTFS feed.

Finally, there's the issue of how CAD/AVL systems behave and how that differs from assumptions in the report. Our system (Transit Master) reports a new or replacement vehicle as operating the original scheduled trip_id, so it would seem that we'd have to create a composite trip_id_performed composed of trip_id_scheduled, vehicle_id and possibly trip start_time. Why not just add those to the primary key? I don't see a lot of advantages to having the requirement for a unique trip_id_performed over adding additional fields to the primary key.

So far I see two options:

require a converter for vendor outputs to ensure that trip_id_performed is unique
or, add additional fields to the primary key, possibly vehicle_id and actual_trip_start, essentially building a unique trip_id_performed from those additional fields. This would allow direct interpretation of trip_ids when systems use unique IDs for service as run, and those that re-use the scheduled trip_id, without requiring an extra step in the converter.

In either case, I think it's desirable to have trips_performed be a true summary table that can be generated from event tables, and I think that would require adding trip_id_scheduled to the event tables.

Here's an abbreviated sample of some stop-crossing records where two vehicles were serving the same scheduled trip. Our system (Transit Master) records and reports only the scheduled trip_id as the trip_ids of the trips as run:

PATTERN_GEO_NODE_SEQ	CALENDAR_ID	ROUTE_DIRECTION_ID	PATTERN_ID	GEO_NODE_ID	BLOCK_STOP_ORDER	SCHEDULED_TIME	ACT_ARRIVAL_TIME	ACT_DEPARTURE_TIME	ODOMETER	DAILY_WORK_PIECE_ID	TIME_POINT_ID	SERVICE_TYPE_ID	VEHICLE_ID	TRIP_ID	PULLOUT_ID	IsRevenue	SCHEDULE_TIME_OFFSET	CROSSING_TYPE_ID	OPERATOR_ID	CANCELLED_FLAG	IS_LAYOVER	ROUTE_ID
1	120200403	3	879705	21046	490	55200	54035	55471	2907	11354696	1435	6	2229	11109919	9139345	R	NA	2	4254	FALSE	TRUE	115815
4	120200403	3	879705	9036	493	55500	55618	55667	2946	11354696	786	6	2229	11109919	9139345	R	NA	0	4254	FALSE	FALSE	115815
6	120200403	3	879705	5010	495	55680	55725	55780	2981	11354696	541	6	2229	11109919	9139345	R	NA	0	4254	FALSE	FALSE	115815
12	120200403	3	879705	21948	501	56100	56080	56130	3057	11354696	561	6	2229	11109919	9139345	R	NA	0	4254	FALSE	FALSE	115815
18	120200403	3	879705	23490	507	56580	56585	56615	3193	11354696	2013	6	2229	11109919	9139345	R	NA	0	4254	FALSE	FALSE	115815
25	120200403	3	879705	17075	514	57060	56861	57917	3287	11354696	1125	6	2229	11109919	9139345	R	NA	1	4254	FALSE	TRUE	115815
4	120200403	3	879705	9036	493	55200	55006	55158	4056	11355948	786	6	2235	11109919	9140138	R	NA	0	6335	FALSE	FALSE	115815
6	120200403	3	879705	5010	495	55380	55216	55377	4091	11355948	541	6	2235	11109919	9140138	R	NA	0	6335	FALSE	FALSE	115815
12	120200403	3	879705	21948	501	55800	55573	55735	4172	11355948	561	6	2235	11109919	9140138	R	NA	0	6335	FALSE	FALSE	115815
18	120200403	3	879705	23490	507	56280	56202	56247	4311	11355948	2013	6	2235	11109919	9140138	R	NA	0	6335	FALSE	FALSE	115815
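To illustrate with a few of the rows above: TRIP_ID alone collapses both vehicles into one trip, while pairing it with VEHICLE_ID separates them. This is just a sketch over four hand-copied rows, and the concatenated ID format is an arbitrary choice for illustration.

```python
# (VEHICLE_ID, TRIP_ID, GEO_NODE_ID, ACT_ARRIVAL_TIME) copied from four sample rows above.
crossings = [
    (2229, 11109919, 9036, 55618),
    (2229, 11109919, 5010, 55725),
    (2235, 11109919, 9036, 55006),
    (2235, 11109919, 5010, 55216),
]

# Distinct trips if only the reported TRIP_ID is used.
by_trip_id = {trip_id for _, trip_id, _, _ in crossings}

# Distinct trips if TRIP_ID is combined with VEHICLE_ID.
by_trip_and_vehicle = {(trip_id, vehicle_id) for vehicle_id, trip_id, _, _ in crossings}

# One possible composite trip_id_performed built by concatenation (format is arbitrary).
composite_ids = sorted(f"{trip_id}_{vehicle_id}" for trip_id, vehicle_id in by_trip_and_vehicle)

print(len(by_trip_id), len(by_trip_and_vehicle), composite_ids)
```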

Another thing to consider is the trip_id of GTFS Realtime, which should have added ids for any trip added to the schedule (or generated from a frequency-based schedule)

For one, trips_performed is supposed to be a summary table. To me, summary means that a table can be summarized from event tables and the GTFS schedule. From figure 4 (page 22) I see that trips_performed is the only summary table that doesn't have that property, and is not a pure summary table. It also looks like trip_id_scheduled is the only such field that prevents trips_performed from being a true summary table.

I think of trips_performed as a summary table only because a trip is not an event at a particular time, but a collection of events that is "summarized" into a single row. This is consistent with the distinction made in page 16 of the report, in which event tables have one row per event, each at an instant in time. Even a serviced stop is considered a summary type in this regard, because arrival, departure, and each boarding and alighting are individual events happening while the stop is being visited. There may be a system that has simple GPS breadcrumb data for each vehicle, and not routes and trips, and in that case the individual events could be processed to distill stop visits and trips, but if you have AVL like Transit Master, it will give you "summary" data for each trip, for each (or at least some) stops, etc. So where summary data comes from is situation-dependent.

One solution is to add trip_id_scheduled to the event tables. This would make trips_performed a true summary table, because it wouldn't depend on information unavailable in the event tables or the GTFS feed.

I support your idea to add trip_id_scheduled to the vehicle_locations table as an optional field. There are AVL systems like Clever that report breadcrumbs or heartbeats, which are timestamped events, and those events come decorated with a trip_id. Not all systems do that, but some do. I don't know of any AVL system that records trips and doesn't generate stop-level and trip-level data files (considered "summary" files in TIDES), and this is probably why the TIDES spec omitted that field: if you have both, you can always join the events to the stops and trips based on vehicle ID and time, and get the scheduled trip ID that way. But there is no harm in having the option to identify the scheduled trip in vehicle events, and it could be convenient to avoid the fuzzy join.
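For reference, the fuzzy join mentioned here could look roughly like the sketch below. All field names (start, end, timestamp) and the time windows are assumptions for illustration, with times in seconds after midnight; a real implementation would also have to handle overlapping windows and clock drift.

```python
def scheduled_trip_for_event(event, trips):
    """Return trip_id_scheduled of the first trip whose vehicle and time window contain the event."""
    for trip in trips:
        same_vehicle = trip["vehicle_id"] == event["vehicle_id"]
        in_window = trip["start"] <= event["timestamp"] <= trip["end"]
        if same_vehicle and in_window:
            return trip["trip_id_scheduled"]
    return None  # no matching trip, e.g. a deadhead or an unmatched vehicle


# Hypothetical trips_performed rows with start/end times.
trips = [
    {"vehicle_id": "2229", "start": 54035, "end": 57917, "trip_id_scheduled": "11109919"},
    {"vehicle_id": "2235", "start": 55006, "end": 56247, "trip_id_scheduled": "11109919"},
]

matched = scheduled_trip_for_event({"vehicle_id": "2229", "timestamp": 55700}, trips)
unmatched = scheduled_trip_for_event({"vehicle_id": "9999", "timestamp": 55700}, trips)
print(matched, unmatched)
```

Carrying trip_id_scheduled directly on the event rows would make this lookup unnecessary.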

Finally, there's the issue of how CAD/AVL systems behave and how that differs from assumptions in the report. Our system (Transit Master) reports a new or replacement vehicle as operating the original scheduled trip_id, so it would seem that we'd have to create a composite trip_id_performed composed of trip_id_scheduled, vehicle_id and possibly trip start_time.

Although those three columns can indeed serve to identify an operated trip, it's not an ideal primary key because not all operated trips are scheduled, and it is more efficient and direct to identify operated trips with an ID generated for that purpose, which most AVL systems have. Transit Master deployments I'm familiar with have a unique trip_id column in addition to the reference to a scheduled trip ID, block ID, etc. In a system that didn't have such an ID, the TIDES implementer should indeed create an ID, either by generating a surrogate serial number and keeping track of it, or by concatenating a few fields that are guaranteed to be unique in that system.

Why not just add those to the primary key? I don't see a lot of advantages to having the requirement for a unique trip_id_performed over adding additional fields to the primary key.

Because not all systems will identify trips using that combination of fields, many/most will use a single-field serial number to do so, and adding fields to the primary key would signify that the other fields aren't sufficient to identify a trip. Primary keys are generally defined with the minimum set of fields required to uniquely identify records, and in practice when there are many fields required, a surrogate ID is usually generated. Besides, requiring a multi-field key would complicate the software that consumes TIDES data.

So far I see two options:

1. require a converter for vendor outputs to ensure that trip_id_performed is unique

Definitely! And the converter will have to check many other things to comply with TIDES.

Our system (Transit Master) records and reports only the scheduled trip_id as the trip_ids of the trips as run

Are you sure? Maybe in another table, version of the table, etc. that you don't have access to? I ask only because if the system is keeping track of operated trips, it probably has an identifier for those internally, and it would be strange if it were left out of the file/table. It's conceivable that there really isn't an ID, in which case one should be generated.

Another thing to consider is the trip_id of GTFS Realtime, which should have added ids for any trip added to the schedule (or generated from a frequency-based schedule)

Here's how this currently works in GTFS-realtime, though the feature is still "experimental":

GTFS-realtime supports duplicating existing trips by setting schedule_relationship to DUPLICATED, adding a TripProperties message with trip_id (the new ID, not contained in the static feed, equivalent to trip_id_performed here), start_date and start_time. The TripProperties.trip_id field must be empty if schedule_relationship is not duplicated. All feed entities (regardless of schedule_relationship status) contain a TripDescriptor message, which contains trip_id, equivalent to our trip_id_scheduled.

Each GTFS-realtime TripUpdate for duplicated service contains both the unique trip_id_performed and the trip_id_scheduled, similar to my proposal above to add trip_id_scheduled to the event tables. In GTFS-realtime, trip_id_performed is expected to be the GTFS ID unless it's duplicated service.
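A rough sketch of that mapping, using plain dicts as stand-ins for the protobuf messages rather than the real GTFS-realtime bindings. The keys trip and trip_properties mirror the TripDescriptor and TripProperties messages described above, but the dict layout itself is an assumption for illustration.

```python
def tides_trip_ids(trip_update):
    """Map a TripUpdate-like dict to (trip_id_performed, trip_id_scheduled)."""
    descriptor = trip_update["trip"]  # stands in for TripDescriptor
    scheduled_id = descriptor["trip_id"]
    if descriptor.get("schedule_relationship") == "DUPLICATED":
        # The new, unique ID lives in TripProperties for duplicated service.
        performed_id = trip_update["trip_properties"]["trip_id"]
    else:
        # Otherwise the performed trip is identified by the GTFS trip_id.
        performed_id = scheduled_id
    return performed_id, scheduled_id


# Hypothetical updates: one regular trip and one duplicated trip.
regular = {"trip": {"trip_id": "trip1", "schedule_relationship": "SCHEDULED"}}
duplicated = {
    "trip": {"trip_id": "trip1", "schedule_relationship": "DUPLICATED"},
    "trip_properties": {"trip_id": "trip1-extra", "start_date": "20220930", "start_time": "08:10:00"},
}

print(tides_trip_ids(regular))
print(tides_trip_ids(duplicated))
```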

I think there is value in matching the GTFS-realtime semantics as much as possible.

For one, trips_performed is supposed to be a summary table. To me, summary means that a table can be summarized from event tables and the GTFS schedule. From figure 4 (page 22) I see that trips_performed is the only summary table that doesn't have that property, and is not a pure summary table. It also looks like trip_id_scheduled is the only such field that prevents trips_performed from being a true summary table.

I think of trips_performed as a summary table only because a trip is not an event at a particular time, but a collection of events that is "summarized" into a single row. This is consistent with the distinction made in page 16 of the report, in which event tables have one row per event, each at an instant in time. Even a serviced stop is considered a summary type in this regard, because arrival, departure, and each boarding and alighting are individual events happening while the stop is being visited. There may be a system that has simple GPS breadcrumb data for each vehicle, and not routes and trips, and in that case the individual events could be processed to distill stop visits and trips, but if you have AVL like Transit Master, it will give you "summary" data for each trip, for each (or at least some) stops, etc. So where summary data comes from is situation-dependent.

One solution is to add trip_id_scheduled to the event tables. This would make trips_performed a true summary table, because it wouldn't depend on information unavailable in the event tables or the GTFS feed.

I support your idea to add trip_id_scheduled to the vehicle_locations table as an optional field. There are AVL systems like Clever that report breadcrumbs or heartbeats, which are timestamped events, and those events come decorated with a trip_id. Not all systems do that, but some do. I don't know of any AVL system that records trips and doesn't generate stop-level and trip-level data files (considered "summary" files in TIDES), and this is probably why the TIDES spec omitted that field: if you have both, you can always join the events to the stops and trips based on vehicle ID and time, and get the scheduled trip ID that way. But there is no harm in having the option to identify the scheduled trip in vehicle events, and it could be convenient to avoid the fuzzy join.

I'm moving the discussion of summarizing events into trips_performed to its own issue #70

GTFS-realtime supports duplicating existing trips by setting schedule_relationship to DUPLICATED, adding a TripProperties message with trip_id (the new ID, not contained in the static feed, equivalent to trip_id_performed here), start_date and start_time. The TripProperties.trip_id field must be empty if schedule_relationship is not duplicated. All feed entities (regardless of schedule_relationship status) contain a TripDescriptor message, which contains trip_id, equivalent to our trip_id_scheduled.

Each GTFS-realtime TripUpdate for duplicated service contains both the unique trip_id_performed and the trip_id_scheduled, similar to my proposal above to add trip_id_scheduled to the event tables. In GTFS-realtime, trip_id_performed is expected to be the GTFS ID unless it's duplicated service.

I think there is value in matching the GTFS-realtime semantics as much as possible.

I agree with the spirit of trying to match GTFS where possible, but in this case the approach is convoluted and messy. Maybe it makes sense for GTFS-realtime because its purpose is to give riders real-time updates about schedule adjustments and when the bus is coming to their stop. It doesn't make sense for TIDES, since its purpose is to standardize AVL/APC/AFC data. Moreover, TIDES should work cleanly and simply when we have data of operated trips without a schedule, whether it is because schedule data doesn't exist or because it isn't relevant for the analysis task at hand and therefore those working on it don't want to go through the work of converting schedules too.

Finally, there's the issue of how CAD/AVL systems behave and how that differs from assumptions in the report. Our system (Transit Master) reports a new or replacement vehicle as operating the original scheduled trip_id, so it would seem that we'd have to create a composite trip_id_performed composed of trip_id_scheduled, vehicle_id and possibly trip start_time.

Although those three columns can indeed serve to identify an operated trip, it's not an ideal primary key because not all operated trips are scheduled, and it is more efficient and direct to identify operated trips with an ID generated for that purpose, which most AVL systems have. Transit Master deployments I'm familiar with have a unique trip_id column in addition to the reference to a scheduled trip ID, block ID, etc. In a system that didn't have such an ID, the TIDES implementer should indeed create an ID, either by generating a surrogate serial number and keeping track of it, or by concatenating a few fields that are guaranteed to be unique in that system.

Ours does not provide a unique ID. The sample data I provided contains TM internal IDs, not our public-facing GTFS IDs. But we've also identified examples in the TM-generated GTFS-realtime TripUpdates feed with multiple vehicles assigned to the same trip_id. That makes sense, because until recently there really was no way to provide a unique trip_id in realtime that would mean anything: there was no way to link it to an existing trip_id in GTFS static.

That's a great point, not all trips are scheduled. There are a few options, as reflected in the GTFS-realtime TripUpdates documentation, shown with their schedule_relationship value in all-caps:

trip runs as scheduled: SCHEDULED, TripDescriptor.trip_id MUST match a trip_id in GTFS trips.txt, or trip must be identified using: schedule_relationship = SCHEDULED, and TripDescriptor.{route_id, direction_id, start_time, start_date}. TripProperties.trip_id must not be populated.
trip runs on a route without a fixed schedule: UNSCHEDULED, uses GTFS frequencies.txt with exact_times = 0, identifying the trip requires TripDescriptor.{trip_id, start_date, start_time}. TripProperties.trip_id, the equivalent of trip_id_performed, is not allowed.
trip added to the schedule: ADDED, basically deprecated and replaced by DUPLICATED. The ability to provide an entirely new trip not based on anything in the schedule is part of the GTFS-realtime ServiceChanges proposal. This ability is an issue for TIDES because we currently specify relationships to GTFS static only, and there would need to be a way to archive and reference the contents of GTFS-realtime ServiceChanges to access extended attributes about service (e.g., block_id, route_short_name, shape_dist_traveled), see #50.
trip removed from schedule: CANCELED, not run, won't appear in TIDES event tables.
trip copied from the schedule: DUPLICATED, the new trip uses an existing trip as a template and requires a new start_date/time, a unique trip_id not found in GTFS trips.txt (all from the TripProperties message) and the scheduled trip_id (TripDescriptor) found in trips.txt. Can also refer to a new shape that doesn't exist within GTFS shapes.txt.

GTFS-realtime TripProperties.trip_id, the equivalent of trip_id_performed, is prohibited unless schedule_relationship is DUPLICATED, or potentially ADDED if/when GTFS-realtime gains the ability to describe entirely new trips not related to the schedule.
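Stated as a check, the rule as I read it looks like the sketch below. The function is hypothetical and only encodes the current behaviour summarized above (required for DUPLICATED, prohibited otherwise); it does not anticipate any future change for ADDED.

```python
def trip_properties_id_is_valid(schedule_relationship, trip_properties_trip_id):
    """Check whether the presence of TripProperties.trip_id fits the schedule_relationship."""
    has_id = bool(trip_properties_trip_id)
    if schedule_relationship == "DUPLICATED":
        return has_id      # a new unique trip_id is required
    return not has_id      # SCHEDULED, UNSCHEDULED, ADDED, CANCELED: must be empty


print(trip_properties_id_is_valid("DUPLICATED", "trip1-extra"))
print(trip_properties_id_is_valid("SCHEDULED", "trip1-extra"))
print(trip_properties_id_is_valid("UNSCHEDULED", None))
```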

Why not just add those to the primary key? I don't see a lot of advantages to having the requirement for a unique trip_id_performed over adding additional fields to the primary key.

Because not all systems will identify trips using that combination of fields, many/most will use a single-field serial number to do so, and adding fields to the primary key would signify that the other fields aren't sufficient to identify a trip. Primary keys are generally defined with the minimum set of fields required to uniquely identify records, and in practice when there are many fields required, a surrogate ID is usually generated. Besides, requiring a multi-field key would complicate the software that consumes TIDES data.

A trip doesn't exist unless a vehicle is serving it, and if it exists, it must have started on some date and time, so, vehicle_id and date (already required in passenger_events and vehicle_locations) and start_time will always exist for any row in trips_performed.

Can you explain how a multi-field key would complicate software that consumes TIDES data? I regularly use multi-field primary keys across a variety of vendor products (and GTFS) and find them quite useful.

So far I see two options:
1. require a converter for vendor outputs to ensure that trip_id_performed is unique
Definitely! And the converter will have to check many other things to comply with TIDES.

Our system (Transit Master) records and reports only the scheduled trip_id as the trip_ids of the trips as run

Are you sure? Maybe in another table, version of the table, etc. that you don't have access to? I ask only because if the system is keeping track of operated trips, it probably has an identifier for those internally, and it would be strange if it were left out of the file/table. It's conceivable that there really isn't an ID, in which case one should be generated.

I'm pretty sure. In the sample data a few fields distinguish trips operating under the same TRIP_ID, the most obvious and reliable is VEHICLE_ID, but DAILY_WORK_PIECE_ID, OPERATOR_ID and PULLOUT_ID also work.

As I noted previously, in GTFS-realtime the equivalent of trip_id_performed MUST NOT be provided if the schedule relationship is not duplicated. So the proposed table structure is very much at odds with existing practice, even if it is "experimental".

A trip doesn't exist unless a vehicle is serving it, and if it exists, it must have started on some date and time, so, vehicle_id and date (already required in passenger_events and vehicle_locations) and start_time will always exist for any row in trips_performed.

Yes, a compound primary key of (vehicle_id, date, start_time) without trip_id_scheduled would be a valid natural key. This would work for bus systems in almost all cases. (Hardware could glitch and generate two records for the same vehicle at the same time, but it would be reasonable for a TIDES processor to de-duplicate that, selecting the one that makes the most sense or combining records if necessary. This should be quite rare anyway.)
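A sketch of that de-duplication step, keyed on the natural key. Keeping the first record seen is a simplifying assumption; a real processor might instead pick the most plausible record or merge the two, as noted above. The records and the source field are hypothetical.

```python
def deduplicate(trips):
    """Keep one record per (vehicle_id, date, start_time), preferring the first seen."""
    unique = {}
    for trip in trips:
        key = (trip["vehicle_id"], trip["date"], trip["start_time"])
        unique.setdefault(key, trip)  # later duplicates of the same key are ignored
    return list(unique.values())


# Hypothetical records: the second one is a glitch duplicate of the first.
records = [
    {"vehicle_id": "2229", "date": "2022-09-30", "start_time": "15:20:00", "source": "a"},
    {"vehicle_id": "2229", "date": "2022-09-30", "start_time": "15:20:00", "source": "b"},
    {"vehicle_id": "2235", "date": "2022-09-30", "start_time": "15:20:00", "source": "a"},
]
cleaned = deduplicate(records)
print(len(cleaned))
```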

It could be a little bit more challenging (but not impossible) to apply this in rail systems, because in that case vehicles are formed each day by joining rail cars, and it is rail cars that have IDs, but it is not uncommon for the wayside detection equipment to miss one or more cars. A vehicle location cleaner would have to deal with this. But rail systems often give an operated trip ID that is independent of the cars detected, and helpful for analysis and for the cleaning process. Although rare, it is sometimes the case that a train is coupled or uncoupled mid-trip, in which case the natural key would change.

My sense, based on over a decade of experience dealing with data of multiple rail and bus systems, and having tried both approaches, is that the surrogate trip_id_performed is the better approach, but I do understand that if you don't see such a key generated by your system you would naturally prefer to create a key based on what you do have. At the risk of complicating the standard, could we say in the spec that trip_id_performed is the preferred primary key, but that if that is not available then the compound natural key is a valid alternative, at least in TIDES data generated by minimally processing vendor equipment data? (This data could later be processed and restated in the same format, fixing location/time issues and adding trip_id_performed.)
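To show what consumers would have to do under that compromise, here is a sketch of a key function that prefers trip_id_performed and falls back to the compound natural key. The function name and the tuple layout are illustrative only.

```python
def trip_key(trip):
    """Return the preferred surrogate key, or the compound natural key as a fallback."""
    performed_id = trip.get("trip_id_performed")
    if performed_id:
        return ("trip_id_performed", performed_id)
    return ("natural", trip["vehicle_id"], trip["date"], trip["start_time"])


# Hypothetical rows: one with a surrogate ID, one minimally processed without it.
with_id = {"trip_id_performed": "tp-1001", "vehicle_id": "2229", "date": "2022-09-30", "start_time": "15:20:00"}
without_id = {"vehicle_id": "2235", "date": "2022-09-30", "start_time": "15:20:00"}

print(trip_key(with_id))
print(trip_key(without_id))
```

The tag in the first position keeps the two kinds of key from colliding when both appear in the same data set.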

Can you explain how a multi-field key would complicate software that consumes TIDES data? I regularly use multi-field primary keys across a variety of vendor products (and GTFS) and find them quite useful.

The natural key would work conceptually (especially for bus), but it could get a bit more complicated in some of the cases I outlined above, and in some cases it would be problematic if the key included trip_id_scheduled.

As I noted previously, in GTFS-realtime the equivalent of trip_id_performed MUST NOT be provided if the schedule relationship is not duplicated. So the proposed table structure is very much at odds with existing practice, even if it is "experimental".

GTFS-realtime is designed to provide real-time data to customers on variations from the schedule as appearing in the trip planner and on real-time vehicle locations. It is not a good standard for bulk transit ITS data, and it was never meant to be that. The team that drafted TIDES looked at existing practice as it relates to bulk transit ITS data, and set out to build the best standard for that purpose. As it pertains to this issue, making the TIDES spec require that performed trips be identified by trip_id_scheduled when that works as a unique identifier within a day, but by another ID otherwise, seems possible but awkward and conducive to bugs in user code, in light of how TIDES data will be used. Since the trip_id_performed that does work in cases of duplicate scheduled trips would sometimes be required, changing the spec wouldn't save TIDES implementers from having to handle that.

I would be curious to look at your bus AVL data sometime, including the Transit Master data dictionary provided for your implementation, all available lookup tables, etc.

Hello,

We have a use case, which I don't think is that unique, that is relevant to this discussion. It revolves around two types of service: extra service for schools, and special event service for sporting events. For the first case, we run extra service for as many as twenty (20) school districts across our nine garages, with each district having its own school calendar. Rather than try to maintain a precise calendar across HASTUS, TransitMaster, and GTFS-Static, we treat every weekday in TransitMaster as a school day and annul the trips as necessary. On the other hand, in GTFS every day is a non-school day, and no extra service appears in GTFS-Static due to the perceived calendar complexity issues. These trips actually have trip numbers that are distinct in TransitMaster, and are compatible with GTFS, so we would like to be able to pick these up through the GTFS-RT data, if possible.

The second issue is special events service. This is primarily service to regional sports teams. As start times can vary, and can be changed at the last moment due to broadcasting requirements, we do not try to schedule the trips. However, we do have runs set up in HASTUS and TransitMaster that are dedicated to these services. Several buses can log into each run (perhaps up to 12 buses per run?). In this way, we at least have identifiers that indicate the nature of the service. Can there be a way to capture this through GTFS-RT that can be fed through to TIDES? At present, we have trouble capturing this data without capturing our own GTFS-RT data.

Thanks for looking at these two use cases in this discussion. I don't think the first one is that abnormal. Extra service is an issue for many transit agencies.

Regards, James Garner

On Fri, Sep 30, 2022 at 10:55 AM Joey Reid @.***> wrote:

Finally, there's the issue of how CAD/AVL systems behave and how that differs from assumptions in the report. Our system (Transit Master) reports a new or replacement vehicle as operating the original scheduled trip_id, so it would seem that we'd have to create a composite trip_id_performed composed of trip_id_scheduled, vehicle_id and possibly trip start_time.

Although those three columns can indeed serve to identify an operated trip, it's not an ideal primary key because not all operated trips are scheduled, and it is more efficient and direct to identify operated trips with an ID generated for that purpose, which most AVL systems have. Transit Master deployments I'm familiar with have a unique trip_id column in addition to the reference to a scheduled trip ID, block ID, etc. In a system that didn't have such an ID, the TIDES implementer should indeed create an ID, either by generating a surrogate serial number and keeping track of it, or by concatenating a few fields that are guaranteed to be unique in that system.

Ours does not provide a unique ID. The sample data I provided contains TM internal IDs, not our public facing GTFS IDs. But we've also identified examples in the TM generated GTFS-realtime TripUpdates feed with multiple vehicles assigned to the same trip_id. Which makes sense because until recently there really was no way to provide a unique trip_id in realtime that would mean anything, because there was no way to link it to an existing trip_id in GTFS static.

That's a great point, not all trips are scheduled. There are a few options, as reflected in the GTFS-realtime TripUpdates documentation https://gtfs.org/realtime/reference/#message-tripupdate, shown with their schedule_relationship value in all-caps:

trip runs as scheduled: SCHEDULED, TripDescriptor.trip_id MUST match a trip_id in GTFS trips.txt, or trip must be identified using: schedule_relationship = SCHEDULED, and TripDescriptor.{route_id, direction_id, start_time, start_date}. TripProperties.trip_id must not be populated.

trip runs on a route without a fixed schedule: UNSCHEDULED, uses GTFS frequencies.txt with exact_times = 0, identifying the trip requires TripDescriptor.{trip_id, start_date, start_time}. TripProperties.trip_id, the equivalent of trip_id_performed is not allowed.

trip added to the schedule: ADDED, basically deprecated and replaced by DUPLICATED. The ability to provide an entirely new trip not based on anything in the schedule is part of the GTFS-realtime ServiceChanges proposal. This ability is an issue for TIDES because we currently specify relationships to GTFS static only, and there would need to be a way to archive and reference the contents of GTFS-realtime ServiceChanges to access extended attributes about service (e.g., block_id, route_short_name, shape_dist_traveled), see #50 https://github.com/TIDES-transit/TIDES/issues/50.

trip removed from schedule: CANCELED, not run, won't appear in TIDES event tables.

trip copied from the schedule: DUPLICATED, the new trip uses an existing trip as a template and requires a new start_date/time, a unique trip_id not found in GTFS trips.txt (all from the TripProperties message) and the scheduled trip_id (TripDescriptor) found in trips.txt. Can also refer to a new shape https://gtfs.org/realtime/reference/#message-shape that doesn't exist within GTFS shapes.txt.

GTFS-realtime TripProperties.trip_id, the equivalent of trip_id_performed is prohibited unless schedule_relationship is DUPLICATED, or potentially ADDED if/when GTFS realtime gains the ability to describe entirely new trips not related to the schedule.

Why not just add those to the primary key? I don't see a lot of advantages to having the requirement for a unique trip_id_performed over adding additional fields to the primary key.

Because not all systems will identify trips using that combination of fields, many/most will use a single-field serial number to do so, and adding fields to the primary key would signify that the other fields aren't sufficient to identify a trip. Primary keys are generally defined with the minimum set of fields required to uniquely identify records, and in practice when there are many fields required, a surrogate ID is usually generated. Besides, requiring a multi-field key would complicate the software that consumes TIDES data.

A trip doesn't exist unless a vehicle is serving it, and if it exists, it must have started on some date and time, so, vehicle_id and date (already required in passenger_events and vehicle_locations) and start_time will always exist for any row in trips_performed.

Can you explain how a multi-field key would complicate software that consumes TIDES data? I regularly use multi-field primary keys across a variety of vendor products (and GTFS) and find them quite useful.

So far I see two options:

require a converter for vendor outputs to ensure that trip_id_performed is unique

Definitely! And the converter will have to check many other things to comply with TIDES.

Our system (Transit Master) records and reports only the scheduled trip_id as the trip_ids of the trips as run

Are you sure? Maybe in another table, version of the table, etc. that you don't have access to? I ask only because if the system is keeping track of operated trips, it probably has an identifier for those internally, and it would be strange if it were left out of the file/table. It's conceivable that there really isn't an ID, in which case one should be generated.

I'm pretty sure. In the sample data a few fields distinguish trips operating under the same TRIP_ID. The most obvious and reliable is VEHICLE_ID, but DAILY_WORK_PIECE_ID, OPERATOR_ID and PULLOUT_ID also work.
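
To make the converter option concrete, here is a minimal sketch of how a surrogate trip_id_performed could be derived when the vendor export only carries the scheduled TRIP_ID. The row layout, the `SERVICE_DATE` column name, and the function name are assumptions for illustration; they are not Transit Master's actual export format and are not part of the TIDES spec.

```python
from typing import Dict, Iterable, List, Tuple


def assign_trip_id_performed(rows: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Attach a surrogate trip_id_performed, unique within each service date.

    Assumption: rows sharing (SERVICE_DATE, TRIP_ID, VEHICLE_ID) belong to the
    same operated trip. On frequency-based routes a vehicle can repeat the
    same scheduled trip in a day, so a real converter would likely add
    PULLOUT_ID or a start time to this key.
    """
    assigned: Dict[Tuple[str, str, str], str] = {}
    next_serial: Dict[str, int] = {}  # per-date counter for new surrogate IDs
    out: List[Dict[str, str]] = []
    for row in rows:
        date = row["SERVICE_DATE"]
        key = (date, row["TRIP_ID"], row["VEHICLE_ID"])
        if key not in assigned:
            next_serial[date] = next_serial.get(date, 0) + 1
            assigned[key] = str(next_serial[date])
        out.append({
            "service_date": date,
            "trip_id_performed": assigned[key],
            "trip_id_scheduled": row["TRIP_ID"],  # scheduled ID is kept separately
            "vehicle_id": row["VEHICLE_ID"],
        })
    return out
```

With this approach, two vehicles running the same scheduled trip on the same date get different trip_id_performed values, while both still point to the same trip_id_scheduled.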

As I noted previously https://github.com/TIDES-transit/TIDES/issues/51#issuecomment-1263580908, in GTFS-realtime the equivalent of trip_id_performed MUST NOT be provided unless the schedule relationship is DUPLICATED. So the proposed table structure is very much at odds with existing practice, even if that practice is "experimental".

There was a lot of discussion on this issue and I want to try to resolve it. I believe the spec, as it stands, has the primary key for trips_performed as trip_id_performed + service_date. To me, this means that trip_id_performed must be unique across all trips on a given date. That is not quite consistent with @gabriel-korbato's comment that trip_id_performed should be unique across all time, but I think that is perhaps too difficult to manage. My sense is that defining trip_id_performed as unique on a given service date is adequate. Are there further objections to this? If not, I'd like to close this issue.

I'm not sure we have resolved @mtnsguy's use cases of school trips and event service. In both cases there are trips that do not have a trip_id_scheduled in GTFS. This is similar to the situation discussed in #50. As with that case, some other mechanism will be required to either modify the GTFS files to include those trips or add another data source with the added trip information. I don't believe that is a V1 milestone issue, so perhaps this should be broken out for further discussion or added to issue #50.

We can't close this until the description field for trip_id_performed is updated to reflect the working definition of this field: it must be unique within a service_date, and it does not need to be the same as trip_id_scheduled. Anyone want to make a pull request?
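
For anyone validating data against that working definition, a check could look roughly like the sketch below. It assumes trips_performed rows are already loaded as dictionaries keyed by field name; the function name is hypothetical and this is not an official TIDES validator.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def duplicate_trip_keys(trips: Iterable[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (service_date, trip_id_performed) pairs that occur more than once."""
    counts = Counter((t["service_date"], t["trip_id_performed"]) for t in trips)
    return sorted(key for key, n in counts.items() if n > 1)
```

An empty result means the {service_date, trip_id_performed} key holds for the rows checked. The same trip_id_performed appearing on two different service dates is not reported, since uniqueness is only required within a date.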
