cal-itp / operational-data-standard

The Transit Operational Data Standard is an open standard for representing the transit schedules used by drivers, dispatchers, and planners to carry out transit operations.
https://ods.calitp.org
Apache License 2.0

new file: run_events.txt #60

Closed skyqrose closed 4 months ago

skyqrose commented 6 months ago

This is a combination of #51, #52, and #54, updated based on discussion and assuming some form of #55 is accepted.

I changed the name from #51: I now propose run_events.txt rather than runs.txt, because the file has one row per event, not one row per run, and it merges in the existing run_events.txt table.

This is a concrete proposal, meant to close issues rather than open them. It's intended to be acceptable as is, without any open questions or TODOs. Of course, I still expect discussion and minor changes, and I will edit the proposal to reflect them.

Summary:

Full Documentation:

Primary Key: (service_id, run_id, event_sequence)

| Field name | Type | Required | Description | (proposal comment) |
|---|---|---|---|---|
| service_id | ID referencing calendar.service_id | Required | Identifies a set of dates when the run is scheduled to take place. | |
| run_id | ID | Required | | |
| event_sequence | Non-negative integer | Required | The order of this event within a run. Must be unique within one (service_id, run_id). It's required and unique so it can be used in the Primary Key to uniquely identify events. Note that events may overlap in time. If they do, it may not be possible to define a single ordering that's correct for all uses. This column provides one consistent ordering. If a consumer cares about how overlapping events are ordered, they should sort based on the time fields and event_type. If Event A and Event B are on the same `service_id` and `run_id`, and Event A has a `start_time` before Event B, then Event A's `event_sequence` should be less than Event B's. If Event A and B have the same `start_time`, but Event A has an `end_time` before Event B, then Event A's `event_sequence` should be less than Event B's. If Event A and B have the same `start_time` and `end_time`, then their `event_sequence` values can be in either order, but they must be different. Values do not have to be consecutive. | Added after discussion. |
| piece_id | ID | Optional | Identifies the piece within the run in which the event takes place. May be blank if the event takes place outside of a piece, like a break, or if the agency does not use piece ids. | |
| block_id | ID referencing trips.block_id | Optional | Identifies the block to which the run row belongs. If block_id exists, trip_id exists, and that trip's entry in trips.txt has a block_id, then the two block_ids must match. May exist even if trip_id does not (e.g. if an event represents a run-as-directed block with no scheduled trips). | |
| job_type | Text | Optional | The type of job that the employee is doing, in a human-readable format, e.g. "Assistant Conductor". Producers may use any values, but should be consistent. A single run may include more than one job_type throughout the day if the employee has multiple responsibilities, e.g. an "Operator" in the morning and a "Shifter" in the afternoon. | Based on discussion in #54. |
| event_type | Text | Required | The type of event that the employee is doing, in a human-readable format, e.g. "Sign-in". Producers may use any values, but should be consistent. Consumers may ignore events with an event_type that they don't recognize. | Based on discussion in #54. Replaces run_events.event_type, which was a numeric enum with specific supported values. We could consider publishing a list of standard values to use here, for common activities such as "Sign-in", "Operator", and "Break", but producers should be able to use arbitrary values in addition to standard values. The field is Text rather than ID or Enum so that even if consumers don't understand the meaning of a specific event_type, they can still display it. |
| trip_id | ID referencing trips.trip_id | Optional | If this run event corresponds to working on a trip, identifies that trip. | No longer need separate trip and deadhead ids, because of #55. |
| start_location | ID referencing stops.stop_id | Required | Identifies where the employee starts this event. If trip_id is set (and start_mid_trip is not 1), this should be the first stop of the trip. If start_mid_trip is 1, this should instead be the location where the employee starts, in the middle of the trip. | Location/time are always required, even for rows that correspond to trips, because it makes the spec simpler and consuming much easier. |
| start_time | Time | Required | Identifies the time when the employee starts this event. If trip_id is set (and start_mid_trip is not 1), this should be the time of the first stop of the trip. If start_mid_trip is 1, this should instead be the time when the employee starts, in the middle of the trip. | |
| start_mid_trip | Enum | Conditionally required | Indicates whether the event begins at the start of the trip or in the middle of the trip.<br>0 (or blank) - Row does not start mid-trip<br>1 - Row starts mid-trip<br>Required if the run event begins with a mid-trip relief. Optional otherwise. Recommended to leave this field blank if trip_id is not set. | |
| end_location | ID referencing stops.stop_id | Required | Identifies where the employee ends this event. If trip_id is set (and end_mid_trip is not 1), this should be the last stop of the trip. If end_mid_trip is 1, this should instead be the location where the employee ends, in the middle of the trip. | |
| end_time | Time | Required | Identifies the time when the employee ends this event. If trip_id is set (and end_mid_trip is not 1), this should be the time of the last stop of the trip. If end_mid_trip is 1, this should instead be the time when the employee ends, in the middle of the trip. Must be greater than or equal to start_time. | Relevant to discussion in #48: Note that this is time, not duration. Time makes more sense for a combined trips+events table, especially for mid-route reliefs. If we decide to add a minimum_duration field or something like it, that would be in addition to this field. |
| end_mid_trip | Enum | Conditionally required | Indicates whether the event ends at the end of the trip or in the middle of the trip.<br>0 (or blank) - Row does not end mid-trip<br>1 - Row ends mid-trip<br>Required if the run event ends with a mid-trip relief. Optional otherwise. Recommended to leave this field blank if trip_id is not set. | |
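
A few of the rules in the field definitions above can be checked mechanically. The following is a minimal sketch in Python, not part of the proposal: the function names `gtfs_seconds` and `validate_run_events` are hypothetical, and it only checks primary key uniqueness, `end_time >= start_time`, and that a handful of required fields are not blank. It assumes times follow the GTFS `HH:MM:SS` convention, where hours may exceed 24.

```python
import csv
import io
from collections import Counter


def gtfs_seconds(value: str) -> int:
    """Convert an HH:MM:SS time (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in value.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def validate_run_events(csv_text: str) -> list:
    """Return human-readable problems found in run_events.txt content."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []

    # Primary key: (service_id, run_id, event_sequence) must be unique.
    keys = Counter(
        (r["service_id"].strip(), r["run_id"].strip(), r["event_sequence"].strip())
        for r in rows
    )
    for key, count in keys.items():
        if count > 1:
            problems.append(f"duplicate primary key {key}")

    for line_number, r in enumerate(rows, start=2):  # line 1 is the header
        # end_time must be greater than or equal to start_time.
        if gtfs_seconds(r["end_time"]) < gtfs_seconds(r["start_time"]):
            problems.append(f"line {line_number}: end_time before start_time")
        # These required fields must not be blank.
        for field in ("event_type", "start_location", "end_location"):
            if not r[field].strip():
                problems.append(f"line {line_number}: {field} is blank")

    return problems
```

Checks that need other files, such as comparing block_id against trips.txt, are left out of this sketch.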

Examples

Single Run with Multiple Pieces and Pre-trip inspection

service_id,run_id,event_sequence,piece_id,block_id,job_type,event_type,trip_id,start_location,start_time,start_mid_trip,end_location,end_time,end_mid_trip
daily,10000,10,       ,       ,Operator,Report Time,        ,yard,09:45:00,,yard,09:45:00,
daily,10000,20,10000-1,       ,Operator,Pre-Trip Inspection,,yard,09:45:00,,yard,09:55:00,
daily,10000,30,10000-1,BLOCK-A,Operator,Pull-out,deadhead-1 ,yard,09:55:00,0,stop-1,09:58:00,0
daily,10000,40,10000-1,BLOCK-A,Operator,Operator,101        ,stop-1,10:00:00,0,stop-2,10:58:00,0
daily,10000,50,10000-1,BLOCK-A,Operator,Operator,102        ,stop-2,11:00:00,0,stop-1,11:58:00,0
daily,10000,60,       ,       ,Operator,Break,              ,stop-1,11:58:00,,stop-1,13:00:00,
daily,10000,70,10000-2,BLOCK-B,Operator,Operator,103        ,stop-1,13:00:00,0,stop-2,13:58:00,0
daily,10000,80,10000-2,BLOCK-B,Operator,Operator,104        ,stop-2,14:00:00,0,stop-1,14:58:00,0
daily,10000,90,10000-2,BLOCK-B,Operator,Pull-back,deadhead-2,stop-1,15:00:00,0,yard,15:03:00,0

Multiple Runs with Mid-Trip Relief

service_id,run_id,event_sequence,piece_id,block_id,job_type,event_type,trip_id,start_location,start_time,start_mid_trip,end_location,end_time,end_mid_trip
daily,10000,10,10000-1,BLOCK-A,Operator,Pull-out,deadhead-1 ,yard,09:55:00,0,stop-1,09:58:00,0
daily,10000,20,10000-1,BLOCK-A,Operator,Operator,101        ,stop-1,10:00:00,0,stop-2,10:58:00,0
daily,10000,30,10000-1,BLOCK-A,Operator,Operator,102        ,stop-2,11:00:00,0,mid-relief-stop,11:30:00,1
daily,20000,10,20000-1,BLOCK-B,Operator,Operator,102        ,mid-relief-stop,11:30:00,1,stop-2,13:58:00,0
daily,20000,20,20000-1,BLOCK-B,Operator,Pull-back,deadhead-2,stop-1,14:00:00,0,yard,14:03:00,0
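
As a rough illustration of how a consumer might use the mid-trip flags in this example, the sketch below pairs the operator leaving a trip with the operator taking it over. The matching rule (same service_id, trip_id, location, and time) and the function name `find_reliefs` are assumptions made for this illustration, not something the proposal specifies.

```python
import csv
import io


def find_reliefs(csv_text):
    """Pair rows that end mid-trip with rows that start mid-trip at the same point.

    Returns tuples of (outgoing run_id, incoming run_id, trip_id, location, time).
    """
    # Strip padding, since the examples align columns with spaces.
    rows = [
        {key: (value or "").strip() for key, value in row.items()}
        for row in csv.DictReader(io.StringIO(csv_text))
    ]
    leaving = [r for r in rows if r["end_mid_trip"] == "1"]
    arriving = [r for r in rows if r["start_mid_trip"] == "1"]

    reliefs = []
    for out in leaving:
        handoff = (out["service_id"], out["trip_id"], out["end_location"], out["end_time"])
        for inc in arriving:
            pickup = (inc["service_id"], inc["trip_id"], inc["start_location"], inc["start_time"])
            if handoff == pickup:
                reliefs.append((out["run_id"], inc["run_id"]) + handoff[1:])
    return reliefs
```

Run against the rows above, this would pair run 10000 and run 20000 on trip 102 at mid-relief-stop at 11:30:00.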

Two-car MBTA Green Line train with an operator for each car. The event_type field distinguishes whether an operator is in the front car or the rear car. The operators swap for the return trip.

service_id,run_id,event_sequence,piece_id,block_id,job_type,event_type,trip_id,start_location,start_time,start_mid_trip,end_location,end_time,end_mid_trip
weekday,10000,10,,,Motorperson,Pilot  ,trip-1,stop-1,10:00:00,0,stop-2,10:58:00,0
weekday,10000,20,,,Motorperson,Trailer,trip-2,stop-2,11:00:00,0,stop-1,11:58:00,0
weekday,20000,10,,,Motorperson,Trailer,trip-1,stop-1,10:00:00,0,stop-2,10:58:00,0
weekday,20000,20,,,Motorperson,Pilot  ,trip-2,stop-2,11:00:00,0,stop-1,11:58:00,0

Edit history

jeffkessler-keolis commented 6 months ago

This largely looks good to me! Three suggested tweaks, and one comment:

Thanks!

skyqrose commented 6 months ago
  1. This is probably a good idea, but I'm still thinking through it. There is precedent in the GTFS docs for "Primary key (*)" and for using time fields in a primary key, which I hadn't realized before.
  2. Fixed
  3. This should be a general principle of all ODS files, and should be mentioned in documentation for how the _supplement files are used, not in every place that an ODS file has a reference.
  4. Overlapping:

The data I've used for MBTA has an "Operator" event that lasts for a whole piece, and overlaps with every trip on that piece. That event could be removed, so it's not a big deal. But an agency could use this to represent other whole-shift labels, like which part of the day an employee is getting paid for.

And if someone hypothetically really does have overlapping responsibilities, it'll be way better to represent them as two separate events as "job A" and "job B", instead of one combined "job A+B" event, which would break querying the data for anybody doing "job B".

It does mean that consumers need to understand the meaning of event_types in order to detangle overlapping events, but consumers have to know the meaning of event_types anyway to do anything useful with them.

I think it's important for producers to have the flexibility to represent their data accurately, and that's worth the small added complexity for consumers. But I also have a producer's point of view, so I'd love to hear others' opinions on this.
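
To give a sense of the consumer-side cost, here is a minimal sketch of how overlapping events within one run could be detected. The function names are hypothetical, and the choice to treat events that only touch at an endpoint as non-overlapping is an assumption for this illustration.

```python
from itertools import combinations


def gtfs_seconds(value):
    """Convert an HH:MM:SS time (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in value.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def overlapping_events(events):
    """Return (event_sequence, event_sequence) pairs that overlap in time.

    `events` is a list of dicts for a single (service_id, run_id).
    """
    pairs = []
    for a, b in combinations(events, 2):
        a_start, a_end = gtfs_seconds(a["start_time"]), gtfs_seconds(a["end_time"])
        b_start, b_end = gtfs_seconds(b["start_time"]), gtfs_seconds(b["end_time"])
        # Strict comparison: back-to-back events do not count as overlapping.
        if a_start < b_end and b_start < a_end:
            pairs.append((a["event_sequence"], b["event_sequence"]))
    return pairs
```

For a whole-piece "Operator" event that spans two trips, this would report the Operator event overlapping each trip, while the two trips would not overlap each other.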

jfabi commented 6 months ago

Regarding (1), the run_event_id question, the main uses for the field I see are:

I think I'm in agreement that it's fine to not require run_event_id, or at least not for rows with a non-null trip_id, though I'd love to hear from any consumers on the matter!

BTollison commented 6 months ago

It seems like there could end up being 2 primary keys here, run_id and run_event_id. Can we just use run_id? I think it would be handy to also have a sequence field.

Also, it isn't clear to me why we need start_mid_trip and end_mid_trip if we always give the location at which we start or end.

skyqrose commented 6 months ago

mid_route:

I guess start/end_mid_route aren't necessary, and you could determine if it's mid_trip by comparing the location/time to the first/last of the trip's stop_times. But mid_route events are important, and comparing through trips.txt and stop_times.txt is hard. The field would concretely be useful at MBTA for handling bus operator schedules where we frequently have mid-route swing ons.
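
For comparison, this is roughly what a consumer would have to do without the flags. It is a sketch under assumptions: `infer_mid_trip` is a hypothetical helper, the caller has already looked up the trip's rows from stop_times.txt, the first stop is compared by departure_time and the last by arrival_time, and times are formatted identically in both files.

```python
def infer_mid_trip(event, trip_stop_times):
    """Guess (starts_mid_trip, ends_mid_trip) for a run event.

    `event` is one run_events.txt row as a dict.
    `trip_stop_times` is the list of stop_times.txt rows (dicts) for event["trip_id"].
    """
    ordered = sorted(trip_stop_times, key=lambda st: int(st["stop_sequence"]))
    first, last = ordered[0], ordered[-1]

    # Compare both location and time, since a loop trip can start and end
    # at the same stop. Times are compared as strings here.
    starts_mid = (event["start_location"], event["start_time"]) != (
        first["stop_id"],
        first["departure_time"],
    )
    ends_mid = (event["end_location"], event["end_time"]) != (
        last["stop_id"],
        last["arrival_time"],
    )
    return starts_mid, ends_mid
```

Even this short version needs a join against stop_times.txt for every trip row, which is the work the explicit fields avoid.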

Primary Key:

I do think that some sort of id is important. I expect to want references into this file, and if the Primary Key is *, that's impossible: a reference would have to include all the columns.

Instead of run_event_id or *, the Primary Key could be (service_id, run_id, start_time, event_type). An employee could have multiple responsibilities at the same time, but probably wouldn't start the same event twice at the same time? I don't quite like this because using event_type as part of an id makes it less free as a free text field. If some agency ever does represent their data with two events at the same time, then the event_type can't be both unique and consistent.

run_id can't be a primary key on its own because one run has multiple rows.

A sequence field couldn't be guaranteed to be sequential because rows can overlap in time, and it isn't needed for sequencing because consumers can work with the times instead. But maybe a sequence-ish field that's unique within a run would be useful for providing an order and an id? What do people think about this:

| Field name | Type | Required | Description |
|---|---|---|---|
| event_sequence | Non-negative integer | Required | The order of this event within a run. Unique within the run. Note that events may overlap in time. If Event A and Event B are on the same `service_id` and `run_id`, and Event A has a `start_time` before Event B, then Event A's `event_sequence` should be less than Event B's. If Event A and B have the same `start_time`, but Event A has an `end_time` before Event B, then Event A's `event_sequence` should be less than Event B's. If Event A and B have the same `start_time` and `end_time`, then their `event_sequence` values can be in either order, but they must be different. |

Primary Key: (service_id, run_id, event_sequence) (this is also the recommended sort order of the file).
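
One way a producer might generate values that satisfy this ordering rule is to sort a run's events by `start_time`, then `end_time`, and number them. The sketch below assumes that approach; `assign_event_sequence` and its `step` parameter are hypothetical names, and the step of 10 only mirrors the spacing used in the examples in this issue.

```python
def gtfs_seconds(value):
    """Convert an HH:MM:SS time (hours may exceed 24) to seconds."""
    hours, minutes, seconds = (int(part) for part in value.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def assign_event_sequence(events, step=10):
    """Return copies of one run's events with event_sequence filled in.

    Events are ordered by start_time, then end_time. Ties keep their input
    order, which is allowed as long as the values differ.
    """
    ordered = sorted(
        events,
        key=lambda e: (gtfs_seconds(e["start_time"]), gtfs_seconds(e["end_time"])),
    )
    return [
        {**event, "event_sequence": (index + 1) * step}
        for index, event in enumerate(ordered)
    ]
```

Because the sort key is numeric, times past midnight such as 25:10:00 sort after 23:50:00, which a plain string sort would only get right if every time were zero-padded.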

skyqrose commented 4 months ago

Closing this issue since it's completely covered by #66. Further discussion should happen there.