new file: run_events.txt

This is a combination of #51, #52, and #54, updated based on discussion and assuming some form of #55 is accepted.

I changed the name from #51: I now propose run_events.txt rather than runs.txt, because the file has one row per event, not one row per run, and because it merges in the existing run_events.txt table.

This is a concrete proposal, meant to close issues rather than open them. It's intended to be acceptable as is, without any open questions or TODOs, though of course I expect discussion and minor changes. I will edit the proposal with any changes.

Summary:

A new file run_events.txt, which lists all of a run's trips, deadheads, and events.
Compared to previous versions, working on a trip/deadhead is now just a normal event, but one where the trip_id field is set.
This replaces the existing runs_pieces.txt and run_events.txt files. Those files would be removed. (deadheads.txt would also be removed by #55.)
See #51 for the list of issues that motivated this proposal.

Full Documentation:

Primary Key: (service_id, run_id, event_sequence)

Field name	Type	Required	Description	(proposal comment)
`service_id`	ID referencing `calendar.service_id`	Required	Identifies a set of dates when the run is scheduled to take place.
`run_id`	ID	Required
`event_sequence`	Non-negative integer	Required	The order of this event within a run. Must be unique within one (`service_id`, `run_id`). It's required and unique so it can be used in the Primary Key to uniquely identify events. Note that events may overlap in time. If they do, it may not be possible to define a single ordering that's correct for all uses. This column provides one consistent ordering. If a consumer cares about how overlapping events are ordered, they should sort based on the time fields and `event_type`. If Event A and Event B are on the same `service_id` and `run_id`, and Event A has a `start_time` before Event B, then Event A's `event_sequence` should be less than Event B's. If Event A and B have the same `start_time`, but Event A has an `end_time` before Event B, then event A's `event_sequence` should be less than event B's. If Event A and B have the same `start_time` and `end_time`, then their `event_sequence` values can be in either order, but they must be different. Values do not have to be consecutive.	Added after discussion.
`piece_id`	ID	Optional	Identifies the piece within the run in which the event takes place. May be blank if the event takes place outside of a piece, like a break, or if the agency does not use piece IDs.
`block_id`	ID referencing `trips.block_id`	Optional	Identifies the block to which the run row belongs. If `block_id` exists, `trip_id` exists, and that trip's entry in `trips.txt` has a `block_id`, then the two `block_id`s must match. May exist even if `trip_id` does not (e.g. if an event represents a run-as-directed block with no scheduled trips).
`job_type`	Text	Optional	The type of job that the employee is doing, in a human-readable format. e.g. "Assistant Conductor". Producers may use any values, but should be consistent. A single run may include more than one `job_type` throughout the day if the employee has multiple responsibilities, e.g. an "Operator" in the morning and a "Shifter" in the afternoon.	Based on discussion in #54.
`event_type`	Text	Required	The type of event that the employee is doing, in a human-readable format. e.g. "Sign-in". Producers may use any values, but should be consistent. Consumers may ignore events with an `event_type` that they don't recognize.	Based on discussion in #54. Replaces `run_events.event_type`, which was a numeric enum with specific supported values. We could consider publishing a list of standard values to use here, for common activities such as "Sign-in", "Operator", and "Break", but producers should be able to use arbitrary values in addition to standard values. The field is `Text` rather than `ID` or `Enum` so that even if consumers don't understand the meaning of a specific `event_type`, they can still display it.
`trip_id`	ID referencing `trips.trip_id`	Optional	If this run event corresponds to working on a trip, identifies that trip.	No longer need separate trip and deadhead ids, because of #55
`start_location`	ID referencing `stops.stop_id`	Required	Identifies where the employee starts this event. If `trip_id` is set (and `start_mid_trip` is not `1`), this should be the first stop of the trip. If `start_mid_trip` is `1`, this should instead be the location where the employee starts, in the middle of the trip.	Location/time are always required, even for rows that correspond to trips, because it makes the spec simpler and consuming much easier.
`start_time`	Time	Required	Identifies the time when the employee starts this event. If `trip_id` is set (and `start_mid_trip` is not `1`), this should be the time of the first stop of the trip. If `start_mid_trip` is `1`, this should instead be the time when the employee starts, in the middle of the trip.
`start_mid_trip`	Enum	Conditionally required	Indicates whether the event begins at the start of the trip or in the middle of the trip. `0` (or blank): Row does not start mid-trip. `1`: Row starts mid-trip. Required if the run event begins with a mid-trip relief. Optional otherwise. Recommended to leave this field blank if `trip_id` is not set.
`end_location`	ID referencing `stops.stop_id`	Required	Identifies where the employee ends this event. If `trip_id` is set (and `end_mid_trip` is not `1`), this should be the last stop of the trip. If `end_mid_trip` is `1`, this should instead be the location where the employee ends, in the middle of the trip.
`end_time`	Time	Required	Identifies the time when the employee ends this event. If `trip_id` is set (and `end_mid_trip` is not `1`), this should be the time of the last stop of the trip. If `end_mid_trip` is `1`, this should instead be the time when the employee ends, in the middle of the trip. Must be greater than or equal to `start_time`.	Relevant to discussion in #48: Note that this is time, not duration. Time makes more sense for a combined trips+events table, especially for mid-route reliefs. If we decide to add a `minimum_duration` field or something like it, that would be in addition to this field.
`end_mid_trip`	Enum	Conditionally required	Indicates whether the event ends at the end of the trip or in the middle of the trip. `0` (or blank): Row does not end mid-trip. `1`: Row ends mid-trip. Required if the run event ends with a mid-trip relief. Optional otherwise. Recommended to leave this field blank if `trip_id` is not set.

Multiple run_events can refer to the same trip_id, if multiple employees work on that trip.
Events may have gaps between the end time of one event and the start time of the next. E.g. if an operator's layovers aren't represented by an event.
Events may overlap in time, if an employee has multiple simultaneous responsibilities.
start_time may equal end_time for an event that's a single point in time (such as a report time) without any duration.
Recommended sort order: service_id, run_id, event_sequence.
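
To make the rules above concrete, here is a rough sketch of how a consumer might check them. It is not part of the proposal: the function names `to_seconds` and `check_run_events` are made up for illustration, and rows are assumed to be dicts keyed by the column names in the table above.

```python
from collections import defaultdict


def to_seconds(gtfs_time: str) -> int:
    """Convert HH:MM:SS to seconds. GTFS-style times may exceed 24:00:00."""
    hours, minutes, seconds = (int(part) for part in gtfs_time.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def check_run_events(rows: list[dict]) -> list[str]:
    """Return a list of human-readable problems found in run_events rows."""
    problems = []
    runs = defaultdict(list)
    for row in rows:
        runs[(row["service_id"], row["run_id"])].append(row)

    for key, events in runs.items():
        # event_sequence must be unique within one (service_id, run_id).
        sequences = [int(e["event_sequence"]) for e in events]
        if len(set(sequences)) != len(sequences):
            problems.append(f"{key}: duplicate event_sequence")

        # end_time must be greater than or equal to start_time.
        for e in events:
            if to_seconds(e["end_time"]) < to_seconds(e["start_time"]):
                problems.append(f"{key} seq {e['event_sequence']}: end_time before start_time")

        # Ordering by event_sequence should agree with ordering by (start_time, end_time).
        ordered = sorted(events, key=lambda e: int(e["event_sequence"]))
        times = [(to_seconds(e["start_time"]), to_seconds(e["end_time"])) for e in ordered]
        if times != sorted(times):
            problems.append(f"{key}: event_sequence disagrees with time order")
    return problems
```

Times are compared as seconds rather than as strings so that after-midnight values such as 25:10:00 sort correctly.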

Examples

Single Run with Multiple Pieces and Pre-trip inspection

service_id,run_id,event_sequence,piece_id,block_id,job_type,event_type,trip_id,start_location,start_time,start_mid_trip,end_location,end_time,end_mid_trip
daily,10000,10,       ,       ,Operator,Report Time,        ,yard,09:45:00,,yard,09:45:00,
daily,10000,20,10000-1,       ,Operator,Pre-Trip Inspection,,yard,09:45:00,,yard,09:55:00,
daily,10000,30,10000-1,BLOCK-A,Operator,Pull-out,deadhead-1 ,yard,09:55:00,0,stop-1,09:58:00,0
daily,10000,40,10000-1,BLOCK-A,Operator,Operator,101        ,stop-1,10:00:00,0,stop-2,10:58:00,0
daily,10000,50,10000-1,BLOCK-A,Operator,Operator,102        ,stop-2,11:00:00,0,stop-1,11:58:00,0
daily,10000,60,       ,       ,Operator,Break,              ,stop-1,11:58:00,,stop-1,13:00:00,
daily,10000,70,10000-2,BLOCK-B,Operator,Operator,103        ,stop-1,13:00:00,0,stop-2,13:58:00,0
daily,10000,80,10000-2,BLOCK-B,Operator,Operator,104        ,stop-2,14:00:00,0,stop-1,14:58:00,0
daily,10000,90,10000-2,BLOCK-B,Operator,Pull-back,deadhead-2,stop-1,15:00:00,0,yard,15:03:00,0

Multiple Runs with Mid-Trip Relief

service_id,run_id,event_sequence,piece_id,block_id,job_type,event_type,trip_id,start_location,start_time,start_mid_trip,end_location,end_time,end_mid_trip
daily,10000,10,10000-1,BLOCK-A,Operator,Pull-out,deadhead-1 ,yard,09:55:00,0,stop-1,09:58:00,0
daily,10000,20,10000-1,BLOCK-A,Operator,Operator,101        ,stop-1,10:00:00,0,stop-2,10:58:00,0
daily,10000,30,10000-1,BLOCK-A,Operator,Operator,102        ,stop-2,11:00:00,0,mid-relief-stop,11:30:00,1
daily,20000,10,20000-1,BLOCK-B,Operator,Operator,102        ,mid-relief-stop,11:30:00,1,stop-2,13:58:00,0
daily,20000,20,20000-1,BLOCK-B,Operator,Pull-back,deadhead-2,stop-1,14:00:00,0,yard,14:03:00,0
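
As an illustration of how a consumer could use the mid-trip flags, the sketch below pairs the operator leaving a trip with the operator taking it over. The helper names `load_rows` and `find_reliefs` are hypothetical, and it assumes that a relief happens at the same location and time for both operators, which the proposal does not require. The sample contains only the two relief rows from the example above.

```python
import csv
import io

SAMPLE = """\
service_id,run_id,event_sequence,piece_id,block_id,job_type,event_type,trip_id,start_location,start_time,start_mid_trip,end_location,end_time,end_mid_trip
daily,10000,30,10000-1,BLOCK-A,Operator,Operator,102        ,stop-2,11:00:00,0,mid-relief-stop,11:30:00,1
daily,20000,10,20000-1,BLOCK-B,Operator,Operator,102        ,mid-relief-stop,11:30:00,1,stop-2,13:58:00,0
"""


def load_rows(text: str) -> list[dict]:
    """Parse CSV text and strip the padding used to align the example columns."""
    reader = csv.DictReader(io.StringIO(text))
    return [{name: value.strip() for name, value in row.items()} for row in reader]


def find_reliefs(rows: list[dict]) -> list[tuple]:
    """Return (leaving_run, arriving_run, trip_id, location, time) for each mid-trip relief."""
    reliefs = []
    for leaving in rows:
        if leaving["end_mid_trip"] != "1":
            continue
        for arriving in rows:
            if (
                arriving["start_mid_trip"] == "1"
                and arriving["service_id"] == leaving["service_id"]
                and arriving["trip_id"] == leaving["trip_id"]
                # Assumption: both operators are at the same place at the same time.
                and arriving["start_location"] == leaving["end_location"]
                and arriving["start_time"] == leaving["end_time"]
            ):
                reliefs.append((
                    leaving["run_id"], arriving["run_id"], leaving["trip_id"],
                    leaving["end_location"], leaving["end_time"],
                ))
    return reliefs


print(find_reliefs(load_rows(SAMPLE)))
# [('10000', '20000', '102', 'mid-relief-stop', '11:30:00')]
```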

Two-car MBTA Green Line train with an operator for each car. The event_type field distinguishes whether an operator is in the front car or the rear car. The operators swap for the return trip.

service_id,run_id,event_sequence,piece_id,block_id,job_type,event_type,trip_id,start_location,start_time,start_mid_trip,end_location,end_time,end_mid_trip
weekday,10000,10,,,Motorperson,Pilot  ,trip-1,stop-1,10:00:00,0,stop-2,10:58:00,0
weekday,10000,20,,,Motorperson,Trailer,trip-2,stop-2,11:00:00,0,stop-1,11:58:00,0
weekday,20000,10,,,Motorperson,Trailer,trip-1,stop-1,10:00:00,0,stop-2,10:58:00,0
weekday,20000,20,,,Motorperson,Pilot  ,trip-2,stop-2,11:00:00,0,stop-1,11:58:00,0
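
Because several rows can point at the same trip_id, a consumer can recover the crew of each trip by grouping on that column. A minimal sketch using the rows above (the function name `crew_by_trip` is made up for illustration):

```python
import csv
import io
from collections import defaultdict

SAMPLE = """\
service_id,run_id,event_sequence,piece_id,block_id,job_type,event_type,trip_id,start_location,start_time,start_mid_trip,end_location,end_time,end_mid_trip
weekday,10000,10,,,Motorperson,Pilot  ,trip-1,stop-1,10:00:00,0,stop-2,10:58:00,0
weekday,10000,20,,,Motorperson,Trailer,trip-2,stop-2,11:00:00,0,stop-1,11:58:00,0
weekday,20000,10,,,Motorperson,Trailer,trip-1,stop-1,10:00:00,0,stop-2,10:58:00,0
weekday,20000,20,,,Motorperson,Pilot  ,trip-2,stop-2,11:00:00,0,stop-1,11:58:00,0
"""


def crew_by_trip(rows) -> dict:
    """Map each trip_id to the (run_id, event_type) pairs that work on it."""
    crew = defaultdict(list)
    for row in rows:
        trip_id = row["trip_id"].strip()
        if trip_id:  # Rows without a trip_id (breaks, sign-ins) are skipped.
            crew[trip_id].append((row["run_id"].strip(), row["event_type"].strip()))
    return dict(crew)


crew = crew_by_trip(csv.DictReader(io.StringIO(SAMPLE)))
print(crew["trip-1"])  # [('10000', 'Pilot'), ('20000', 'Trailer')]
print(crew["trip-2"])  # [('10000', 'Trailer'), ('20000', 'Pilot')]
```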

Edit history

Renamed start/end_mid_route to start/end_mid_trip.
Noted that end_time must be >= start_time.
(2024-04-01) Replaced run_event_id with event_sequence. Updated Primary Key and examples.

This largely looks good to me! Three suggested tweaks, and one comment:

1. I'm not a fan of making run_event_id a required value since most operations would either be using an internal value from their scheduling system that'd never be referenced, or making up values on-the-fly that could easily get confused with other values. I think we're much better off making it an optional value as a result, and clarifying that this column must be dataset unique.
2. start_mid_route and end_mid_route should be renamed to either start_en_route and end_en_route or start_mid_trip and end_mid_trip to clarify that these need not be in the middle of a route, but rather the middle of a trip.
3. We should clarify that all start/end times and locations are based on the supplemented entries (per https://github.com/cal-itp/operational-data-standard/issues/55), not just the entries in the base GTFS. Thus, if a different stop_id, time, or value is used, or if a different row entirely is the first/last stop per the supplemented entry data, the values in here that reference GTFS should be matched/based on the values after their modification from the respective base/public GTFS.
4. From the producer side, I can't think of a scenario in which I'd ever want to have planned overlapping events, and I could see several consumers having issues processing overlapping events, hence I could see some issue with such permission. That said, this would not affect my realm, so this is more of a flag for input from others than something that would prevent my support.

Thanks!

1. This is probably a good idea, but I'm still thinking through it. There is precedent in the GTFS docs for "Primary key (*)" and for using time fields in a primary key, which I hadn't realized before.
2. Fixed.
3. This should be a general principle of all ODS files, and should be mentioned in documentation for how the _supplement files are used, not in every place that an ODS file has a reference.
4. Overlapping:

The data I've used for MBTA has an "Operator" event that lasts for a whole piece, and overlaps with every trip on that piece. That event could be removed, so it's not a big deal. But an agency could use this to represent other whole-shift labels, like which part of the day an employee is getting paid for.

And if someone hypothetically really does have overlapping responsibilities, it'll be way better to represent them as two separate events as "job A" and "job B", instead of one combined "job A+B" event, which would break querying the data for anybody doing "job B".

It does mean that consumers need to understand the meaning of event_types in order to detangle overlapping events, but consumers have to know the meaning of event_types anyway to do anything useful with them.

I think it's important for flexibility in producers being able to represent their data in an accurate way, and is worth the added small complexity for consumers. But also I have a producer's point of view, so I'd love to hear others' opinions on this.
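
For a sense of the added consumer complexity, here is a rough sketch of detecting overlapping events within one run. The rows are hypothetical (a whole-piece "Operator" label spanning two trips, similar to the MBTA case described above), and `overlapping_pairs` is an illustrative name, not part of the proposal.

```python
from itertools import combinations


def to_seconds(gtfs_time: str) -> int:
    """Convert HH:MM:SS to seconds. GTFS-style times may exceed 24:00:00."""
    hours, minutes, seconds = (int(part) for part in gtfs_time.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def overlapping_pairs(events: list[dict]) -> list[tuple]:
    """Return pairs of event_sequence values whose time ranges overlap within one run."""
    pairs = []
    for a, b in combinations(events, 2):
        a_start, a_end = to_seconds(a["start_time"]), to_seconds(a["end_time"])
        b_start, b_end = to_seconds(b["start_time"]), to_seconds(b["end_time"])
        # Strict inequalities: events that only touch at an endpoint do not count.
        if a_start < b_end and b_start < a_end:
            pairs.append((a["event_sequence"], b["event_sequence"]))
    return pairs


# Hypothetical run: one whole-piece label plus the two trips it covers.
run = [
    {"event_sequence": 10, "event_type": "Operator", "start_time": "10:00:00", "end_time": "11:58:00"},
    {"event_sequence": 20, "event_type": "Trip", "start_time": "10:00:00", "end_time": "10:58:00"},
    {"event_sequence": 30, "event_type": "Trip", "start_time": "11:00:00", "end_time": "11:58:00"},
]
print(overlapping_pairs(run))  # [(10, 20), (10, 30)]
```

A consumer that cannot handle overlaps could use a check like this to decide which `event_type` values to drop before further processing.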

Regarding (1), the run_event_id question, the main uses for the field I see are:

Being able to have a handy and succinct list of all the events/trips that an employee/run will be working. You could do this instead by referencing all of the other columns, of course, so this may be less compelling.
Being able to reference and update events in real time and for recordkeeping and post-analyses. This would seem to all but require IDs, though an "ODS-EventUpdates" feed is still a number of steps in the future. Even in that scenario, I could see a direction where IDs (or a complex primary key) may only be needed for run events lacking a trip_id.

I think I'm in agreement that it's fine to not require run_event_id, or at least not for rows with a non-null trip_id, though I'd love to hear from any consumers on the matter!

It seems like there could end up being 2 primary keys here, run_id and run_event_id. Can we just use run_id? I think it would be handy to also have a sequence field.

Also, it isn't clear to me why we need start_mid_trip and end_mid_trip if we always give the location at which we start or end.

mid_route:

I guess start/end_mid_route aren't necessary, and you could determine if it's mid_trip by comparing the location/time to the first/last of the trip's stop_times. But mid_route events are important, and comparing through trips.txt and stop_times.txt is hard. The field would concretely be useful at MBTA for handling bus operator schedules where we frequently have mid-route swing ons.
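
For comparison, this is roughly what a consumer would have to do without the flags. It is only a sketch: `mid_trip_flags` is a made-up name, the stop_times rows are hypothetical, and it assumes the event's times are formatted identically to the ones in stop_times.txt so that string comparison is enough.

```python
def mid_trip_flags(event: dict, stop_times: list[dict]) -> tuple[int, int]:
    """Infer (start_mid_trip, end_mid_trip) by comparing an event with its trip's stop_times."""
    stops = sorted(
        (st for st in stop_times if st["trip_id"] == event["trip_id"]),
        key=lambda st: int(st["stop_sequence"]),
    )
    first, last = stops[0], stops[-1]
    # Compare both location and time, since a loop trip can visit its first stop again.
    starts_mid = (event["start_location"], event["start_time"]) != (first["stop_id"], first["departure_time"])
    ends_mid = (event["end_location"], event["end_time"]) != (last["stop_id"], last["arrival_time"])
    return int(starts_mid), int(ends_mid)


# Hypothetical stop_times.txt rows for one trip.
stop_times = [
    {"trip_id": "102", "stop_sequence": "1", "stop_id": "stop-2", "arrival_time": "11:00:00", "departure_time": "11:00:00"},
    {"trip_id": "102", "stop_sequence": "2", "stop_id": "mid-relief-stop", "arrival_time": "11:30:00", "departure_time": "11:30:00"},
    {"trip_id": "102", "stop_sequence": "3", "stop_id": "stop-1", "arrival_time": "11:58:00", "departure_time": "11:58:00"},
]
first_half = {"trip_id": "102", "start_location": "stop-2", "start_time": "11:00:00",
              "end_location": "mid-relief-stop", "end_time": "11:30:00"}
print(mid_trip_flags(first_half, stop_times))  # (0, 1)
```

Every consumer would need the trip's full stop_times (after any supplement files are applied) just to answer a yes/no question, which is the cost the explicit fields avoid.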

Primary Key:

I do think that some sort of id is important. I expect to want to have references into this file, and if the Primary Key is *, that's impossible: a reference would have to include all the columns.

Instead of run_event_id or *, the Primary Key could be (service_id, run_id, start_time, event_type). An employee could have multiple responsibilities at the same time, but probably wouldn't start the same event twice at the same time? I don't quite like this because using event_type as part of an id makes it less free as a free text field. If some agency ever does represent their data with two events at the same time, then the event_type can't be both unique and consistent.

run_id can't be a primary key on its own because one run has multiple rows.

A sequence field couldn't be guaranteed to be sequential because rows can overlap in time, and it isn't needed for sequencing because consumers can work with the times instead. But maybe a sequence-ish field that's unique within a run would be useful for providing an order and an id? What do people think about this:

Field name	Type	Required	Description
`event_sequence`	Non-negative integer	Required	The order of this event within a run. Unique within the run. Note that events may overlap in time. If Event A and Event B are on the same `service_id` and `run_id`, and Event A has a `start_time` before Event B, then Event A's `event_sequence` should be less than Event B's. If Event A and B have the same `start_time`, but Event A has an `end_time` before Event B, then event A's `event_sequence` should be less than event B's. If Event A and B have the same `start_time` and `end_time`, then their `event_sequence` values can be in either order, but they must be different.

Primary Key: (service_id, run_id, event_sequence) (this is also the recommended sort order of the file).
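
On the producer side, values that satisfy this rule can be generated by sorting a run's events on (start_time, end_time) and numbering them. A minimal sketch, where `assign_event_sequence` and the step of 10 are illustrative choices rather than anything the proposal requires:

```python
def to_seconds(gtfs_time: str) -> int:
    """Convert HH:MM:SS to seconds. GTFS-style times may exceed 24:00:00."""
    hours, minutes, seconds = (int(part) for part in gtfs_time.strip().split(":"))
    return hours * 3600 + minutes * 60 + seconds


def assign_event_sequence(events: list[dict], step: int = 10) -> list[dict]:
    """Return copies of one run's events, numbered in (start_time, end_time) order."""
    ordered = sorted(
        events,
        key=lambda e: (to_seconds(e["start_time"]), to_seconds(e["end_time"])),
    )
    # Events with identical start and end times keep their input order (sorted is stable).
    return [dict(e, event_sequence=(i + 1) * step) for i, e in enumerate(ordered)]


run = [
    {"event_type": "Break", "start_time": "11:58:00", "end_time": "13:00:00"},
    {"event_type": "Pre-Trip Inspection", "start_time": "09:45:00", "end_time": "09:55:00"},
    {"event_type": "Report Time", "start_time": "09:45:00", "end_time": "09:45:00"},
]
for event in assign_event_sequence(run):
    print(event["event_sequence"], event["event_type"])
# 10 Report Time
# 20 Pre-Trip Inspection
# 30 Break
```

Leaving gaps between values makes it possible to insert an event later without renumbering the rest of the run.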

Closing this issue since it's completely covered by #66. Further discussion should happen there.
