new file runs.txt, and associated changes

skyqrose commented 9 months ago

Background / Existing problems

There are several oversights in the existing ODS spec that make it impossible for MBTA to represent our schedule:

The existing runs_pieces.txt file does not provide a link between a run/piece and all its associated trips, deadheads, and run events.

Given a run_id or piece_id, which trips, deadheads, and events are on that run?
- You could check runs_pieces.start_trip_id and runs_pieces.end_trip_id to see which block is associated with a piece, but this is cumbersome.
Given a trip_id or deadhead_id, what run/piece is it on?
- For a data consumer (whether a CAD/AVL vendor or an agency analyst) given a trip or deadhead, it is hard to trace back which run/piece it operates on: they must review every trip on the block and determine whether any of them appears in runs_pieces.txt.
If a piece consists of more than two vehicle blocks, such as scheduled drop-backs where an operator's block changes after every trip, there is no way to connect an operator's run and trips.
Also, depending on one's interpretation of the specification, runs_pieces.start_trip_id and runs_pieces.end_trip_id may be null if a run piece both begins and ends with an event, making it impossible to use them to match a run to a block.
run_events.txt uses a piece_id field to make the association easy, but this field could not be added to trips.txt in GTFS, so we need a new way in ODS to make the connection.

The current runs_pieces.txt also does not allow for the representation of a piece that consists only of events, such as extraboard (aka cover) or other run-as-directed work, because runs_pieces.start_trip_id and runs_pieces.end_trip_id are required fields and do not allow for run events.

Finally, the current specification does not allow for non-unique run or piece identifiers, even though many agencies reuse run "numbers" between divisions or day-types.

A new file, runs.txt, would address all of these problems.

new file: `runs.txt`

Primary key: (service_id, run_id, run_row_type, run_row_id)

This proposal uses (service_id, run_id) as a pair to solve the run-uniqueness problem.
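
As a rough illustration of the keys described above, here is a minimal Python sketch. The sample rows and the helper names `duplicate_keys` and `distinct_runs` are hypothetical, not part of the proposal; the sketch only shows that a run "number" reused across day-types stays distinguishable once it is paired with service_id.

```python
from collections import Counter

# Hypothetical rows: the same run "number" (1001) reused on two day-types.
rows = [
    {"service_id": "weekday", "run_id": "1001", "run_row_type": "1", "run_row_id": "trip-1"},
    {"service_id": "weekday", "run_id": "1001", "run_row_type": "0", "run_row_id": "dh-1"},
    {"service_id": "saturday", "run_id": "1001", "run_row_type": "1", "run_row_id": "trip-9"},
]


def duplicate_keys(rows):
    """Return primary-key tuples that appear more than once."""
    counts = Counter(
        (r["service_id"], r["run_id"], r["run_row_type"], r["run_row_id"])
        for r in rows
    )
    return [key for key, n in counts.items() if n > 1]


def distinct_runs(rows):
    """A run is identified by the (service_id, run_id) pair, not by run_id alone."""
    return sorted({(r["service_id"], r["run_id"]) for r in rows})


print(duplicate_keys(rows))
print(distinct_runs(rows))
```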

Description

Lists all of the trips, deadheads, and events associated with each run and piece in a many-to-one relationship.

The start/end time/location of each run row are denormalized from trips.txt/stop_times.txt/deadheads.txt/deadhead_times.txt/run_events.txt. They're needed here because knowing when and where someone is working is important, and looking it up across several other files is too cumbersome. They're also needed to show where within a trip a mid-trip relief happens.

The mid_trip flag is 1 for trips with mid-trip relief. It means the start/end time/location are for this operator's work, not the start/end of the trip. The flag could be set to 0 to say "there's no mid-trip relief, and the operator's work in this file corresponds to the trip ends in stop_times.txt or deadhead_times.txt". Or it could be left blank. It should be blank for all events (type 2).

The times don't have to fit perfectly together, e.g. for a layover. The employee is considered to be on their run/piece between the start_time of the earliest row on that run/piece and the end_time of the last row. It's allowed for times to overlap, in the case that there's a run_event for a task that an employee does concurrently with driving.
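
The on-duty rule above could be sketched roughly as follows. This is only an illustration under some assumptions: rows are dicts keyed by the proposed column names, times use the GTFS-style HH:MM:SS form (where hours may exceed 23), "last row" is read as the row with the latest end time, and the helper names are made up for this example.

```python
def parse_time(value):
    """Convert HH:MM:SS to seconds; hours may exceed 23 for service past midnight."""
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def on_duty_span(rows):
    """Return (earliest start, latest end) in seconds for the rows of one run or piece.

    Rows with blank times are skipped; returns None if no times are present.
    Gaps (layovers) and overlaps (concurrent events) do not affect the result.
    """
    starts = [parse_time(r["run_row_start_time"]) for r in rows if r.get("run_row_start_time")]
    ends = [parse_time(r["run_row_end_time"]) for r in rows if r.get("run_row_end_time")]
    if not starts or not ends:
        return None
    return min(starts), max(ends)


# Hypothetical piece: a deadhead, a layover gap, a trip, and an event that overlaps the trip.
piece_rows = [
    {"run_row_start_time": "08:00:00", "run_row_end_time": "08:30:00"},
    {"run_row_start_time": "08:45:00", "run_row_end_time": "09:40:00"},
    {"run_row_start_time": "08:45:00", "run_row_end_time": "08:50:00"},
]
print(on_duty_span(piece_rows))
```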

Field name	Type	Required	Description
`service_id`	ID referencing `calendar.service_id`	Required	Identifies a set of dates when the run is scheduled to take place.
`run_id`	ID	Required
`piece_id`	ID	Optional	Identifies the piece during which the run row takes place. May be left null for rows that take place outside of a piece, such as a break. _\[Note: Only matters if allowed in [Proposal 2](https://github.com/cal-itp/operational-data-standard/issues/52)\]_
`block_id`	ID referencing `deadheads.block_id` or `trips.block_id`	Optional	Identifies the block to which the run row belongs. If omitted, this may be derived from `trips.txt` or `deadheads.txt`. If populated, this value must match that in `trips.txt` or `deadheads.txt`, for the given `trip_id` or `deadhead_id`.
`run_row_type`	Enum	Required	Indicates whether the run row consists of a deadhead, a revenue trip, or an event. 0 - Deadhead; 1 - Trip; 2 - Event.
`run_row_id`	ID referencing `deadheads.deadhead_id` or `trips.trip_id` or `run_events.run_event_id`	Required	Identifies the specific deadhead, trip, or event associated with the run row.
`run_row_start_time`	Time	Conditionally required	Identifies the time at which the run piece begins to be associated with the row's deadhead, trip, or event. Required if `run_row_start_mid_trip` is 1. Recommended otherwise.
`run_row_start_location`	ID referencing `deadheads.deadhead_id` or `trips.trip_id` or `run_events.event_from_location_id`	Conditionally required	Identifies the first operational location or stop to be serviced by the run row. Required if `run_row_start_mid_trip` is 1. Recommended otherwise.
`run_row_start_mid_trip`	Enum	Conditionally required	Indicates whether the run piece begins the deadhead or trip at the start or middle of the respective deadhead or trip. 0 (or blank) - Row does not start mid-trip or mid-deadhead. 1 - Row starts mid-trip or mid-deadhead. Required if the run row begins with a mid-trip relief. Optional otherwise.
`run_row_end_time`	Time	Conditionally required	Identifies the time at which the run piece is finished being associated with the row's deadhead, trip, or event. Required if `run_row_end_mid_trip` is 1. Recommended otherwise.
`run_row_end_location`	ID referencing `deadheads.deadhead_id` or `trips.trip_id` or `run_events.event_to_location_id`	Conditionally required	Identifies the last operational location or stop to be serviced by the run row. Required if `run_row_end_mid_trip` is 1. Recommended otherwise.
`run_row_end_mid_trip`	Enum	Conditionally required	Indicates whether the run piece ends the deadhead or trip at the end or middle of the respective deadhead or trip. Used to denote mid-trip reliefs. 0 (or blank) - Row does not end mid-trip or mid-deadhead. 1 - Row ends mid-trip or mid-deadhead. Required if the run row ends with a mid-trip relief. Optional otherwise.
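
To make the conditional requirements in the table concrete, here is a small validation sketch. It assumes each row has been read into a dict of strings keyed by the proposed column names; the function `check_row` and its messages are illustrative only and do not come from any existing ODS tooling.

```python
def check_row(row):
    """Return a list of problems found in one runs.txt row (a dict of strings)."""
    problems = []
    if row.get("run_row_type") not in ("0", "1", "2"):
        problems.append("run_row_type must be 0 (deadhead), 1 (trip), or 2 (event)")
    for side in ("start", "end"):
        flag_name = f"run_row_{side}_mid_trip"
        flag = row.get(flag_name, "")
        if flag not in ("", "0", "1"):
            problems.append(f"{flag_name} must be blank, 0, or 1")
        # Time and location become required once the row starts or ends mid-trip.
        if flag == "1":
            for field in (f"run_row_{side}_time", f"run_row_{side}_location"):
                if not row.get(field):
                    problems.append(f"{field} is required when {flag_name} is 1")
        # The description says the flag should be blank for events (type 2).
        if row.get("run_row_type") == "2" and flag != "":
            problems.append(f"{flag_name} should be blank for events")
    return problems


# Hypothetical row: a trip that ends at a mid-trip relief but omits the relief location.
bad_row = {
    "run_row_type": "1",
    "run_row_id": "102",
    "run_row_end_time": "10:15:00",
    "run_row_end_location": "",
    "run_row_end_mid_trip": "1",
}
print(check_row(bad_row))
```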

Question: Should the start/end time/location fields be required instead of conditionally required? It would make consuming easier to be able to rely on their presence, but could make producing more complex for agencies that don't use mid-trip reliefs.

remove file or add column: `runs_pieces.txt`

Option A: _Remove runs_pieces.txt_

All the information in runs_pieces.txt is now redundant with the information in runs.txt. We propose removing the file.

Option B: _Add column to runs_pieces.txt_

The file could be kept if:

We want a place to store information about a run/piece as a whole (which runs.txt can't do since it has one row per trip/deadhead/event, instead of one row per run/piece).
- As an example, finding the start/end time of a piece requires comparing multiple rows in runs.txt, but could be done with just one row if those fields were added to runs_pieces.txt.
We want to avoid the breaking change.

If the file is kept, we propose adding a new field, service_id, to solve the run_id uniqueness problem:

Field name	Type	Required	Description
`service_id`	ID referencing `calendar.service_id`	Required	Identifies a set of dates when the run is scheduled to take place.

Adding this required field is still a breaking change, just a smaller one. (Though the breaking change could potentially be avoided with the run_code alternative below.)

We may also want to consider changing the start/end fields to better line up with runs.txt's start/end fields and make it clearer how to handle pieces that start with events, but I don't have a specific proposal for how to do that.

remove columns: deadheads.txt

Consider removing fields to_trip_id, from_trip_id, to_deadhead_id, from_deadhead_id.

This change isn't needed, but if we're making backwards incompatible changes anyway, this would clean things up and make the spec a little more cohesive.

These fields were originally added as a way to link deadheads to other trips on the run/piece/block. But runs.txt now provides a better way to find the order of trips and deadheads within a run. Also, the spec currently has some ambiguities around these fields. As a producer, it would be easier to remove these fields than to populate them.

These fields could be kept anyway if consumers find them useful, or if we want to minimize the number of breaking changes. If they are kept, they should all be made optional, as not every deadhead will have a previous/next trip.

[Note: We also propose other unrelated changes to deadheads.txt in Proposal 2.]

Non-recommended option: `run_code/piece_code`

An alternative solution that we considered for the run uniqueness problem would be to add new String fields run_code and piece_code to runs.txt and/or runs_pieces.txt. Our human-readable non-unique run ids would go as strings in these fields, and run_id would have to be a long unique ID. (The MBTA would probably use something like ${service_id}-${division_id}-${run_id}).

This solves the uniqueness problem, so the new service_id field in runs_pieces.txt would not be required. If done just right, this could potentially make the whole proposal backwards-compatible.
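
For illustration, the ${service_id}-${division_id}-${run_id} scheme mentioned above might look like the sketch below. The function names and sample values are hypothetical. The naive inverse only works when none of the parts contains a hyphen, which hints at the translation overhead this alternative would bring.

```python
def make_unique_run_id(service_id, division_id, run_code):
    """Join the three parts with hyphens to form a long unique run_id."""
    return f"{service_id}-{division_id}-{run_code}"


def split_unique_run_id(unique_id):
    """Naive inverse; raises ValueError unless there are exactly three hyphen-separated parts."""
    service_id, division_id, run_code = unique_id.split("-")
    return service_id, division_id, run_code


unique_id = make_unique_run_id("weekday", "north", "1001")
print(unique_id)
print(split_unique_run_id(unique_id))

# A service_id that itself contains a hyphen cannot be recovered by the naive split.
try:
    split_unique_run_id(make_unique_run_id("weekday-fall", "north", "1001"))
except ValueError as error:
    print("cannot split:", error)
```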

However, I think it's better to keep using non-unique run_ids with a service_id field because:

Translating to/from the new uniquified `run_id` would be a bit of a pain.
block_id is not unique either, and it's basically the counterpart of run_id.
We think a service_id is a useful field to have in most ODS files anyway, so you can more easily query for data by date.
This proposal will be a lot cleaner if we're allowed to do backwards incompatible changes, and it looks like ODS 2.0 is leaning in that direction anyway.

Questions for review:

Is the new runs.txt file okay? Is there anything in your agency it wouldn't be able to represent, and would it be easy for parties to produce as well as consume?
Is removing runs_pieces.txt okay, or should we keep it?
Are the other small backwards-incompatible changes okay, or should we go for backwards-compatible alternatives?

jeffkessler-keolis commented 8 months ago

Hi Sky,

Thank you for this incredibly detailed and well-documented proposal!

It took me a couple of tries to wrap my head around it, but if I'm reading correctly, the tl;dr is replacing runs_pieces.txt — which is inherently implicit about trips and their sequencing — with an explicit runs.txt file that enumerates the individual activities of a run.

I like this concept and practice and support the idea in general.

`runs.txt` Feedback

The term "piece" as used during the working group discussions was synonymous with the use in many scheduling systems, being the start and end of a portion of work on a given block.
- I think this definition addresses some of the concerns you flagged earlier, as the dropback example would simply model each trip (on different blocks) as their own pieces.
- This still does not address the explicit/implicit bit of being able to simplify the data and say "start block on this trip here through that trip there" vs enumerating everything.
If we're going to enumerate everything, I'd advocate for merging run_events.txt into runs.txt, too.
- The big appeal of having things explicit is that they're all in one place without needing multiple layers of cross-referencing, yet having the run_events separate would continue to bifurcate the data.
- If we did this, runs.txt effectively becomes a file that says, "you do x, from this place at this time, to this place at this time." "x" then is either a trip, deadhead, or event. The only thing missing, then, is a name and type for the event, which we could address by merging the event_type enums in both runs_pieces.txt and run_events.txt, and by adding a row_description column that not only offers a textual label for events but also lets individual operators add to the run content as they see fit.
Alongside the standardization of everything into the explicit, I think this would also warrant changing the run_row terminology in this proposal to a run_event, since working a trip or deadhead is simply a special type of an event itself (further supported by the above).

`runs.txt` Example

To give things a concrete example from https://ods.calitp.org/spec/examples/multiple-runs-single-block-midtrip-relief/, we're effectively replacing

run_id,piece_id,start_type,start_trip_id,start_trip_position,end_type,end_trip_id,end_trip_position
10000,10000-1,0,daily-deadhead-1,,1,102,mid_relief_stop
20000,20000-1,1,103,mid_relief_stop,0,daily-deadhead-2,

with

service_id,run_id,piece_id,block_id,run_event_type,run_event_id,run_event_start_time,run_event_start_location,run_event_start_mid_trip,run_event_end_time,run_event_end_location,run_event_end_mid_trip,event_desc
daily,10000,10000-1,BLOCK-A,1,daily-deadhead-1,08:00:00,Yard,0,08:30:00,FirstStop,0,
daily,10000,10000-1,BLOCK-A,1,101,08:45:00,,,08:30:00,,,
daily,10000,10000-1,BLOCK-A,1,102,,,,,mid_relief_stop,1,
daily,10000,,,5,,12:00:00,,,13:00:00,,,Lunch
daily,20000,20000-1,BLOCK-B,1,daily-deadhead-2,,mid_relief_stop,1,,,,

(added a sample Lunch event as an example).
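
As a rough sketch of what consuming such a file could look like, the snippet below reads the sample rows from this comment and lists the activities per run without any cross-referencing. It assumes the column names from the example header; the function name `activities_by_run` is made up, and `restval=""` is used so that rows with fewer fields than the header still parse.

```python
import csv
import io
from collections import defaultdict

# Sample rows taken from the example in this comment.
EXAMPLE = """\
service_id,run_id,piece_id,block_id,run_event_type,run_event_id,run_event_start_time,run_event_start_location,run_event_start_mid_trip,run_event_end_time,run_event_end_location,run_event_end_mid_trip,event_desc
daily,10000,10000-1,BLOCK-A,1,daily-deadhead-1,08:00:00,Yard,0,08:30:00,FirstStop,0
daily,10000,10000-1,BLOCK-A,1,101,08:45:00,,,08:30:00,,
daily,10000,10000-1,BLOCK-A,1,102,,,,,mid_relief_stop,1,
daily,10000,,,5,,12:00:00,,,13:00:00,,,Lunch
daily,20000,20000-1,BLOCK-B,1,daily-deadhead-2,,mid_relief_stop,1,,,
"""


def activities_by_run(text):
    """Group activity labels by (service_id, run_id), in file order."""
    runs = defaultdict(list)
    # restval fills in any trailing fields a short row leaves out.
    for row in csv.DictReader(io.StringIO(text), restval=""):
        label = row["run_event_id"] or row["event_desc"]
        runs[(row["service_id"], row["run_id"])].append(label)
    return dict(runs)


for run, labels in activities_by_run(EXAMPLE).items():
    print(run, labels)
```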

That does seem to make the lives of consumers and data analysts easier, rather than relying on the implicit calculation that many scheduling systems currently use.

`deadheads.txt` Feedback

As for deadheads.txt:

Deadheads need to retain sequencing since trips/deadheads are not always synonymous with employees, nor are blocks always defined (e.g. while a person may work DH2 after they work trip 100, DH2 may follow trip 200 on a given vehicle; the from/to fields are the only mechanism for defining the 200-DH2 link).
Deadheads need to have start/end times added both to support where trips are supposed to go and to mesh with the runs.txt format (even if these are flexible times akin to the separate duration discussion we were having elsewhere, having a baseline for times is still important).

Other

There are some other more generic things that I plan to raise in a forthcoming GH issue regarding applicability of the standard to passenger rail operations (which has complexities of trips with both revenue and deadhead components, many-to-many mappings of employees, trips, and parts thereof, etc.), but the modifications above would further support those elements with field extensions (e.g. adding an event_type enum field to specify whether the employee working on a given trip is working as a Locomotive Engineer, Conductor, Assistant Conductor, etc.).

skyqrose commented 8 months ago

That's a good summary, thanks.

runs.txt/run_events.txt:

I hadn't considered merging run_events into runs.txt. Removing another file and a ton of duplicate rows between runs and run_events would be pretty good.

Thanks for the example. Some small corrections to the example: run_event_type would be 0 for the deadhead, and it's missing some required locations+times.

Looking through run_events made me realize a potential error in the proposal: If ODS location ID and GTFS stop id can't be mixed into the same column, then runs.txt might need additional columns start_location_type and end_location_type like run_events.txt currently has (or separate start_stop_id and start_ops_location_id fields like deadhead_times.txt currently has).

I'm not sure about separate or merged row_type and event_type fields. In the original proposal, runs.txt:run_row_type is about what kind of data you're looking at and would be used for control flow of which other table to look into. Any new value there would be part of a major spec change. run_events.txt:event_type is just data, a machine-readable description field, and doesn't control anything, and could have new values added frequently. So they could be merged, but they'd be used so differently when handling the data that maybe they shouldn't be.
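
The difference could be sketched like this: run_row_type decides which table to look in, while event_type is only read as a value. The lookup dicts stand in for trips.txt, deadheads.txt, and run_events.txt, and the function `resolve` plus all sample values are hypothetical.

```python
def resolve(row, trips, deadheads, run_events):
    """Use run_row_type as control flow to pick the table that run_row_id points into."""
    row_type = row["run_row_type"]
    if row_type == "0":
        table = deadheads
    elif row_type == "1":
        table = trips
    elif row_type == "2":
        table = run_events
    else:
        # A consumer cannot handle an unknown row type without a spec change.
        raise ValueError(f"unknown run_row_type: {row_type!r}")
    return table[row["run_row_id"]]


# Hypothetical stand-ins for the three referenced files, keyed by ID.
trips = {"101": {"route_id": "1"}}
deadheads = {"daily-deadhead-1": {"block_id": "BLOCK-A"}}
run_events = {"evt-1": {"event_type": "lunch"}}

record = resolve({"run_row_type": "2", "run_row_id": "evt-1"}, trips, deadheads, run_events)
# event_type is plain data: an unfamiliar value can simply be displayed or ignored.
print(record.get("event_type"))
```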

deadheads.txt:

If an agency needs to link a deadhead and a trip as being done by the same vehicle, isn't a block the right way to do that? You said blocks don't always exist, but why would populating the new redundant from_trip_id field to link deadheads and trips be easier than using blocks?
I was assuming that the start/end time/location of deadheads would be handled by deadhead_times.txt instead of deadheads.txt, similar to how it's done in GTFS trips.txt.

skyqrose commented 6 months ago

I've opened a new issue that takes into account all the discussion from here and is built on top of #55:

https://github.com/cal-itp/operational-data-standard/issues/60
