cal-itp / operational-data-standard

The Transit Operational Data Standard is an open standard for representing the transit schedules used by drivers, dispatchers, and planners to carry out transit operations.
https://ods.calitp.org
Apache License 2.0

Supplementing Public GTFS data (trips, stops, stop_times, routes, etc.) within ODS #55

Closed jeffkessler-keolis closed 2 months ago

jeffkessler-keolis commented 8 months ago

Context

The ODS data model is based on the concept of supplementing public GTFS data with internal operational data, so that together they can model the entire network.

This, in theory, works well: schedule information for public trips is released publicly to customers, and ODS contains all of the non-public information.

Problem

This begins to break down when there is "supplemental" information about public trips that some operators release publicly but others do not. For example:

I recognize that some operations already release this information publicly in the blocks field and by using pickup_type,drop_off_type values of 1,1 (albeit with no distinction as to whether that's a stop or merely a passing time), but this public release is simply not viable for us, nor for many other operations… particularly on the rail side.

Likewise, there may be scenarios where public information differs from internal information, such as:

Clearly, if we're looking for ODS to be adopted more widely and within the rail operating space, it needs to accommodate these requirements.

Proposed Solution

In thinking of a proposed solution, I wanted something that would be extendable and could future-proof us for subsequent use cases that we might not yet have conceptualized. After bouncing around a few concepts in my head, I believe I've settled on a new standardized _supplement.txt suffix, capable of:

  1. Adding rows to the corresponding public GTFS file.
  2. Adding/replacing values in the corresponding public GTFS file where a row with the file's applicable primary key / unique identifier already exists.

Each file would use the same base name and fields in accordance with the GTFS standard.[^1]
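
As a rough illustration of the two behaviors listed above, here is a minimal Python sketch of how a consumer might apply a _supplement file to its public counterpart. The helper names (read_rows, apply_supplement) and the sample rows are made up for illustration, and the rule that a blank supplement value leaves the public value untouched is an assumption; the proposal does not specify how blanks should be treated.

```python
import csv
import io

def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def apply_supplement(base_rows, supplement_rows, key_fields):
    """Merge supplement rows into base rows, matching on the primary key."""
    merged = {tuple(row[k] for k in key_fields): dict(row) for row in base_rows}
    for row in supplement_rows:
        key = tuple(row[k] for k in key_fields)
        if key in merged:
            # Behavior 2: the key already exists, so add/replace values.
            # ASSUMPTION: blank supplement values leave the public value as-is.
            merged[key].update({f: v for f, v in row.items() if v != ""})
        else:
            # Behavior 1: the key is new, so the row is added.
            merged[key] = dict(row)
    return list(merged.values())

# Made-up sample data, not taken from any real feed.
public_stops = read_rows("stop_id,stop_name\nCentral,Central Station\n")
stops_supplement = read_rows(
    "stop_id,stop_name,stop_desc\n"
    "Central,,Crew change point\n"
    "BigTrainYard,Big Train Yard,\n"
)

merged = apply_supplement(public_stops, stops_supplement, ["stop_id"])
for row in merged:
    print(row)
```

Here the Central row keeps its public stop_name and gains a stop_desc, while BigTrainYard is added as a new row.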

This would cover all of the above use cases, and then some. For example:

Implementation Approaches for stop_times at Public AND Internal Locations

The above leads to a need to add entries to stop_times.txt at both existing public and internal locations, which itself leads to three/four interesting hypothetical options:

  1. Define that all supplement entries need to precisely mirror their GTFS counterparts, and thus the places must be defined in stops_supplement.txt.

    Add: stops_supplement.txt:

    stop_id,stop_name,stop_desc,stop_lat,stop_lon
    DoubleTrackStart,Start of Double Track,Milepost 5,40.000,-75.000
    DoubleTrackEnd,End of Double Track,Milepost 20,40.100,-75.100
    BigTrainYard,Big Train Yard,,40.105,-75.105
  2. Define that stop_times_supplement.txt supports the definition of an ops_location_id in such an eponymous column with stop_id omitted, a break from the stop_times.txt standard.

    Modify the earlier stop_times_supplement.txt example to:

    trip_id,stop_id,ops_location_id,stop_sequence,arrival_time,departure_time,pickup_type,drop_off_type
    trip3,,DoubleTrackStart,18,13:40:00,13:40:00,1,1
    trip3,,DoubleTrackEnd,18,14:00:00,14:00:00,1,1
    trip3,,BigTrainYard,18,14:05:00,14:05:00,3,3
  3. Merge the ops_locations.txt file into stops_supplement.txt and just treat all ops_locations as added stops.

    • This mirrors the example from approach 1 above.
    • The biggest implication from this is it means ops_locations and stops must have mutually-exclusive IDs, although I don't think that's a terrible thing in the grand scheme of things.
  4. Take approach 3 a step further and merge all of the analogous files (deadheads.txt, ops_locations.txt, deadhead_times.txt) into their _supplement counterparts (i.e. trips_supplement.txt, stops_supplement.txt, stop_times_supplement.txt).

    • This has the benefit of reducing the additional files and structures being added in ODS for the portions where we're simply adding internal equivalents, and allows us to piggyback on the existing standard.

    • One risk of this approach is that it leaves us susceptible to a breaking change should we ever implement an ODS-standard extension (e.g. adding an ops_location_field to stops_supplement.txt) that conflicts with a future GTFS change, but I'd advise against any such additional fields in general.

    • This approach would allow us to eliminate all of the definitions and conditional requirement fields that add to the complexity of some of the supplemental fields by placing all trips, stops, and stop_times in a single merged datasource.

    • Validation between internal and external fields becomes easy: one can simply verify that all IDs listed are unique.

    • The learning curve for individuals looking to implement ODS becomes lower, as it's simply modeling the internal trips in a supplemental file vs calling otherwise analogous items by distinct names depending on the context.
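
The ID check mentioned under approaches 3 and 4 could look something like the following Python sketch. It reads one possible interpretation into the proposal: an ID in stops_supplement.txt that matches a public stop_id is an override, any other ID is an added internal location, and an ID repeated within the supplement itself is flagged. The function name and the public stop IDs are hypothetical; the supplement rows are the ones from approach 1.

```python
import csv
import io
from collections import Counter

def classify_supplement_ids(public_ids, supplement_ids):
    """Split supplement IDs into duplicates, overrides, and additions."""
    counts = Counter(supplement_ids)
    # An ID listed more than once in the supplement is ambiguous.
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    # IDs that already exist publicly modify the public row.
    overrides = sorted(set(supplement_ids) & set(public_ids))
    # IDs unknown to the public feed are internal-only locations.
    additions = sorted(set(supplement_ids) - set(public_ids))
    return duplicates, overrides, additions

# stops_supplement.txt rows from approach 1.
supplement_text = (
    "stop_id,stop_name,stop_desc,stop_lat,stop_lon\n"
    "DoubleTrackStart,Start of Double Track,Milepost 5,40.000,-75.000\n"
    "DoubleTrackEnd,End of Double Track,Milepost 20,40.100,-75.100\n"
    "BigTrainYard,Big Train Yard,,40.105,-75.105\n"
)
supplement_ids = [r["stop_id"] for r in csv.DictReader(io.StringIO(supplement_text))]

# Hypothetical public stop IDs, for illustration only.
public_ids = ["Central", "Riverside"]

duplicates, overrides, additions = classify_supplement_ids(public_ids, supplement_ids)
print("duplicates:", duplicates)
print("overrides:", overrides)
print("additions:", additions)
```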

I realize approach 4 would be a relatively major/breaking change to the standard, which I'd normally reject, but it might be worth considering since (a) the standard is not yet widely adopted, and (b) those who have implemented the standard would only need to change file/column labels vs adding new logic (some of which — e.g. the comparative field ID values — could be eliminated entirely, thereby further reducing complexity and reducing the barrier for implementation).


Curious for everyone's thoughts/input on this. I not only see this being useful in the context of modeling runs (and required to do so for our operations), but also see it being valuable for growing support for the standard: many operations that may not yet be ready to implement full run-modeling in ODS could have a use case for modeling deadheads and trips with internal locations (e.g. Rail AVL systems where we don't care who's working the trip, we just care about the trip, its waypoint times, its cycle, etc., but have been unable to use GTFS given the need to combine the two elements). That could further solidify the standard's role in the industry and help drive widespread adoption.

[^1]: We could also consider an optional _NEW field suffix for changing a field's PK value, but I am disinclined to do so as (1) I don't foresee there being a compelling need, and (2) there is a major risk of downstream propagation issues by implementing a PK change.

skyqrose commented 8 months ago

So to summarize:

Instead of making new files to describe internal-only trips and stops, publish a diff of rows+columns to add/change to the existing GTFS files.

I really like this! It's such an elegant solution. It smooths out a bunch of awkward parts of ODS files, and makes it really obvious how to do any future extensions (official or custom) for new fields.

I lean towards option 4. If we do it for any file, we might as well lean into it.

I'd want to go through all the old discussions to make sure there are no use cases that this makes impossible to represent, but off the top of my head I don't see anything it would break.

Q1: Will there be any specific new column names that ODS needs to standardize, beyond columns already listed in GTFS? I'm thinking:

Q2: If this is meant to modify an existing GTFS feed, but it's published separately, then it matters that you're applying it to the correct GTFS file. Maybe ODS needs a metadata file that references GTFS's feed_info.feed_version.

Q3: Would this allow supplementing any file and column, or only a small allowlist of ODS-approved files and columns? Like, if an agency wants to write internal pathways or something unexpected like that, is that allowed by this spec? Are consumers expected to handle it?

jeffkessler-keolis commented 8 months ago

Thanks, Sky! To your questions:

[Q1] Yes, this is certainly an option we could pursue, albeit with the caution/risk that an equivalently-named field is added to GTFS that would interfere with this use case.

Overall, to this question, I think it would be worth maintaining a list of standardized supplemental fields and extensions in ODS, which would also help bolster the future case for avoiding any collisions with subsequent GTFS additions.

[Q2] This is a valid point, although I could see cases where the two are decoupled and an ODS value is valid on the same public IDs across versions, or where the reverse is true. My thought is that valid approaches would be:

[Q3] I don't see any reason why the standard couldn't support additional fields/files akin to how GTFS currently treats such supplemental fields/files in the base files. However, beyond a requirement to at worst ignore the extraneous data and proceed, I think the obligation for a consumer to support these extensions would depend on the context and use cases of the given consumer's application/tool.

skyqrose commented 6 months ago

So far we've discussed adding and changing rows. Is there a way to remove rows?

A couple of examples of situations where we might want to remove data:

  • We have a couple of through-routed trips, where a bus does one route and then continues through to another route. We show these to riders as two separate trips on their respective routes, but operationally they're more like a single trip. We'd want 2 trips in the public GTFS file, and 1 in the ODS data after merging.
  • Sometimes during planned disruptions (construction) we make up trips for GTFS which approximately reflect the service we'll run, but these trips are just a useful fiction for the public. Internally, we have a different representation of service, so we would want to remove those trips in the internal merged feed.

One potential way to do that: All _supplement files could have an optional delete column. If that column is set to 1, then instead of merging the data, that row is deleted. If it's 0 or blank, the data is added/changed. The column wouldn't appear in the merged data.
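
A minimal Python sketch of that delete column, using the through-routed case as sample data. The function name, trip IDs, and route IDs are invented for illustration, and treating blank values as "leave unchanged" is an assumption rather than something settled in this thread.

```python
import csv
import io

def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def apply_supplement_with_delete(base_rows, supplement_rows, key_field,
                                 delete_field="delete"):
    """Merge supplement rows, dropping any row whose delete column is 1."""
    merged = {row[key_field]: dict(row) for row in base_rows}
    for row in supplement_rows:
        row = dict(row)
        # The delete column is consumed here and never reaches the merged data.
        flag = row.pop(delete_field, "")
        key = row[key_field]
        if flag == "1":
            merged.pop(key, None)  # delete the matching row, if any
        elif key in merged:
            # ASSUMPTION: blank values leave the public value unchanged.
            merged[key].update({f: v for f, v in row.items() if v != ""})
        else:
            merged[key] = row
    return list(merged.values())

# Made-up trips: two public halves of one through-routed operational trip.
public_trips = read_rows(
    "trip_id,route_id\n"
    "through_a,1\n"
    "through_b,2\n"
    "regular,1\n"
)
trips_supplement = read_rows(
    "trip_id,route_id,delete\n"
    "through_a,,1\n"
    "through_b,,1\n"
    "through_ab,1,\n"
)

merged_trips = apply_supplement_with_delete(public_trips, trips_supplement, "trip_id")
for row in merged_trips:
    print(row)
```

In this sketch a consumer would also need matching deletions in stop_times_supplement.txt, since the public stop_times rows for the removed trips would otherwise reference trip_ids that no longer exist.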

jeffkessler-keolis commented 6 months ago

I'm a bit torn on this functionality. Provided there's no requirement for all trips to be assigned to operators, I see a case where these trips could simply (a) go unassigned and be ignored by a consumer, or (b) be removed by splitting the applicable trips into a separate service_id as needed in trips_supplement.txt, and removed from a given day by an exception_type 2 entry in calendar_dates_supplement.txt. Yes, consumers would need to check another _supplement file, but the mechanics of doing so should be fairly generalizable.

If we were to include a "delete" command, I like the mechanics of adding an optional delete column that removes a matching row when set to 1, but does nothing in any other instance. This would obviously preclude GTFS producers from including a synonymous delete column in their public GTFS (perhaps an argument to use ODS_delete), but I don't foresee a future conflict.

safrazier17 commented 6 months ago

I think out of an abundance of caution it would be smart for us to put "ods_" as a prefix for any column we add to the supplements. This will also make it obvious on visual inspection what is not coming from the original spec.

safrazier17 commented 6 months ago

To clarify, are you still proposing to add routes_supplement.txt in addition to supplements for trips, stops, and stop_times? If yes, does that cover everything that we would need supplemental files for? @jeffkessler-keolis

jeffkessler-keolis commented 6 months ago

I'm going to say something that may be a bit antithetical to the consumers, but I don't think we need to be prescriptive as to the _supplement files supported. Theoretically, there's no reason why any GTFS file could not be modified in this fashion, be it with additions or overriding by the filename's eponymous _id field.

The same even holds true for experimental files, such as the MBTA's multi_route_trips.txt (which indicates trips that should be displayed on timetables beyond their specific route); there's no reason why the same structure couldn't be applied to modify the public version of this file for internal consumption.

Obviously there are risks/concerns to this from a consumer side of knowing what files need to be implemented, but I think one could say that any file on which one needs to rely in the GTFS data for an ODS purpose could be modified via the _supplement standard. i.e. If you need it for GTFS, you should be prepared to have it modified via ODS.

Realistically, I believe trips, stops, stop_times, and routes are the primary files that one would reasonably expect to change via ODS, but to the extent a consumer may wish to rely on another GTFS file for their instance / application / use case, they should expect that the file can be added to via new rows or have fields overridden via a row with a matching primary key in an applicable _supplement file.
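
To make that concrete, here is a rough Python sketch of a consumer that pairs each _supplement file with its public counterpart by filename and merges on that file's primary key. The PRIMARY_KEYS table covers only the four files named above (using the primary keys from the GTFS reference), and skipping files the consumer does not recognize is one possible policy, not a requirement from this thread. The function names, the in-memory dicts standing in for feed directories, and the sample rows (including the Central stop) are all illustrative assumptions.

```python
import csv
import io

# Primary keys for the files this hypothetical consumer relies on.
PRIMARY_KEYS = {
    "routes.txt": ("route_id",),
    "trips.txt": ("trip_id",),
    "stops.txt": ("stop_id",),
    "stop_times.txt": ("trip_id", "stop_sequence"),
}
SUFFIX = "_supplement.txt"

def read_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def merge_feed(public_files, ods_files):
    """Apply every recognized *_supplement.txt to its public counterpart."""
    merged_feed = {name: read_rows(text) for name, text in public_files.items()}
    for name, text in ods_files.items():
        if not name.endswith(SUFFIX):
            continue  # not a supplement file
        target = name[: -len(SUFFIX)] + ".txt"
        keys = PRIMARY_KEYS.get(target)
        if keys is None:
            continue  # file this consumer does not use: ignore and proceed
        rows = {tuple(r[k] for k in keys): r for r in merged_feed.get(target, [])}
        for r in read_rows(text):
            key = tuple(r[k] for k in keys)
            if key in rows:
                # ASSUMPTION: blank values leave the public value unchanged.
                rows[key].update({f: v for f, v in r.items() if v != ""})
            else:
                rows[key] = r
        merged_feed[target] = list(rows.values())
    return merged_feed

# In-memory stand-ins for the two feeds (sample rows are illustrative).
public_files = {
    "stop_times.txt": (
        "trip_id,stop_id,stop_sequence,arrival_time,departure_time\n"
        "trip3,Central,17,13:30:00,13:31:00\n"
    ),
}
ods_files = {
    "stop_times_supplement.txt": (
        "trip_id,stop_id,stop_sequence,arrival_time,departure_time,pickup_type,drop_off_type\n"
        "trip3,DoubleTrackStart,18,13:40:00,13:40:00,1,1\n"
    ),
    "pathways_supplement.txt": "pathway_id,from_stop_id,to_stop_id\n",
}

feed = merge_feed(public_files, ods_files)
for row in feed["stop_times.txt"]:
    print(row)
```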