add persistent id's - Githubissues

BTollison commented 9 months ago

This proposal is intended to allow data to be sent between systems without losing context when vendors add or replace key values. The goal is to enable applications to have a circular workflow where, for example, running time is captured in a CAD/AVL system and can be easily aggregated by a scheduling system to make schedule adjustments.

USE: When a persistent ID is not present, the first software to export this data will generate one. From that point forward, this ID will persist. When there is no value present on import, the software must find a unique value to add. All values are integers (perhaps INT64?).

stops.txt

Add ods_stop_id

routes.txt

Add ods_route_id

trips.txt

Add ods_trip_id

shapes.txt

Add ods_shape_id

ops_locations.txt

Add ods_ops_location_id

rosters.txt (proposed in issue #45 )

Add ods_roster_id
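
A minimal sketch of the USE rule above, assuming the IDs are plain integers stored as an extra column. The allocation strategy (highest existing value plus one), the function name, and the sample rows are all hypothetical; the proposal only says that a missing value must be filled with a unique one and that existing values persist.

```python
import csv
import io


def assign_persistent_ids(rows, id_field):
    """Fill in missing persistent IDs; never change an ID that is already set."""
    # Collect the integer IDs already present so new ones cannot collide.
    used = {int(row[id_field]) for row in rows if row.get(id_field)}
    # Assumption: "highest existing ID + 1" is an acceptable way to pick a unique value.
    next_id = max(used, default=0) + 1
    for row in rows:
        if not row.get(id_field):
            row[id_field] = str(next_id)
            next_id += 1
    return rows


# Hypothetical stops.txt in which only the first stop already has an ods_stop_id.
stops_txt = """stop_id,stop_name,ods_stop_id
A1,Main St,7
B2,Oak Ave,
C3,Pine Rd,
"""

rows = list(csv.DictReader(io.StringIO(stops_txt)))
assign_persistent_ids(rows, "ods_stop_id")
for row in rows:
    print(row["stop_id"], row["ods_stop_id"])
```

Running the function a second time on the same rows changes nothing, which is the property the proposal depends on: once an ID has been written, later exports carry it forward unchanged.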

skyqrose commented 9 months ago

Am I correct in understanding that this is to solve the problem of an external system importing GTFS data, and forgetting the GTFS stop_id (or route_id, etc) because the external system has their own incompatible stop_id field?

If an external system is capable of saving the new ods_stop_id to compare it to GTFS/ODS files in the future, then it should also be capable of saving the existing stop_id in some new gtfs_stop_id field, and comparing based on that. We already have a persistent id to compare between GTFS files, it's stop_id, and I don't think that adding a new field to GTFS would be the right way to handle other applications not handling GTFS stop_ids correctly.

safrazier17 commented 9 months ago

We already have a persistent id to compare between GTFS files, it's stop_id, and I don't think that adding a new field to GTFS would be the right way to handle other applications not handling GTFS stop_ids correctly.

This sounds right to me. Is there an example dataset you could give in support of the use case you laid out @BTollison? For the GTFS files/fields listed in the OP, those don't exist independently in ODS. They are only referenced from the GTFS feed (which we presume will have been generated and packaged alongside the ODS files in most cases).

Edit: all of which is to say, in rare occasions those IDs may change over time in successive versions of a GTFS feed, but they shouldn't be changed unthinkingly or be assigned arbitrarily by a producing system.

GTFS Best Practices gives this guidance: Maintain persistent identifiers (id fields) for stop_id, route_id, and agency_id across data iterations whenever possible. I agree with Sky that this is an area in which enforcing the proper use of the spec seems to be the way forward.

BTollison commented 9 months ago

@skyqrose It's true that these systems should be able to save these id's (boy wouldn't that be nice). The goal is to allow, for example, a scheduling system to send data to network planning software and vice versa, but also from CAD/AVL back to scheduling / network planning.

The problem we encounter now is two parts:

1) All GTFS is generated at the post scheduling system level (such as CAD/AVL) because then you're able to generate real-time feeds. Because of that process our beloved id's from the systems before it are thrown out in favor of the system that generates real-time data. I've seen this many times with many systems. It makes it impossible to reconnect the data back to the systems that generated the static information.

2) The data types for id's are sometimes incompatible, and so the lazy way vendors appear to have solved this is by simply replacing all the id's with their own.

I'm not totally confident that forcing vendors to keep an id the same would solve this because of where the data is often generated. I suppose if it's explicit that GTFS/ODS require vendors to keep id's, and we use it as our export from scheduling to operations then in theory it can be enforced?

Really open to suggestions here, I would love for a way to make this possible. I think it's easily one of the biggest issues that make life hard for the network planning and scheduling teams to make use of data, but also for teams building operational software that are losing context that was in the scheduling system but can't reconnect with it because primary keys have changed.

By the way, for what it is worth, I envision one import/export as a combo GTFS/ODS at one time instead of 2 from our vendors.

skyqrose commented 9 months ago

Hm, a couple ideas:

Could GTFS use the ids from the scheduling system?
Could static GTFS be generated earlier in the process, where it has access to upstream ids, or where some downstream data users could now have access to canonical GTFS ids? (I think it should be fine if static GTFS is made significantly upstream from GTFS-RT.)
Could you use a nonstandard column in your GTFS data to save scheduling system ids alongside GTFS ids?
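
As a rough illustration of the third idea, here is a sketch that reads a stops.txt carrying an extra, nonstandard column and builds a lookup from the scheduling system's id back to the GTFS stop_id. The column name `scheduling_stop_id`, the helper name, and the sample rows are assumptions made up for this example; nothing in GTFS defines them.

```python
import csv
import io

# Hypothetical stops.txt with a nonstandard column holding the upstream id.
stops_txt = """stop_id,stop_name,scheduling_stop_id
1001,Main St,SCH-17
1002,Oak Ave,SCH-42
"""


def index_by_upstream_id(text, gtfs_field, upstream_field):
    """Map each upstream (scheduling system) id to the GTFS id on the same row."""
    reader = csv.DictReader(io.StringIO(text))
    # Rows without an upstream id are skipped rather than mapped to an empty key.
    return {
        row[upstream_field]: row[gtfs_field]
        for row in reader
        if row.get(upstream_field)
    }


upstream_to_gtfs = index_by_upstream_id(stops_txt, "stop_id", "scheduling_stop_id")
print(upstream_to_gtfs["SCH-17"])
```

GTFS consumers that do not know about the extra column can still read the file as usual, so the public feed stays usable while downstream tools gain a way to reconnect to the scheduling system's keys.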

BTollison commented 9 months ago

If we had the ability to send GTFS / ODS from planning <> scheduling to operations software, then in theory that fixes it so long as the vendor is forced to retain the id's. I want to avoid adding custom fields as much as possible, because the less customization we have to do, the cheaper it is for everyone :)

jeffkessler-keolis commented 9 months ago

(In drafting this reply, I went from strenuously objecting to the notion to largely being in support of the need, so bear with me.)

I think there are a few issues at play that need to be decoupled for the sake of a broader discussion before a solution can be reached. In my mind, these are:

IDs in some systems are not modifiable by the user.
IDs sometimes need to change to support operational changes.
Operators sometimes have discrepancies between public and internal values.
Static IDs in stops, routes, trips, etc. are all — by their very nature — public values.
Discrepancies result in not always having a 1:1 mapping between public and internal values.

The first two are perhaps the easiest to address: make that a requirement in the vendor systems. Even if there's a database identifier that exists, there's no reason they can't support some additional text field to serve as a static identifier that can be persistent across versions.

The final three are the ones that present a bit more of a challenge, and have implications for the prior two. For example, in our North American rail operation, there are certain distinct lines in our scheduling system that are advertised to the public as operating on the same line. Likewise, there are certain lines that are operationally identical for internal purposes, but distinct to the public.

Right now, we use our scheduling system to publish our GTFS and have built custom attributes on various stop objects that allow us to use a separate value and/or override a value for the sake of persistent stop values. We also have various formulas that compute the public GTFS route_id for each trip, based upon the internal route identifier and other relevant bits. The internal data is what we pass along to our AVL system, but included alongside of it are the static GTFS values our AVL system should use when publishing GTFS-RT to match the public data.

In order to support such a pass-through, we'll need some way to handle the differences. The discrepancy increasingly leads me to the view that it might be worth continuing a practice many of us already undertake: publishing a "public" GTFS feed with the main GTFS values, and then a shadow "internal GTFS" feed that uses the nomenclature in our root scheduling system that can match what is passed through ODS.

The challenge then becomes, as alluded to in @BTollison's proposal, how to map internal values to public values. Given that we can't rely on modifications to the GTFS standard, it seems some sort of ods_gtfs_lookup.txt file with the columns field, internal_id, public_id would be sensible. Where an entry is defined, ODS references to a given internal value are mapped to the corresponding public value, as in the example below.

field,internal_id,public_id
route,XA,Fake Train Line
route,XB,Fake Train Line
stop,TERMINAL,12345

I am not completely sold on the approach, but it's the best one that I've been able to resolve in my mind that can address this need. Perhaps a worthy contender for discussion with the larger working group, if we were to resurrect it.
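
A minimal sketch of how a consumer might apply the ods_gtfs_lookup.txt idea above, using the three example rows from this comment. The function names are hypothetical, and falling back to the internal value when no entry is defined is an assumption; the comment only describes what happens where a mapping is defined.

```python
import csv
import io

# The example rows from the comment above.
lookup_txt = """field,internal_id,public_id
route,XA,Fake Train Line
route,XB,Fake Train Line
stop,TERMINAL,12345
"""


def load_lookup(text):
    """Build a {(field, internal_id): public_id} table from the lookup file."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        table[(row["field"], row["internal_id"])] = row["public_id"]
    return table


def to_public(table, field, internal_id):
    """Translate an internal value to its public value.

    Assumption: when no mapping is defined, the internal value is already
    the public one and is returned unchanged.
    """
    return table.get((field, internal_id), internal_id)


lookup = load_lookup(lookup_txt)
print(to_public(lookup, "route", "XA"))
print(to_public(lookup, "route", "XB"))
print(to_public(lookup, "stop", "TERMINAL"))
```

Both XA and XB resolve to the same public route, so the mapping is many-to-one. Going from a public value back to an internal one is therefore not unique in general, which matches the point above about not always having a 1:1 mapping.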

BTollison commented 9 months ago

Alright, I like where you're going with this. I think those are totally valid points actually.

I think the setup you have proposed works fine, unless there is a wish to make the field an enum? But I think yours is more flexible, since we can list the id (e.g. stop_id) and present both the given ID value and the one that you want to tie it to.

The challenge I see here is that this is perhaps more difficult for vendors to implement because they will need to give us the flexibility to allow us to map fields. I'm hoping to have a solution here that doesn't require an IT person to help all the time.

westontrillium commented 8 months ago

Weston from Trillium/Optibus here.

Adding an ods-specific alternative id to each relevant field in both ODS and GTFS does not seem tenable to me. If permanence of original GTFS ids cannot be reasonably enforced, I’d recommend a solution like the ods_gtfs_lookup.txt file @jeffkessler-keolis proposed, while also promoting id consistency between systems as a best practice (this is about data standardization and interoperability, after all).

The function of “ods_gtfs_lookup.txt” looks very similar to the translations.txt file in GTFS, so I’d take cues from there as well.

safrazier17 commented 6 months ago

My understanding is that this issue has been resolved (or at least absorbed) in full by the proposal #55, which should collapse the previous distinction between ODS ids and GTFS ids.

I am closing this issue and @BTollison or @jeffkessler-keolis can correct me if I'm mistaken.

cal-itp / operational-data-standard

add persistent id's #44