OpenTransport / gtfs-csv2rdf

Mapping script which transforms GTFS CSV into GTFS RDF (turtle, jsonld or ntriples)
http://vocab.gtfs.org
MIT License

URI persistence strategy #1

Closed by pietercolpaert 9 years ago

pietercolpaert commented 10 years ago

URIs with eternal semantics

It's not difficult to transform a GTFS CSV archive into GTFS RDF once. It is, however, very difficult to do it twice:

How are we going to make sure that the identifiers we use for trips, routes, stops, stop_times, and so forth keep the same semantics after a second mapping? E.g., a stop with id 1 in the first dataset may have id 2 in the next version. When mapping the next version, the URI generated from id 1 will then identify a different stop.

We can imagine that for stops this may be solved by reconciling the dataset instead of relying on the "id" column, e.g., by using a combination of the name and the location to find the URI to be used when mapping that data. For trips, routes, and stop_times, however, it's a more difficult story.
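A minimal sketch of that reconciliation idea for stops, in Python. The matching key (normalized stop_name plus coordinates rounded to four decimals) and the function names are assumptions made for illustration; this mapper does not implement them.

```python
def stop_key(row, precision=4):
    """Build a reconciliation key from stop_name and rounded coordinates."""
    # Lowercase and collapse whitespace so cosmetic name edits still match.
    name = " ".join(row["stop_name"].lower().split())
    lat = round(float(row["stop_lat"]), precision)
    lon = round(float(row["stop_lon"]), precision)
    return (name, lat, lon)


def reconcile_stops(known_uris, new_rows, mint):
    """Map each stop_id in the new feed to a URI.

    known_uris: dict of stop_key -> URI from the previous mapping run.
    mint: callable(row) -> URI, used for stops that match nothing known.
    """
    result = {}
    for row in new_rows:
        key = stop_key(row)
        result[row["stop_id"]] = known_uris.get(key) or mint(row)
    return result
```

Rounding is a crude way to match locations: two readings of the same stop that fall on either side of a rounding boundary will not match, so a real reconciliation step would probably need a distance threshold instead.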

Suggestion to the GTFS community

I don't have a real solution for this problem in this mapper. It is a data problem with GTFS CSV: it is impossible to open up this internal model to an open world where we need GUIDs which remain the same. To that end, I would like to change the GTFS specification itself to use GUIDs instead of local IDs within its CSV files. This adds a big new responsibility for the data maintainer, yet making this investment worldwide may be worth it.

So what needs to change? The base of the URI that we use in this mapper is http://gtfs.org/{something}/{feedname}/. This gives every local ID in the GTFS file a globally unique identifier. Yet we still need to introduce persistence. This is only possible if we require the same stops to have the same IDs across different versions of the file, which is not a requirement at the moment.
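To make the base-plus-local-ID idea concrete, here is a rough Python sketch. The path layout (`{base}{entity}/{id}`) and the name `mint_uri` are assumptions for illustration, not the mapper's actual scheme; the base is passed in because the template above leaves `{something}` open.

```python
from urllib.parse import quote


def mint_uri(base, entity, local_id):
    """Join a feed base URI, an entity type and a percent-encoded local ID."""
    if not base.endswith("/"):
        base += "/"
    # safe="" also encodes "/" so an ID cannot add extra path segments.
    return f"{base}{entity}/{quote(str(local_id), safe='')}"
```

The function is deterministic, so the resulting URI is exactly as persistent as the local ID it is built from: if stop 1 becomes stop 2 in the next version of the feed, the URI changes with it.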

pietercolpaert commented 10 years ago

I've posted this call on the GTFS mailing list: https://groups.google.com/forum/#!topic/gtfs-changes/Z8Mf31MaZms

elf-pavlik commented 10 years ago

We used UUIDs for the GTFS feed of Matera. Of course, personally I prefer URIs, but UUIDs are a small step for people used to local numeric IDs.

pietercolpaert commented 10 years ago

If their local IDs are persistent, we can easily create globally unique identifiers by prepending http://gtfs.org/{feed name}

pietercolpaert commented 10 years ago

I'm now going to suggest having persistent URIs per feed version: http://data.gtfs.org/{feed_name}/{feed_version} becomes the base
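A small Python sketch of what a per-version base could look like. The `stops/` path segment, the function names, and the version strings in the usage lines are hypothetical; only the `http://data.gtfs.org/{feed_name}/{feed_version}` template comes from the suggestion above.

```python
from urllib.parse import quote


def versioned_base(feed_name, feed_version, root="http://data.gtfs.org/"):
    """Build the per-version base URI: {root}{feed_name}/{feed_version}/."""
    return f"{root}{quote(feed_name, safe='')}/{quote(feed_version, safe='')}/"


def stop_uri(feed_name, feed_version, stop_id):
    """URI of a stop scoped to one feed version ('stops/' is an assumed segment)."""
    base = versioned_base(feed_name, feed_version)
    return f"{base}stops/{quote(str(stop_id), safe='')}"


# The same local ID in two versions yields two distinct URIs,
# so a reused ID can no longer overwrite an earlier stop.
v1 = stop_uri("demo", "2014-01", "1")
v2 = stop_uri("demo", "2014-02", "1")
```

The trade-off is that nothing in these URIs says whether `v1` and `v2` denote the same real-world stop; that link would have to be stated separately.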

pietercolpaert commented 9 years ago

Opened a suggestion on the GTFS mailing list: https://groups.google.com/forum/#!topic/gtfs-changes/ZPxhYMoNr0U