ODOT-PTS / GTFS-ride

GTFS-ride is an open standard for storing and sharing fixed-route transit ridership data.
https://gtfsride.org
Apache License 2.0

GTFS-ride aggregation and GTFS versioning #25

Open antrim opened 5 years ago

antrim commented 5 years ago

More documentation and consideration of how this data should/would be managed, stored, and amended over the course of time would be useful.

johanricher commented 5 years ago

Why not just use git? WRI and AFD are experimenting with this by proposing a GitLab instance as a data infrastructure for GTFS-producing projects in developing countries: http://git.digitaltransport4africa.org/

Versioning of the GTFS is built-in! (Example for Accra)

e-lo commented 5 years ago

Re Versioning:

It seems to me that versioning of "history" (which is GTFS-RIDE's main function) is a problem with fewer dimensions than versioning different potential futures. Just two that I can think of, in fact:

  1. correcting an error (e.g. "oops, it should be 500 not 5000")
  2. updating a format/specification (e.g. we are going to add a column about cats on buses called "cats_onboard")

In the instance of a specification change, I would say that it should be up to the interpreting software to be able to verify and accept different specification versions; AND, per GTFS theory, any new specification should be backward compatible. Therefore I see less of a reason to maintain previous specification versions.

In the case of correcting an error, I do believe it is very important to keep old versions because people will have referenced them.

Therefore, it is fairly straightforward to keep a file in a single git repo with a single branch, advancing (and appropriately tagging) for whichever type of commit you are making (error vs format).
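As a rough sketch of what "appropriately tagging" could look like, the snippet below bumps a two-part tag depending on the commit type. The `v<format>.<correction>` scheme and the `next_tag` helper are assumptions made up for illustration; nothing in GTFS-RIDE prescribes them.

```python
def next_tag(previous_tag, commit_kind):
    """Return the next tag in a hypothetical 'v<format>.<correction>' scheme.

    commit_kind is "format" for a specification change
    or "error" for a correction to previously published data.
    """
    format_version, correction = (int(p) for p in previous_tag.lstrip("v").split("."))
    if commit_kind == "format":
        # A new specification version restarts the correction counter.
        return f"v{format_version + 1}.0"
    if commit_kind == "error":
        # A correction keeps the format version and advances the counter.
        return f"v{format_version}.{correction + 1}"
    raise ValueError(f"unknown commit kind: {commit_kind!r}")
```

Each returned name would then be applied to the corresponding commit with an ordinary `git tag`, so anyone who referenced an older version can still check it out.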

BUT.....

GTFS-RIDE is also a format that can be used for "the future" and here is where I think it gets really dicey (as you can see in the preso that @antrim referenced). SO many potential dimensions. Le sigh.

e-lo commented 5 years ago

W.r.t. "best practice" around file management, I think we should consider that smaller file sizes are better in general because:

  1. git doesn't like big files.
  2. opening a file once it is written should only be done if you are correcting an error; otherwise you might create a new error.
  3. it is easier to spot "diffs"
  4. they are easier to move around

In other words, we should be storing files at the size at which they are created: likely one per day, IMHO, and then again at regular aggregation intervals.
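A minimal sketch of that workflow, assuming each daily file is a CSV with `stop_id`, `boardings`, and `alightings` columns (a simplified subset chosen for illustration, not the full board_alight.txt layout). The `aggregate_boardings` helper is hypothetical.

```python
import csv
import io
from collections import defaultdict


def aggregate_boardings(daily_files):
    """Sum boardings and alightings per stop across daily CSV files.

    daily_files: iterable of open text file objects, one per service day.
    The daily files are only read, never rewritten.
    """
    totals = defaultdict(lambda: {"boardings": 0, "alightings": 0})
    for f in daily_files:
        for row in csv.DictReader(f):
            stop = totals[row["stop_id"]]
            stop["boardings"] += int(row["boardings"])
            stop["alightings"] += int(row["alightings"])
    return dict(totals)


# Two in-memory "daily files" standing in for files in the repo.
day1 = io.StringIO("stop_id,boardings,alightings\nA,5,2\nB,3,4\n")
day2 = io.StringIO("stop_id,boardings,alightings\nA,7,1\n")
aggregated = aggregate_boardings([day1, day2])
```

The aggregate is written out as a separate file at each interval, so a later correction touches only the one small daily file and the aggregate can be regenerated from it.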

carletop commented 5 years ago

As in the presentation from @e-lo, this is an issue for which GTFS-ride doesn't have a solution. The current best practice which the project team has been using to create pilot GTFS-ride feeds follows the process in the comment from @ODOT-RPTD-mb referenced above.

The most glaring issue arises when a new feed is published to correct an error in a previous feed. There is currently no mechanism to indicate which feed the ridership data should be associated with when dates overlap. The most recently published feed is assumed "active" from its date until a subsequently published feed supersedes it. It seems this issue stems from the fact that GTFS is intended to be a forward-looking plan for anticipated services, but GTFS-ride needs a historical account of the services which were actually offered. The frequent publishing of new GTFS feeds is another issue, since it means handling many large GTFS-ride feeds, which is cumbersome.

It seems a merged, corrected "GTFS-retro" feed is what is desired. Using GTFS-realtime together with GTFS to create such a dataset was an intriguing idea, but it is probably still far off. I like the git idea as well, but this sounds like a broader issue with GTFS practices than one that can be solved here. @antrim should this issue be closed or do you feel that more action is needed here?
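To make the "most recently published feed is active" assumption concrete, here is a small sketch of the lookup it implies. The `active_feed` helper, the feed ids, and the dates are all made up for illustration.

```python
from bisect import bisect_right
from datetime import date


def active_feed(feeds, service_date):
    """Return the id of the feed assumed active on service_date.

    feeds: list of (publish_date, feed_id) tuples sorted by publish_date.
    Returns None if no feed had been published by service_date.
    """
    publish_dates = [published for published, _ in feeds]
    # Index of the first feed published strictly after service_date.
    i = bisect_right(publish_dates, service_date)
    return feeds[i - 1][1] if i else None


# Hypothetical publication history.
feeds = [
    (date(2019, 1, 1), "feed-2019-01"),
    (date(2019, 3, 15), "feed-2019-03"),
]
```

With this data, ridership recorded on 2019-02-10 is associated with `feed-2019-01`. If `feed-2019-03` was in fact published to correct February service, the lookup has no way to know that, which is the gap described above.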

scrudden commented 5 years ago

@carletop I have been thinking about assessing demand based on GTFS-ride data. One thing that would help to estimate demand is an update on the demand for the previous vehicle to pass the same stop.

This led me to think about the issue you describe here and in particular your comments about GTFS as being forward-looking and GTFS-ride looking back.

I have in the past used CapMetrics as a source of data for working on predictions. The way they gathered the data from GTFS-realtime vehicle locations and posted it to Git was very useful.

It would be very useful if there were a standard way of providing the data corresponding to a row of board_alight.txt in real time (on doors close). This could then be archived in Git along with the currently active GTFS, and the Git repository and the realtime feed could in turn be used to further inform demand predictions.

Is this something that would be possible using current APC systems?

barbeau commented 5 years ago

@scrudden One key challenge for archiving occupancy from GTFS-realtime today is that GTFS-rt only supports a high-level enumeration of occupancy with values like "MANY_SEATS_AVAILABLE, FEW_SEATS_AVAILABLE", etc.: https://github.com/google/transit/blob/master/gtfs-realtime/spec/en/reference.md#enum-occupancystatus

There is currently a proposal being drafted that would allow more details about a vehicle, including more granular quantitative occupancy, to be expressed in GTFS and GTFS-rt. I'd welcome comments and ideas from everyone on the current draft spec: http://bit.ly/gtfs-vehicles

scrudden commented 5 years ago

@barbeau Where is the best place to comment on gtfs-vehicles? Directly in the Google Doc? From what I have read so far, the proposal seems to capture occupancy well, but for my intended purpose I would like to know the number of passengers boarding and alighting at each stop.

barbeau commented 5 years ago

Yes, just comment in the Google Doc right now.