georust / transitfeed

Public transit serializer/deserializer and manipulation library for Rust
Apache License 2.0

1.0 API Design #5

Open teburd opened 6 years ago

teburd commented 6 years ago

It would be much nicer to simply work with the .zip file

Ex:

let gtfs = GTFS::from_zip("gtfs.zip").unwrap();
for agency in gtfs.agencies() {
    println!("{:?}", agency);
}
for stop in gtfs.stops() {
    println!("{:?}", stop);
}

Solved with #14

teburd commented 6 years ago

For a 1.0 release, let's talk about what we want this crate to look like and how we want it to work.

I think we're on the right track so far. Generally speaking I'd like to differentiate between GTFS the format, and a TransitNetwork the API.

Things I'd want to be able to do with a GTFS type

Versus things I'd want to be able to do with a TransitNetwork type

I'd like to think that TransitNetwork is a trait, not an implementation, and that there would be an implementation for GTFS, but that there might also be implementations for the wide variety of live transit APIs out there.

I think a lot of what you're thinking is similar to what I was thinking for what I'm tentatively calling the TransitNetwork trait. It can aggregate shapes/stop times and have various indices and such to help provide the convenience I think most people (myself included) really want to get out of the data that is stored in GTFS and provided by various transit APIs.
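To make the GTFS-the-format versus TransitNetwork-the-API split concrete, here is a minimal sketch. Everything in it (`StopTime`, `TransitNetwork`, `GtfsNetwork`, `departures_at`) is a hypothetical name chosen for illustration, not this crate's API; it only shows a trait that answers a consumer-level question, with one GTFS-backed implementation that keeps an index by stop.

```rust
use std::collections::HashMap;

/// One scheduled call of a trip at a stop (hypothetical, simplified).
#[derive(Debug, Clone, PartialEq)]
pub struct StopTime {
    pub trip_id: String,
    pub stop_id: String,
    /// Departure in seconds since the start of the service day.
    pub departure_secs: u32,
}

/// Hypothetical trait: the questions a feed consumer asks, independent of
/// where the data comes from (GTFS files, a live API, ...).
pub trait TransitNetwork {
    /// All departures at `stop_id`, ordered by departure time.
    fn departures_at(&self, stop_id: &str) -> Vec<&StopTime>;
}

/// Hypothetical GTFS-backed implementation holding an index by stop.
pub struct GtfsNetwork {
    by_stop: HashMap<String, Vec<StopTime>>,
}

impl GtfsNetwork {
    /// Builds the per-stop index from already-parsed stop_times rows.
    pub fn from_stop_times(rows: impl IntoIterator<Item = StopTime>) -> Self {
        let mut by_stop: HashMap<String, Vec<StopTime>> = HashMap::new();
        for row in rows {
            by_stop.entry(row.stop_id.clone()).or_default().push(row);
        }
        for list in by_stop.values_mut() {
            list.sort_by_key(|st| st.departure_secs);
        }
        GtfsNetwork { by_stop }
    }
}

impl TransitNetwork for GtfsNetwork {
    fn departures_at(&self, stop_id: &str) -> Vec<&StopTime> {
        self.by_stop
            .get(stop_id)
            .map(|list| list.iter().collect::<Vec<_>>())
            .unwrap_or_default()
    }
}
```

A live-API backend could implement the same trait without touching any code written against `TransitNetwork`.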

I look forward to your thoughts. If we get a nice plan together we can break things down and work on various parts individually and together as we need to.

medwards commented 6 years ago

I don't need writing or validation atm but FWIW I think you're right that it should be included in a 1.0 release. GTFS reading is also still unfinished: we're missing Extended GTFS support. Whether that needs to be in place for a 1.0 release or not is a separate discussion, for now I'm just reminding you.

As far as TransitNetwork stuff: I think you've defined a good set of high level features that are of interest to feed consumers. I'd be careful about trying to get abstract and making it a trait at this juncture, in particular GTFS makes a lot of decisions about its domain model that aren't reflected in other specs. That's a decision we can always revisit anyways.

I have some ideas for an extended feature set for TransitNetwork too but the scope is already pretty big so imma hold back.

teburd commented 5 years ago

I think ignoring writing out a transit feed for now is fine, and validation is partially done by simply reading the files the way we are, as there are some formatting checks done already. In my opinion, what we have now, after a lot of great work from @medwards, is 1.0 unless anyone feels otherwise.

medwards commented 5 years ago

I briefly looked over the GTFS Extensions ( https://developers.google.com/transit/gtfs/reference/gtfs-extensions ) and I'm worried about the optional columns that it introduces. Those can't be supported without breaking backwards compatibility right now.

I'd also want to introduce a ShapePoint and StopTime helper before announcing it anywhere but that's not a 1.0 blocker.

teburd commented 5 years ago

@medwards sounds like a plan to me

derhuerst commented 4 years ago

Hey! 👋

I'm currently working on https://github.com/public-transport/gtfs-utils/pull/25, an overhaul to the gtfs-utils JavaScript library. I thought about porting it to Rust for better performance and then found this repo.

My 2 cents on API design from the gtfs-utils/JavaScript perspective:

reusability

It can aggregate shapes/stop times, have various indices and such to help provide the convenience I think most people (myself included) really want to get out of the data that is stored in GTFS and provided by various transit APIs

Answering quite basic (but, in practical usage of GTFS, very relevant) questions like "When does any vehicle depart at a bus stop?" is a surprising amount of work: GTFS Time values are inherently timezone-dependent, frequencies.txt with exact_times=1 defines "stop times" as well, etc. With the ever-growing number of optional parts and extensions, doing GTFS processing right is a lot of work, so we should make the implementation in this project as reusable/flexible as possible.
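As a small illustration of why GTFS Time values need care, here is a sketch of a parser. `parse_gtfs_time` is a hypothetical helper, not part of this crate or of gtfs-utils. It relies on the GTFS rule that times are offsets from the start of the service day and that the hour may exceed 23 (e.g. 25:30:00 for a trip running past midnight). Turning the result into an absolute instant still needs the service date and the agency timezone, which this sketch leaves out.

```rust
/// Parses a GTFS time value such as "08:15:00" or "25:30:00" into seconds
/// since the start of the service day. Hours may exceed 23; minutes and
/// seconds must be in 0..=59. Returns `None` for anything else.
pub fn parse_gtfs_time(value: &str) -> Option<u32> {
    let mut parts = value.trim().split(':');
    let hours: u32 = parts.next()?.parse().ok()?;
    let minutes: u32 = parts.next()?.parse().ok()?;
    let seconds: u32 = parts.next()?.parse().ok()?;
    // Reject extra components and out-of-range minutes/seconds.
    if parts.next().is_some() || minutes > 59 || seconds > 59 {
        return None;
    }
    // Checked arithmetic so an absurdly large hour cannot overflow.
    hours.checked_mul(3600)?.checked_add(minutes * 60 + seconds)
}
```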

Also, a project- and language-independent test suite, i.e. a set of fixtures per "question"/operation, would be very helpful for this. Those have been very successful in other areas, e.g. for WebSocket implementations.

storage-independence

Personally, I really want GTFS to move away from .zip archives. They are inherently unfriendly to many things that GTFS would benefit greatly from: ever-updating "live" feeds, caching, content-addressed storage, sparse replication/access. There are far better tools for transferring/packaging/versioning a set of files!

With gtfs-utils, I try to push towards storage-independent GTFS processing (as in "read trips.txt from somewhere, I don't care where"). Public Rust traits seem to be a great tool for this.
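A minimal sketch of what such a storage-independent trait could look like. `FeedSource`, `MemorySource` and `count_rows` are hypothetical names for illustration only; neither this crate nor gtfs-utils is claimed to expose them. The point is that the processing code asks for a file by name and never learns whether it came from a directory, a zip archive, or the network.

```rust
use std::collections::HashMap;
use std::io::{self, BufRead, Cursor};

/// Hypothetical trait: anything that can hand out a GTFS file by name.
pub trait FeedSource {
    fn open(&self, file_name: &str) -> io::Result<Box<dyn BufRead + '_>>;
}

/// In-memory source, handy for tests. A directory, a zip archive or an
/// HTTP backend could implement the same trait.
#[derive(Default)]
pub struct MemorySource {
    files: HashMap<String, String>,
}

impl MemorySource {
    pub fn insert(&mut self, name: &str, contents: &str) {
        self.files.insert(name.to_string(), contents.to_string());
    }
}

impl FeedSource for MemorySource {
    fn open(&self, file_name: &str) -> io::Result<Box<dyn BufRead + '_>> {
        let contents = self.files.get(file_name).ok_or_else(|| {
            io::Error::new(io::ErrorKind::NotFound, file_name.to_string())
        })?;
        Ok(Box::new(Cursor::new(contents.as_bytes())))
    }
}

/// Counts the data rows (every line after the header) of one file,
/// reading line by line through whatever source was passed in.
pub fn count_rows(source: &dyn FeedSource, file_name: &str) -> io::Result<usize> {
    let reader = source.open(file_name)?;
    let mut rows = 0;
    for line in reader.lines().skip(1) {
        line?; // propagate I/O errors instead of counting them
        rows += 1;
    }
    Ok(rows)
}
```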

scalability

GTFS feeds will be significantly larger than the hundreds-of-MB feeds that are common now. The Germany-wide feed is 2.5GB already, so, for example, a European feed including a lot of shapes will probably be dozens of GB in size.

With gtfs-utils, I therefore try to read as little data into memory as needed for a certain operation, and add a storage API layer for storing intermediate data in other places than memory. In gtfs-utils, this is an async key-value store API that uses memory by default. Again, a publicly exposed trait seems to be very fitting. This of course still leaves open the possibility of reading all data into memory for low latency and high performance.
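Sketched in Rust, such a storage layer might look as follows. This is a synchronous simplification (the gtfs-utils API described above is async), and `KeyValueStore`, `MemoryStore`, `count_trips_per_route` and `read_count` are hypothetical names used only for this example. The operation is written against the trait, so its intermediate counters could live in memory, on disk, or in an external store.

```rust
use std::collections::BTreeMap;

/// Hypothetical storage layer for intermediate data.
pub trait KeyValueStore {
    fn put(&mut self, key: &str, value: Vec<u8>);
    fn get(&self, key: &str) -> Option<Vec<u8>>;
}

/// Default backend: everything stays in memory.
#[derive(Default)]
pub struct MemoryStore {
    entries: BTreeMap<String, Vec<u8>>,
}

impl KeyValueStore for MemoryStore {
    fn put(&mut self, key: &str, value: Vec<u8>) {
        self.entries.insert(key.to_string(), value);
    }
    fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.entries.get(key).cloned()
    }
}

/// Reads a counter stored as 8 little-endian bytes; missing or malformed
/// entries count as 0.
pub fn read_count<S: KeyValueStore>(store: &S, key: &str) -> u64 {
    match store.get(key) {
        Some(bytes) if bytes.len() == 8 => {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(&bytes);
            u64::from_le_bytes(buf)
        }
        _ => 0,
    }
}

/// Example operation: counts trips per route, one route_id per trip row,
/// keeping the running counters in the store rather than in local state.
pub fn count_trips_per_route<S: KeyValueStore>(
    store: &mut S,
    route_ids: impl IntoIterator<Item = String>,
) {
    for route_id in route_ids {
        let next = read_count(store, &route_id) + 1;
        store.put(&route_id, next.to_le_bytes().to_vec());
    }
}
```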

If the input files are sorted in a specific way, we can increase processing speed as follows:

validation

There are a bazillion validation (i.e. "semantic checks on the actual data") cases. The best practices page is long, and the GTFS issue tracker and mailing lists are full of edge cases. There are at least 20 libs across languages doing some form of validation, but none of them cover all the issues that we see with GTFS feeds in the wild.

I'd dare to say that people don't care which language a GTFS validator is written in, but they strongly prefer a certain language for "questions"/analysis. Like the "questions"/analysis mentioned above, validation lends itself to a project- and language-independent set of fixtures, maintained by the wider GTFS community. I hope this will push the overall quality of GTFS feeds, and reduce the amount of duplicated work poured into all those GTFS validation libs. I therefore propose not to put too much effort into validation in this project (I'm obviously just a random stranger telling you what to do 😬).

Edit: I have created https://github.com/public-transport/ideas/issues/17 for the out-of-scope task of creating such a cross-project GTFS test suite.

antoine-de commented 4 years ago

Just to give some pointers (and it might give you some ideas), there are already several Rust libraries for GTFS handling:

derhuerst commented 3 years ago

Just giving an update on https://github.com/georust/transitfeed/issues/5#issuecomment-649095015 here.

I have implemented public-transport/gtfs-utils#25; gtfs-utils now relies on a specific order in the individual GTFS files, in order to only read those rows into memory that are relevant for a specific merge operation, e.g. when merging stop_times, trips, & calendar/calendar_dates. With JavaScript being inherently unsuited and slow for this type of sequential data processing though, I came back to find out how to do common higher-level GTFS operations (like "Which vehicles stop at stop A on Nov 3rd at 7pm?") in Rust.
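The idea of relying on sort order can be sketched as a merge join. This is not the gtfs-utils implementation; `merge_by_trip_id` is a hypothetical function, and it assumes that both inputs are already sorted by trip_id in ascending byte order. Under that assumption only the rows of a single trip are held in memory at any time.

```rust
/// Joins trips with their stop_times rows.
///
/// Assumption: `trips` (trip_id) and `stop_times` ((trip_id, stop_id)) are
/// BOTH sorted by trip_id ascending. `on_trip` is called once per trip with
/// the stop_ids that belong to it, in input order.
pub fn merge_by_trip_id<T, S, F>(trips: T, stop_times: S, mut on_trip: F)
where
    T: IntoIterator<Item = String>,
    S: IntoIterator<Item = (String, String)>,
    F: FnMut(&str, &[String]),
{
    let mut stop_times = stop_times.into_iter().peekable();
    let mut stops: Vec<String> = Vec::new();

    for trip_id in trips {
        // Drop stop_times rows that sort before the current trip: they
        // reference a trip that is not in trips.txt.
        while stop_times
            .next_if(|(id, _)| id.as_str() < trip_id.as_str())
            .is_some()
        {}

        // Collect the rows of exactly this trip, then hand them over.
        stops.clear();
        while let Some((_, stop_id)) = stop_times.next_if(|(id, _)| *id == trip_id) {
            stops.push(stop_id);
        }
        on_trip(&trip_id, &stops);
    }
}
```

Both inputs are consumed in a single pass, so memory use is bounded by the longest trip rather than by the size of stop_times.txt.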

I'm a Rust junior, so forgive me if I ask such naive questions, but is it true that gtfs-structure is essentially the same thing as this project, except that it can optionally read data into a HashMap? If that is the case, let's discuss merging the two projects!

teburd commented 3 years ago

@derhuerst

This crate provides a lazy iterator over CSV rather than attempting to parse and load the entire GTFS file set into memory all at once.
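To illustrate what "lazy iterator over CSV" means in practice, here is a standard-library-only sketch. It is not this crate's implementation: `Stop` and `stops` are hypothetical, the column order stop_id,stop_name is assumed, and fields are split naively on commas without any quote handling. Each row is parsed only when the consumer asks for the next item.

```rust
use std::io::BufRead;

/// A hypothetical, simplified stop record.
#[derive(Debug, PartialEq)]
pub struct Stop {
    pub id: String,
    pub name: String,
}

/// Lazily yields one `Stop` per data line. Nothing is read from `reader`
/// until the returned iterator is advanced.
/// Assumes the columns are `stop_id,stop_name` and that no field contains
/// a comma or quotes.
pub fn stops<R: BufRead>(reader: R) -> impl Iterator<Item = Result<Stop, String>> {
    reader.lines().skip(1).map(|line| {
        let line = line.map_err(|e| e.to_string())?;
        let mut fields = line.split(',');
        match (fields.next(), fields.next()) {
            (Some(id), Some(name)) => Ok(Stop {
                id: id.to_string(),
                name: name.to_string(),
            }),
            _ => Err(format!("malformed row: {}", line)),
        }
    })
}
```

Because errors are yielded per row, a caller can decide whether to stop at the first malformed line or skip it and continue.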

Something I found particularly painful when writing tflgtfs (Transport for London to GTFS)