UrbanAnalyst / gtfsrouter

Routing and analysis engine for GTFS (General Transit Feed Specification) data
https://urbananalyst.github.io/gtfsrouter/

Extremely high RAM requirement for `gtfs_transfer_table` with large dataset #91

Closed · jh0ker closed this 1 year ago

jh0ker commented 1 year ago

Hello!

We've been using this library to parse a large GTFS feed and compute travel times between many stops using `gtfs_traveltimes`. However, we noticed it seemed to miss some connections that it should have found. I realized that this was due to missing transfers between stops (even those that share the same name), so I attempted to use `gtfs_transfer_table` to fill in those missing transfers.

On the CRAN version (which I believe was released before this algorithm was ported to C++), this crashes almost immediately. On the latest GitHub version (as of yesterday), this runs for a couple of minutes, as I would expect for a large dataset, but then it crashes as well. Both crashes produce the same error message:

```
Error: cannot allocate vector of size 1675.0 Gb
```

Now, I would expect maybe something on the order of a couple of gigabytes to be required to store the result of this operation, but not terabytes. Is this expected due to some inherent limitation of the algorithm itself, or is it perhaps possible to make it viable for our use case? I'm happy to try to monkey-patch it just for myself or even make a PR, but I'd be grateful if you could point me in the direction where this extreme memory allocation likely occurs, as I have very little experience in debugging R code.

My script:

```r
install.packages("remotes")
remotes::install_github("ATFutures/gtfs-router")

library(gtfsrouter)
packageVersion("gtfsrouter")
# [1] ‘0.0.5.102’

# Load the full-Germany GTFS feed:
gtfs_de <-
  extract_gtfs("20220829_fahrplaene_gesamtdeutschland_gtfs.zip",
               quiet = FALSE)

# Restrict the timetable to a single day:
timetable <- gtfs_timetable(gtfs = gtfs_de, date = 20220926)

# This is the call that crashes, while attempting to build transfers
# between all pairs of stops within 500 m of one another:
transfertable <- gtfs_transfer_table(timetable, d_limit = 500)
```

The source of the GTFS data is DELFI e. V.; there is an official download here, but it's regularly updated and you need to create an account to download it. I've uploaded the exact file I used here for your convenience.

Thank you for your work on this very useful library. Please let me know if you need any additional information from me.

mpadge commented 1 year ago

Thanks @jh0ker, and unfortunately for you, yes, that is an internal design feature. "Fahrpläne_gesamt_DE" is enormous, and under the hood the routine needs to calculate pairwise distances between all stops. For N stops, this is an N-by-N matrix, which would indeed be enormous for "gesamt_DE"!
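For a sense of scale, the reported allocation is consistent with a dense double-precision matrix over roughly half a million stops, which is plausibly the stop count of a full-Germany feed. A back-of-envelope check (assuming 8 bytes per matrix entry and that R reports binary gigabytes):

```r
# A dense N-by-N double matrix needs 8 * N^2 bytes. Solving for N
# from the reported "1675.0 Gb":
n_stops <- sqrt(1675 * 1024^3 / 8)
n_stops
#> [1] 474146.3
```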

Most of those distances are ultimately not needed, so this glitch in your workflow could technically be avoided, but that would require replacing the current all-pairs algorithm with some kind of heuristic spatial-neighbours algorithm that would, for most use cases, actually be slower than the current code.

(The actual sizes of pre-allocated vectors are OS-dependent. From your error message, I guess you're on some kind of Linux system, on which actual sizes can be tweaked, but that won't stop the error in this case.)

I'll have a bit more of a look into potential memory reductions and get back to you. Thanks!

jh0ker commented 1 year ago

Hello @mpadge, thank you for the quick response!

> unfortunately for you, yes that is an internal design feature.

That's unfortunate indeed, although I figured it was something like that. I suppose I was hoping this was already addressed in the C++ implementation (e.g. by going through the stops in chunks or something like that) and this large allocation was just a leftover. It does seem a little surprising to me that it runs multiple minutes of calculations before the error occurs.

"Fahrpläne_gesamt_DE" is enormous

Yes, I'm aware 😜 I'm okay if the only way for this to work is me writing my own, possibly less optimized, code for filling in the transfers. Speed isn't a huge concern for me in this case. I still figured it might be worth asking here.

> I guess you're on some kind of Linux system

I'm on Windows 10 at the moment.

> I'll have a bit more of a look into potential memory reductions and get back to you.

Awesome!

Thank you again and have a nice rest of your evening.

mpadge commented 1 year ago

@jh0ker The good news: the algorithm can be rewritten to avoid constructing the huge matrix. The bad news: it can't be done in R. A trial implementation runs about 6 times slower, because the current huge-matrix construction is actually very efficient, as long as it fits in memory.
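To illustrate the trade-off, here is a sketch in plain R, not the package internals; `gtfs$stops` and `d_limit` are assumed to be as in the examples in this thread:

```r
library(geodist)
d_limit <- 500 # transfer distance threshold in metres (assumed)

# Vectorised all-pairs distances: fast, but needs O(N^2) memory.
d <- geodist(gtfs$stops, measure = "haversine")
transfers <- which(d <= d_limit & upper.tri(d), arr.ind = TRUE)

# Row-at-a-time alternative: O(N) memory, but far slower in R.
nbs <- lapply(seq_len(nrow(gtfs$stops)), function(i) {
    d_i <- geodist(gtfs$stops[i, ], gtfs$stops, measure = "haversine")
    which(d_i <= d_limit)
})
```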

That just means I have to rewrite that entire bit of the algorithm in C++. It should then be possible to make it even faster than the current version, so it is definitely something worth doing beyond your particular case. So I'll happily leave this open and document the implementation here. Shouldn't take too long.

jh0ker commented 1 year ago

@mpadge Sounds awesome! I'm not sure we'll be able to profit from that improvement, since we're on a bit of a deadline, but I'll be happy to try it out once it's done and compare it with my solution.

I've been working on my own little Python script that basically divides the area into equal-sized chunks of 50 km × 50 km with an overlap of 1 km, generates transfers for each chunk, and deduplicates afterwards. I'm gonna let it run overnight (it looks like it will take a couple of hours) and hopefully the result will be sufficient for now. Not the best approach, but good enough and easy enough to write and optimize with Python & GeoPandas. So don't feel compelled to rush this for my sake 😄
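A minimal sketch of that chunking idea, translated into R to match the rest of this thread (the actual script used Python & GeoPandas; the helper name, tile parameters, and the assumption that `stops` is a data.frame with `stop_id`, `stop_lon`, `stop_lat` columns are all hypothetical):

```r
library(geodist)

# Tile the feed's bounding box into ~50 km squares with a 1 km overlap,
# find close stop pairs within each tile, then deduplicate pairs seen
# in more than one tile.
chunked_transfer_pairs <- function(stops, d_limit = 500,
                                   chunk_km = 50, overlap_km = 1) {

    # Rough km-to-degree conversion at the feed's mean latitude:
    lat0 <- mean(stops$stop_lat)
    dlat <- chunk_km / 110.6
    dlon <- chunk_km / (111.3 * cos(lat0 * pi / 180))
    olat <- overlap_km / 110.6
    olon <- overlap_km / (111.3 * cos(lat0 * pi / 180))

    lat_breaks <- seq(min(stops$stop_lat), max(stops$stop_lat) + dlat, by = dlat)
    lon_breaks <- seq(min(stops$stop_lon), max(stops$stop_lon) + dlon, by = dlon)

    pairs <- list()
    for (i in seq_len(length(lat_breaks) - 1)) {
        for (j in seq_len(length(lon_breaks) - 1)) {
            # Expand each tile by the overlap so pairs crossing tile
            # boundaries are not missed:
            keep <- stops$stop_lat >= lat_breaks[i] - olat &
                stops$stop_lat < lat_breaks[i + 1] + olat &
                stops$stop_lon >= lon_breaks[j] - olon &
                stops$stop_lon < lon_breaks[j + 1] + olon
            s <- stops[keep, ]
            if (nrow(s) < 2L) next

            # Dense pairwise distances, but only within one small tile:
            d <- geodist(s, measure = "haversine")
            idx <- which(d <= d_limit & upper.tri(d), arr.ind = TRUE)
            pairs[[length(pairs) + 1L]] <- data.frame(
                from_stop_id = s$stop_id[idx[, 1]],
                to_stop_id = s$stop_id[idx[, 2]],
                d = d[idx]
            )
        }
    }

    # Overlapping tiles see some pairs twice; deduplicate at the end:
    unique(do.call(rbind, pairs))
}
```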

I do have one more question that came up in relation to this, if you don't mind: Does gtfs_traveltimes use any kind of "implicit" transfers, eg. consider a transfer via a stop with the same stop_name or parent_station even if there is no explicit record for it in the transfers table? What about a transfer on the same stop? I don't think it does, but it would be great to have it confirmed.

Either way, thank you once more for your help and have a nice evening.

hansmib commented 1 year ago

As far as I can remember you are right, but @mpadge created a super helpful feature that can create a transfer table for you: https://atfutures.github.io/gtfs-router/reference/gtfs_transfer_table.html, https://github.com/ATFutures/gtfs-router/blob/main/vignettes/transfers.Rmd

You might run into issues with the size of the feed or transfers across the individual chunks, but it's worth checking out!

jh0ker commented 1 year ago

@hansmib Ah well, that function is indeed the subject of this issue.

To clarify, I have now written a script that is basically a pre-processing step that I run on the feed to extend the transfers.txt with the missing transfers, just as `gtfs_transfer_table` does. Then I zip up the result and load it with this library.

Anyways, thanks for the confirmation 😄

hansmib commented 1 year ago

Oops! I should've read at least the subject. 🤦

mpadge commented 1 year ago

@jh0ker All done now anyway. Most of the work had already been prepared in #14, so this was a good incentive to finish that. The function should now be much faster in general, especially on larger feeds, and should also work fine for arbitrary numbers of stops. On a trial feed of around 2,000 stops, this new version is about 25% faster. Now I just have to document all of this via #85.

jh0ker commented 1 year ago

Oh that's great! I'll give it a test right now!

jh0ker commented 1 year ago

@mpadge I ran it for about 8 hours today, but it didn't finish within that time. I had to abort it, since I need the RAM to run other data processing for my project ^^ I suspect that for my use case, it's still slower than my version, which only compares distances within a chunk. That is probably a necessary optimization for a feed as big as mine, just to make the O(n²) complexity not hurt quite as much. I think my version finished in about 4 hours.

On that note: I had a quick look at your implementation. Have you considered calculating the distance for each unique pair of stops only once, in either direction, and then at the end simply duplicating all your transfers mirrored in the opposite direction? The way I understand it, you could essentially skip half the work with no downside and very little effort (please correct me if I'm wrong on this).
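In other words, something like the following sketch (a paraphrase of the suggestion, not code from either implementation; `stops` and `d_limit` are placeholders as above):

```r
library(geodist)

# Compute each unordered pair once (j > i): n * (n - 1) / 2 comparisons.
n <- nrow(stops)
res <- list()
for (i in seq_len(n - 1L)) {
    js <- seq(i + 1L, n)
    d_i <- geodist(stops[i, ], stops[js, ], measure = "haversine")
    ok <- which(d_i <= d_limit)
    if (length(ok) > 0L) {
        res[[length(res) + 1L]] <- data.frame(
            from = stops$stop_id[i],
            to = stops$stop_id[js[ok]],
            d = as.numeric(d_i[ok])
        )
    }
}
transfers <- do.call(rbind, res)

# Distance is symmetric, so the reverse direction is a straight mirror:
transfers <- rbind(transfers,
                   data.frame(from = transfers$to,
                              to = transfers$from,
                              d = transfers$d))
```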

I'll have another go at running your version when I have some free CPU time, even if it's just to compare the results.

mpadge commented 1 year ago

Thanks @jh0ker, and interestingly, the former routine only calculated distances between unique pairs, and that was the only bit which I did not carry across to the new one. I figured that calculation would generally be quick enough that it should not matter much, but obviously in your case it does. The problem with unique pairs is that it requires conversion of coordinate pairs to some singular representation, for which the old code rounded both numbers to fixed precision and concatenated them as strings. That introduces extra overhead for "normal" use cases, so there's a trade-off here. But I'll re-open this to investigate further what can be done.
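The string-key idea described there can be sketched in a couple of lines of R (a paraphrase, not the old code itself; the rounding precision is an assumption):

```r
# Collapse duplicated stop locations by rounding coordinates and
# pasting them into a single string key:
key <- paste(round(stops$stop_lon, 6), round(stops$stop_lat, 6), sep = "|")
idx <- match(key, key) # index of the first occurrence of each location
# Distances then need computing only between the unique(idx) stops,
# with results indexed back out to all stops via 'idx'.
```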

mpadge commented 1 year ago

Note: wrong issue reference on those comments, which should have been #89 and not here.

mpadge commented 1 year ago

@jh0ker I've got a PR in #92 with C++ code to only calculate transfers between unique pairs.

**The bad news**

In a pretty standard-size test feed of just over 2,000 stops (with very few duplicated coordinates), it slows things down by around 50%, from 127 ms to 181 ms.

**The good news**

For the "VBB" feed for Berlin-Brandenburg with over 40,000 stops, it's almost 90% faster, from 51s to 6.5s.

**Conclusion**

As long as we may presume that feeds with more stops will tend to have more redundancy, speed-ups should generally be most notable on larger feeds, and any feeds for which calculating transfers only between unique pairs slows things down will be small enough that they'll calculate effectively instantaneously anyway (< 200 ms in the trials above). I think that means this is very much worth doing. Any thoughts before I merge?

jh0ker commented 1 year ago

Awesome, I'll check it out right now!

By the way, the project (at least the first version of it) that sparked this issue is now live here (it's in German).

Also, I believe I found a bug in the `gtfs_traveltimes` function; I'll open a separate issue for that in a bit as well. Among other reasons, that's why we ended up using tidytransit instead of this package.

mpadge commented 1 year ago

@jh0ker It's taken a while to get back to this, but the above commit implements your idea of only doing n(n-1)/2 comparisons instead of n². These are the results:

```r
devtools::load_all (".", export_all = TRUE)
#> ℹ Loading gtfsrouter
gtfs <- extract_gtfs ("./feeds/mannheim-gtfs.zip")
#> ▶ Unzipping GTFS archive ✔ Unzipped GTFS archive
#> Warning: This feed contains no transfers.txt
#>   A transfers.txt table may be constructed with the 'gtfs_transfer_table' function
#> ▶ Extracting GTFS feed ✔ Extracted GTFS feed
#> ▶ Converting stop times to seconds ✔ Converted stop times to seconds
d_limit <- 200

bench::mark (
    transfers <- rcpp_transfer_nbs (gtfs$stops, d_limit),
    transfers <- rcpp_transfer_nbs2 (gtfs$stops, d_limit),
    check = FALSE
)
#> # A tibble: 2 × 6
#>   expression                                                min   median itr/s…¹
#>   <bch:expr>                                           <bch:tm> <bch:tm>   <dbl>
#> 1 transfers <- rcpp_transfer_nbs(gtfs$stops, d_limit)     612ms    612ms    1.63
#> 2 transfers <- rcpp_transfer_nbs2(gtfs$stops, d_limit)    164ms    164ms    6.09
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>, and abbreviated
#> #   variable name ¹​`itr/sec`
```

Created on 2023-01-13 with reprex v2.0.2

The next commit will ditch the previous form in favour of the new one. The initial code to reduce down to unique pairs of coordinates will not be retained, because indexing in this new form is already complicated enough, and indexing back out of unique pairs to full indices on top of that would make the code too difficult to understand. That likely means only a mild loss of efficiency in most cases, compared with the huge gains from reducing the n² down to n(n-1)/2.
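To give a flavour of why that indexing is fiddly (a generic sketch, not the package's actual C++ code): recovering the pair (i, j) from a single linear index k over the n(n-1)/2 unordered pairs already takes a small quadratic-formula inversion, before any unique-coordinate mapping is layered on top.

```r
# Map a 0-based linear index k in [0, n(n-1)/2) back to the 0-based
# pair (i, j) with i < j, by inverting the row-offset formula
# offset(i) = i * (2n - i - 1) / 2:
pair_from_index <- function(k, n) {
    i <- floor((2 * n - 1 - sqrt((2 * n - 1)^2 - 8 * k)) / 2)
    j <- k - i * (2 * n - i - 1) / 2 + i + 1
    c(i = i, j = j)
}
pair_from_index(5, 4) # the last of the 6 pairs for n = 4: (2, 3)
```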

mpadge commented 1 year ago

Merging of #92 now addresses all of this, so closing now. Thanks for all of your help @jh0ker!!

jh0ker commented 1 year ago

Awesome, glad to hear it! Looks like it gives some substantial improvements indeed :) Even more than I would have expected, really, but who would complain about that ^^