UrbanAnalyst / gtfsrouter

Routing and analysis engine for GTFS (General Transit Feed Specification) data
https://urbananalyst.github.io/gtfsrouter/
82 stars 17 forks source link

`frequencies_to_stop_times` slow execution speed in large datasets #89

Open cseveren opened 2 years ago

cseveren commented 2 years ago

Thanks for the awesome and helpful package! I get a lot of value out of frequencies_to_stop_times, but I've noticed that it's quite slow when working at scale (i.e., it takes several hours on my reasonably powerful desktop to go from 37k stop times and 1k trips to 4.4mil stop times and 140k trips for recent Mexico City freq-based GTFS feeds). This is almost assuredly due to the use of rbind here to append row by row onto the eventually quite huge stop_times data.table which is used to store and eventually output results:

 stop_times <- rbind (stop_times, stop_times_trip_i)

I propose three alternatives, which I've tried to implement locally on my machine but I've not worked with rcpp before and so I couldn't make work. I can't quite tell if stop_times_trip_i is always 1 row long or not. If so:

Else (if it's number of rows vary), here are two other ideas:

This last option, while not totally efficient, will probably substantially increase performance without needing to explicitly deal with preallocation. Again, thanks for the great package!

cseveren commented 2 years ago

Ok -- In my use case, following the third option above (not exactly, I made a couple errors) sped execution time ~100x. It still drags toward the end bc it's accumulating to a progressively larger and larger data.table, but it's enough of a boost to make my use case work.

My fork is here: https://github.com/cseveren/gtfs-router - Let me know if you'd like me to submit a pull request. NB: I also added display tracker/counter on original trips.

dhersz commented 2 years ago

Bringing the discussion that arose in r5r's repo to this issue:

The development version of gtfstools also includes a frequencies_to_stop_times() function. It's currently a bit faster than gtfsrouter's:

path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")

gtfstools_gtfs <- gtfstools::read_gtfs(path)
gtfsrouter_gtfs <- gtfsrouter::extract_gtfs(path, quiet = TRUE)

microbenchmark::microbenchmark(
  cvt_gtfstools_gtfs <- gtfstools::frequencies_to_stop_times(gtfstools_gtfs),
  cvt_gtfsrouter_gtfs <- gtfsrouter::frequencies_to_stop_times(gtfsrouter_gtfs),
  times = 5L
)
#> Unit: milliseconds
#>                                                                           expr
#>     cvt_gtfstools_gtfs <- gtfstools::frequencies_to_stop_times(gtfstools_gtfs)
#>  cvt_gtfsrouter_gtfs <- gtfsrouter::frequencies_to_stop_times(gtfsrouter_gtfs)
#>         min        lq      mean   median         uq        max neval
#>    777.2107   793.785   875.153   925.58   937.2485   941.9406     5
#>  26631.9393 31469.455 31593.529 32363.29 32884.5130 34618.4487     5

Also, there seems to be a bug in gtfsrouter's, since it doesn't update the frequencies table:

cvt_gtfstools_gtfs$frequencies
#> NULL

cvt_gtfsrouter_gtfs$frequencies
#>         trip_id start_time end_time headway_secs
#>   1: CPTM L07-0      14400    17940          720
#>   2: CPTM L07-0      18000    21540          360
#>   3: CPTM L07-0      21600    25140          360
#>   4: CPTM L07-0      25200    28740          360
#>   5: CPTM L07-0      28800    32340          360
#>  ---                                            
#> 700:  5290-10-1      79200    82740         1200
#> 701:  5290-10-1      82800    86340         1200
#> 702:  6450-51-0      18000    21540         3600
#> 703:  6450-51-0      21600    25140         3600
#> 704:  6450-51-0      25200    28740         3600

It would be interesting to hear from @cseveren if gtfstools' function works fine for his use case. I could also work on a PR if @mpadge is happy with my implementation and wants to use it.

P.S.: I don't want to sound arrogant/snob coming to this issue and showing that "my function" is faster, or anything like that. Before implementing frequencies_to_stop_times() in gtfstools I admittedly looked at gtfsrouter implementation and took a lot of inspiration from it. This conversation randomly came up in r5r as I was looking at some old open issues, and I decided that moving it to this issue would be beneficial because I'd be able to properly report what seems to me like a bug and we would be able to discuss both implementations in further details.

cseveren commented 2 years ago

Much faster than my hack-y solution! Took only 20-30 seconds.

mpadge commented 2 years ago

Wow, thanks @cseveren for raising this important issue, and especially thanks to @dhersz for chiming in and offering a clearly superior solution, for which you definitely do not sound arrogant :laughing: I was originally going to just ask for a PR of the gtfstools function into here (mostly because I want to keep this package as light on dependencies as possible, and importing gtfstools would then brings in the ultimately heavy dep of sf, and a few others). But then I had a bit of a look and realised it might help all of us if I try to rewrite the function in C++. The version here was just assembled as a hack, and the slowness is precisely for reasons you asserted @cseveren, but that's something that could be totally circumvented by porting the function to C++.

My suggestion would then be for me to start doing that in a branch here which we can use to compare with the current gtfstools implementation. I'll make sure I do it cpp11-style, rather than Rcpp, so it can easily be plugged straight in there if it turns out to offer significant speed gains. Only catch: I'm about to be away from normal routine for a few weeks, so may have little opportunity before March (or later) to get cracking on this. Hope that's okay!

mpadge commented 2 years ago

Those commits finally finish converting the "frequencies_to_stop_times" routine to C++. It's slightly faster than @dhersz's gtfstools version, although not by a huge amount (around 20%). I think the big advantage is that this operation is much clearer when coded in C++, and so actually takes much less code as well - around 70 lines of very clear C++ plus 78 lines of R, most of which are input checks. This compares with gtfstools R function of 336 lines, and former function here of 170 lines that was (i) confusing, and (ii) didn't work properly.

@dhersz Our two packages do not give identical results, but largely because this procedure requires a few seemingly arbitrary decisions to be made. These notably include:

  1. What to to if "stop_time" for a "frequencies" entry is marginally less than some integral of "start_time + n * headway" - this does happen, and I've opted to include an addiitonal trip in those cases.
  2. What do to when "stop_times" entries for "trip_id" values which are in the frequencies table have "arrival/departure_time" values of "0", yet "trip_id" values flagged in "frequencies" tables with "exact_times" values of 1. These latter values of 1 are supposed to indicate exact arrival/departure times, which should be used as given, but obviously arrival/departure time values of 0 are nonsense. I've opted to ignore "exact_times" flags in "frequencies" tables, becuase these seem to be frequently erroneous.
mpadge commented 2 years ago

Re-opening because there is one pronounced negative effect on other routines. The routine implemented here appends _[0-9]+ values to the end of trip_id entries in the "stop_times" table. The internal routine used to construct timetables relies on matching "trip_id" values, and was modified here to grep matches via an lapply call, because there are commonly too many trip_id values to form a single compound grep pattern. This lapply(ptn, grep) call is, however, really slow, so needs to be replaced with a better way of mapping potentially modified trip_id values back to original, unmodified forms.

mpadge commented 2 years ago

That mention refers to several commits which included erroneous reference to that issue instead of this one

mpadge commented 2 years ago

TODO:

mpadge commented 2 years ago

The above commit converts the C++ code to use vectors throughout instead of nested lists, and speeds up the sample feed by a factor of around 5, so was definitely worth doing.

TODO:

mpadge commented 2 years ago

The above commits should be enough to close this issue. @dhersz Here is a reprex applying the gtfstools routine, and this new C++ one, to the ultimate frequencies feed from Santiago, Chile (almost entirely frequency-based):

library (gtfsrouter)
packageVersion ("gtfsrouter")
#> [1] '0.0.5.123'
path <- "<path>/<to>/santiago-gtfs.zip"
gtfstools_gtfs <- gtfstools::read_gtfs(path)
#> Registered S3 method overwritten by 'gtfsio':
#>   method       from      
#>   summary.gtfs gtfsrouter
gtfsrouter_gtfs <- gtfsrouter::extract_gtfs(path, quiet = TRUE)
#> Warning: This feed contains no transfers.txt 
#>   A transfers.txt table may be constructed with the 'gtfs_transfer_table' function

system.time (
    cvt_gtfstools_gtfs <- gtfstools::frequencies_to_stop_times(gtfstools_gtfs)
    )
#>    user  system elapsed 
#>  65.897   0.434  61.148
nrow (cvt_gtfstools_gtfs$stop_times)
#> [1] 6636076

system.time (
    cvt_gtfsrouter_gtfs <- gtfsrouter::frequencies_to_stop_times(gtfsrouter_gtfs)
)
#>    user  system elapsed 
#>  10.398   0.060  10.423
nrow (cvt_gtfsrouter_gtfs$stop_times)
#> [1] 6262567

Created on 2022-10-05 with reprex v2.0.2

And this new function is around 6 times faster than gtfstools. I also include the sizes of new timetables generated to also indicate that there is something awry with the gtfstools size. This code confirms that the gtfsrouter value is indeed what should be expected (this is taken from the R code here, which relies on using/abusing the permitted "timepoint" flag on GTFS timetables to indicate whether entries are in the "frequencies" table or not):

library (gtfsrouter)
path <- "<path>/<to>/santiago-gtfs.zip"
gtfs <- gtfsrouter::extract_gtfs(path, quiet = TRUE)
gtfs$frequencies [, start_time := rcpp_time_to_seconds (start_time)]
gtfs$frequencies [, end_time := rcpp_time_to_seconds (end_time)]
gtfs$stop_times$timepoint <- 1L
freq_trips <- unique (gtfs$frequencies$trip_id)
gtfs$stop_times$timepoint [which (gtfs$stop_times$trip_id %in% freq_trips)] <- 0L

f_stop_times <- gtfs$stop_times [gtfs$stop_times$timepoint == 0L, ]
gtfs$stop_times <- gtfs$stop_times [gtfs$stop_times$timepoint == 1L, ]

freqs <- gtfs$frequencies
freqs$nseq <- ceiling ((freqs$end_time - freqs$start_time) / freqs$headway_secs)
n <- sum (freqs$nseq)
# plus total numbers of timetable entries:
trip_id_table <- table (f_stop_times$trip_id)
index <- match (freqs$trip_id, names (trip_id_table))
freqs$num_tt_entries <- trip_id_table [index]

num_tt_entries_exp <- sum (freqs$num_tt_entries * freqs$nseq)
num_tt_entries_exp + nrow (gtfs$stop_times)
#> [1] 6262567

Created on 2022-10-05 with reprex v2.0.2

I'll leave open for a few more days, in case anyone wants to try this out and report back, otherwise i'll close soon. Thanks for all the input!

dhersz commented 1 year ago

@mpadge thanks for looking into this issue so thoroughly and sorry for such a late reply - I had been working on another project the last few months and had a real hard time switching contexts. The new gtfsrouter function looks great and I'm glad it's so much faster than gtfstools' and gtfsrouter's previous one! It's great that our packages are pushing each other to be faster and more reliable. I'll try to play with Santiago's GTFS to see why there such a big discrepancy between the results of the two functions.

@dhersz Our two packages do not give identical results, but largely because this procedure requires a few seemingly arbitrary decisions to be made. These notably include:

1. What to to if "stop_time" for a "frequencies" entry is marginally less than some integral of "start_time + n * headway" - this does happen, and I've opted to include an addiitonal trip in those cases.

2. What do to when "stop_times" entries for "trip_id" values which are in the frequencies table have "arrival/departure_time" values of "0", yet "trip_id" values flagged in "frequencies" tables with "exact_times" values of 1. These latter values of 1 are supposed to indicate exact arrival/departure times, which should be used as given, but obviously arrival/departure time values of 0 are nonsense. I've opted to ignore "exact_times" flags in "frequencies" tables, becuase these seem to be frequently erroneous.

If I understand correctly, point 1 can be illustrated with the following example: trip A departs every 7 minutes from 6:00 to 6:30. Assuming departure times should be treated "exactly", we would have a trip at 6:00, 6:07, 6:14, 6:21 amd 6:28. Your solution includes adding a trip departing at 6:30, is that correct?

Currently the gtfstools function also ignores exact_times values. I've reported this in a gtfstools issue (https://github.com/ipeaGIT/gtfstools/issues/56) in which I also propose some "strategies" to deal with exact_times = 0. I'm curious, however, of what you mean by arrival/departure_time values of 0. Are these when these fields are empty, and only the departure/arrival time of the first and last stop of a given trip are explicit, or literally values such as 00:00:00?

mpadge commented 1 year ago

No worries @dhersz, any reply is greatly appreciated no matter how long it might take. So thanks!

If I understand correctly, point 1 can be illustrated with the following example: trip A departs every 7 minutes from 6:00 to 6:30. Assuming departure times should be treated "exactly", we would have a trip at 6:00, 6:07, 6:14, 6:21 amd 6:28. Your solution includes adding a trip departing at 6:30, is that correct?

Yep, that's is precisely what happens, and that is exactly the solution I've opted for.

Currently the gtfstools function also ignores exact_times values. I've reported this in a gtfstools issue (ipeaGIT/gtfstools#56) in which I also propose some "strategies" to deal with exact_times = 0. I'm curious, however, of what you mean by arrival/departure_time values of 0. Are these when these fields are empty, and only the departure/arrival time of the first and last stop of a given trip are explicit, or literally values such as 00:00:00?

Departure values for the first stops of any service described in frequency tables should all be "00:00:00", and generally are. The subsequent times then just specify the travel times from that first stop. And those trips should then have "exact_times = 0" values in the frequencies table, but sometimes have "exact_times = 1". That does not make any sense (excepting of course the extremely rare case where an trip does indeed start at exactly that time). So my proposal is simply to ignore the "exact_times" flags completely. The entries in the frequencies table can be used to populate timetables, and any other entries in "stop_times" that specify times outside the windows given in "frequencies" will still be picked up as effectively having "exact_times = 1". So that flag is really not necessary anyway. Does that make sense?

dhersz commented 1 year ago

No worries @dhersz, any reply is greatly appreciated no matter how long it might take. So thanks!

If I understand correctly, point 1 can be illustrated with the following example: trip A departs every 7 minutes from 6:00 to 6:30. Assuming departure times should be treated "exactly", we would have a trip at 6:00, 6:07, 6:14, 6:21 amd 6:28. Your solution includes adding a trip departing at 6:30, is that correct?

Yep, that's is precisely what happens, and that is exactly the solution I've opted for.

Cool. I don't explicitly add trips at the end_time defined in the frequencies table as you do, but the end_time of a specific frequencies entry frequently is the start_time of the subsequent entry, in which I add a trip. So I don't think that explains such a large difference between the outputs.

Currently the gtfstools function also ignores exact_times values. I've reported this in a gtfstools issue (ipeaGIT/gtfstools#56) in which I also propose some "strategies" to deal with exact_times = 0. I'm curious, however, of what you mean by arrival/departure_time values of 0. Are these when these fields are empty, and only the departure/arrival time of the first and last stop of a given trip are explicit, or literally values such as 00:00:00?

Departure values for the first stops of any service described in frequency tables should all be "00:00:00", and generally are. The subsequent times then just specify the travel times from that first stop. And those trips should then have "exact_times = 0" values in the frequencies table, but sometimes have "exact_times = 1". That does not make any sense (excepting of course the extremely rare case where an trip does indeed start at exactly that time). So my proposal is simply to ignore the "exact_times" flags completely. The entries in the frequencies table can be used to populate timetables, and any other entries in "stop_times" that specify times outside the windows given in "frequencies" will still be picked up as effectively having "exact_times = 1". So that flag is really not necessary anyway. Does that make sense?

I think we have different interpretations of what exact_times = 0/1 means.

First, I didn't know that that the departure time at the first stop of a trip described in the frequencies table should be 00:00:00. I checked the specification and there doesn't seem to be any spec about that in it, but there's in fact a recommendation made at GTFS Best Practices, by MobilityData. Cool! To be honest, in my experience I've never seen feeds abiding to this practice, but it's good to know about it.

In my understanding, the exact_times value controls whether the headway is precisely taken into account, not the travel times between stops. Again in my understanding, the travel time between stops should always be considered exactly as described (like your implementation does consider), and whether this departure and arrival times are in practice strictly adhered to is controlled by the timepoint field.

Quoting the spec:

Frequency-based service (exact_times=0) in which service does not follow a fixed schedule throughout the day. Instead, operators attempt to strictly maintain predetermined headways for trips. A compressed representation of schedule-based service (exact_times=1) that has the exact same headway for trips over specified time period(s). In schedule-based service operators try to strictly adhere to a schedule.

So, for example, a trip whose frequencies entry is start_time = 04:00:00, end_time = 05:00:00, headway = 12 minutes and exact_times = 1 should be the equivalent of trips departing at the first stop at 04:00, 04:12, 04:24, ..., 04:48, 05:00.

If the same entry had exact_times = 0 the operators would not necessarily be able to maintain the 12 minutes headway exactly, so we could have trips departing at, let's say 04:00:15, 04:13:00, 04:24:30, ..., 04:47:00, 04:59:50. The gtfstools issue I linked above refers exactly to this "random" aspect of exact_times = 0, and possible strategies to use when this value occurs.

Summarizing all of this, I'd like to re-state that this is my interpretation, and would like to know if you agree with it. Currently I think both our implementations should yield near identical results (with the exception of same entries whose subsequent frequencies entry does not start exactly when the previous one stops), so I'll investigate further the difference between their results.

dhersz commented 1 year ago

I haven't yet tested with Santiago's GTFS, but I'm playing with SPTrans' GTFS (shipped with gtfstools) and I'm already seeing some discrepancies between gtfsrouter and gtfstools function. Opposite to Santiago's case, however, in my tests gtfsrouter function is returning a much larger stop_times table.

Some bugs/problems I'm currently seeing with the functions:

data_path <- system.file("extdata/spo_gtfs.zip", package = "gtfstools")

gtfs <- gtfsrouter::extract_gtfs(data_path)
#> ▶ Unzipping GTFS archive✔ Unzipped GTFS archive
#> Warning: This feed contains no transfers.txt 
#>   A transfers.txt table may be constructed with the 'gtfs_transfer_table' function
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed 
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds

converted <- gtfsrouter::frequencies_to_stop_times(gtfs)

head(converted$stop_times)
#>          trip_id arrival_time departure_time stop_id stop_sequence
#> 1: CPTM L07-0_f0        28800          28800   18940             1
#> 2: CPTM L07-0_f0        29280          29280   18920             2
#> 3: CPTM L07-0_f0        29760          29760   18919             3
#> 4: CPTM L07-0_f0        30240          30240   18917             4
#> 5: CPTM L07-0_f0        30720          30720   18916             5
#> 6: CPTM L07-0_f0        31200          31200   18965             6
head(converted$frequencies)
#>       trip_id start_time end_time headway_secs
#> 1: CPTM L07-0      14400    17940          720
#> 2: CPTM L07-0      18000    21540          360
#> 3: CPTM L07-0      21600    25140          360
#> 4: CPTM L07-0      25200    28740          360
#> 5: CPTM L07-0      28800    32340          360
#> 6: CPTM L07-0      32400    35940          480

converted$stop_times[trip_id == "CPTM L07-0"]
#> Empty data.table (0 rows and 5 cols): trip_id,arrival_time,departure_time,stop_id,stop_sequence
head(converted$stop_times, 36)
#>           trip_id arrival_time departure_time stop_id stop_sequence
#>  1: CPTM L07-0_f0        28800          28800   18940             1
#>  2: CPTM L07-0_f0        29280          29280   18920             2
#>  3: CPTM L07-0_f0        29760          29760   18919             3
#>  4: CPTM L07-0_f0        30240          30240   18917             4
#>  5: CPTM L07-0_f0        30720          30720   18916             5
#>  6: CPTM L07-0_f0        31200          31200   18965             6
#>  7: CPTM L07-0_f0        31680          31680   18923             7
#>  8: CPTM L07-0_f0        32160          32160   18922             8
#>  9: CPTM L07-0_f0        32640          32640 4114459             9
#> 10: CPTM L07-0_f0        33120          33120   18921            10
#> 11: CPTM L07-0_f0        33600          33600   18924            11
#> 12: CPTM L07-0_f0        34080          34080   18925            12
#> 13: CPTM L07-0_f0        34560          34560   18926            13
#> 14: CPTM L07-0_f0        35040          35040   18971            14
#> 15: CPTM L07-0_f0        35520          35520   18972            15
#> 16: CPTM L07-0_f0        36000          36000   18973            16
#> 17: CPTM L07-0_f0        36480          36480   18974            17
#> 18: CPTM L07-0_f0        36960          36960   18975            18
#> 19: CPTM L07-0_f0        32400          32400   18940             1
#> 20: CPTM L07-0_f0        32880          32880   18920             2
#> 21: CPTM L07-0_f0        33360          33360   18919             3
#> 22: CPTM L07-0_f0        33840          33840   18917             4
#> 23: CPTM L07-0_f0        34320          34320   18916             5
#> 24: CPTM L07-0_f0        34800          34800   18965             6
#> 25: CPTM L07-0_f0        35280          35280   18923             7
#> 26: CPTM L07-0_f0        35760          35760   18922             8
#> 27: CPTM L07-0_f0        36240          36240 4114459             9
#> 28: CPTM L07-0_f0        36720          36720   18921            10
#> 29: CPTM L07-0_f0        37200          37200   18924            11
#> 30: CPTM L07-0_f0        37680          37680   18925            12
#> 31: CPTM L07-0_f0        38160          38160   18926            13
#> 32: CPTM L07-0_f0        38640          38640   18971            14
#> 33: CPTM L07-0_f0        39120          39120   18972            15
#> 34: CPTM L07-0_f0        39600          39600   18973            16
#> 35: CPTM L07-0_f0        40080          40080   18974            17
#> 36: CPTM L07-0_f0        40560          40560   18975            18
#>           trip_id arrival_time departure_time stop_id stop_sequence
head(converted$trips)
#>    route_id service_id    trip_id trip_headsign direction_id shape_id
#> 1: CPTM L07        USD CPTM L07-0       JUNDIAI            0    17846
#> 2: CPTM L07        USD CPTM L07-1           LUZ            1    17847
#> 3: CPTM L08        USD CPTM L08-0  AMADOR BUENO            0    17848
#> 4: CPTM L08        USD CPTM L08-1 JULIO PRESTES            1    17849
#> 5: CPTM L09        USD CPTM L09-0        GRAJAU            0    17850
#> 6: CPTM L09        USD CPTM L09-1        OSASCO            1    17851
mpadge commented 1 year ago

Thanks @dhersz, I'm addressing most of those issues now. Before i start ...

gtfstools' is spending more than half of the total processing time converting times in seconds to times in HH:MM:SS format (i.e 0 -> "00:00:00", 3600 -> "01:00:00"). This function is really simple, so should definitely be optimized.

gtfsrouter just uses hms for that:

times <- as.integer (runif (1e6, 0, 3600 * 12 - 1))
b <- bench::mark (
    hms::hms (times),
    gtfstools:::seconds_to_string (times),
    check = FALSE,
    time_unit = "ms"
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
b [, 1:5]
#> # A tibble: 2 × 5
#>   expression                              min median `itr/sec` mem_alloc
#>   <bch:expr>                            <dbl>  <dbl>     <dbl> <bch:byt>
#> 1 hms::hms(times)                        1.48   1.67    328.      8.32MB
#> 2 gtfstools:::seconds_to_string(times) 961.   961.        1.04   135.7MB
b$median [2] / b$median [1]
#> [1] 575.8945

Created on 2022-12-14 with reprex v2.0.2

I'd suggest doing the same. C++ string manipulation like you've got in gtfstools is not efficient at all - like 600 times less efficient than hms (which is pure R!)

mpadge commented 1 year ago

TODO

@dhersz Note that the frequencies table is kept for the moment, because i need to figure out how it's going to be used to re-generate timetables based on random headways.