Closed FlxPo closed 5 months ago
Thanks @FlxPo. Strange that you're getting out-of-memory errors. That shouldn't happen, and definitely not from distance calculations, which are all internalised here in C++ code, and don't use geodist
at all. (It's only used in other functions, not transfer table calculation.) I routinely calculate transfer tables between over 30,000 stops and have no issues. I'll try on the feed you've linked to and see what i get ...
I just realized I was using an old version of gtfsrouter, I will do some tests to see if I still get an error.
No, the error is real. It comes from data.table
. I've already fixed it, and will push in a moment
Let me know if that's okay now. See Rdatatable/data.table issue 5676 for background - data.table
has a pile of issues, of which this is clearly one, and will slowly start moving towards being better maintained soon.
I'm always interested in opportunities for sf
comparisons, and so here it is:
library (gtfsrouter)
library(data.table)
transfer_table <- function(gtfs, d_limit = 200, crs = 2154) {
# ... your function from above ...
}
path <- "./feeds/helsinki.zip"
gtfs <- extract_gtfs (path)
#> ▶ Unzipping GTFS archive✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds
system.time (
t1 <- gtfs_transfer_table (gtfs, d_limit = 200)
)
#> user system elapsed
#> 2.131 0.004 2.055
system.time (
t2 <- transfer_table(gtfs, d_limit = 200)
)
#> user system elapsed
#> 95.689 0.449 95.685
Created on 2024-01-31 with reprex v2.1.0
So basically 50 times faster than sf
.
I just tested the latest development version on the [Helsinki GTFS](https://www.hsl.fi/en/hsl/open-data, is this the one you use ?). This is weird, my function seems a little bit faster (and I cannot reproduce the 50x difference).
library(gtfsrouter)
library(sf)
#> Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
library(data.table)
transfer_table <- function(gtfs, d_limit = 200, crs = 2154) {
# Convert the GTFS stops data.table to sf to be able perform spatial operations efficiently
stops_xy <- sfheaders::sf_point(gtfs$stops, x = "stop_lon", y = "stop_lat", keep = TRUE)
stops_xy$stop_index <- 1:nrow(stops_xy)
# Project the data according to the local CRS
st_crs(stops_xy) <- 4326
stops_xy <- st_transform(stops_xy, crs)
# Find which stops are within d_limit meters of each stop
stops_xy_buffer <- st_buffer(stops_xy, 400)
intersects <- st_intersects(stops_xy, stops_xy_buffer)
intersects <- lapply(seq_along(intersects), function(i) {
data.table(stop_index = i, neighbor_stop_index = intersects[[i]])
})
intersects <- rbindlist(intersects)
intersects <- intersects[stop_index != neighbor_stop_index]
# Compute the crow fly distance and travel times between neighboring stops
# (travel times formula is based on a linear model fitted on IDFM data)
stops_xy_coords <- as.data.table(st_coordinates(stops_xy))
stops_xy_coords[, stop_index := stops_xy$stop_index]
transfers <- merge(intersects, stops_xy_coords, by = "stop_index")
transfers <- merge(transfers, stops_xy_coords, by.x = "neighbor_stop_index", by.y = "stop_index", suffixes = c("_from", "_to"))
transfers[, distance := sqrt((Y_to - Y_from)^2 + (X_to - X_from)^2)]
transfers[, min_transfer_time := 31 + 1.125*distance]
transfers[, transfer_type := 2]
# Format the data as expected for the transfers table
stops_index_id <- as.data.table(st_drop_geometry(stops_xy[, c("stop_id", "stop_index")]))
transfers <- merge(transfers, stops_index_id, by = "stop_index")
transfers <- merge(transfers, stops_index_id, by.x = "neighbor_stop_index", by.y = "stop_index", suffixes = c("_from", "_to"))
setnames(transfers, c("stop_id_from", "stop_id_to"), c("from_stop_id", "to_stop_id"))
transfers <- transfers[, list(from_stop_id, to_stop_id, min_transfer_time)]
return(transfers)
}
gtfs_file_path <- "D:/data/mobility/data/gtfs/hsl.zip"
gtfs <- extract_gtfs(gtfs_file_path)
#> ▶ Unzipping GTFS archive
#> ✔ Unzipped GTFS archive
#> ▶ Extracting GTFS feed✔ Extracted GTFS feed
#> ▶ Converting stop times to seconds✔ Converted stop times to seconds
#> ▶ Converting transfer times to seconds✔ Converted transfer times to seconds
system.time (
t1 <- gtfs_transfer_table (gtfs, d_limit = 200)
)
#> utilisateur système écoulé
#> 4.62 0.03 4.66
system.time (
t2 <- transfer_table(gtfs, d_limit = 200)
)
#> utilisateur système écoulé
#> 1.76 0.12 1.90
Created on 2024-01-31 with reprex v2.1.0
Oh, the difference is that i commented out the st_transform(stops_xy, 2154)
line, because distances can and should be calculated directly on the ellipse in 4326 coordinates, which is what gtfsrouter does. Re-projecting on to a plane is obviously much faster, but there is no universal way to do that that won't decrease accuracy of the routine. If you remove that line (and add a d_limit <- unit::set_units(d_limit, m)
) then you should see similar times to my example.
Thanks for the precision. For the record I'm not sure I see why distances should be calculated from lat/lon coordinates (at least at a regional scale), and I wouldn't mind specifying the CRS for each of my projects.
I tried to use gtfs_transfer_table on a big GTFS dataset, with 46 000 stops (from IDFM, for the Ile-de-France region : https://eu.ftp.opendatasoft.com/stif/GTFS/IDFM-gtfs.zip). A transfer table is already provided, but i needed to recompute them to be able to merge this feed with other feeds.
For now gtfs_transfer_table seems to compute all pairwise distances between stops with geodist, which results in a out of memory error given the high number of stops.
Here is an alternative approach using sf which is quite fast, adapted to my needs but that could return exactly the same thing as gtfs_transfer_table :
It would require taking a dependency on sf.