UrbanAnalyst / gtfsrouter

Routing and analysis engine for GTFS (General Transit Feed Specification) data
https://urbananalyst.github.io/gtfsrouter/

Alternative approach to build a transfer table #114

Closed · FlxPo closed this issue 5 months ago

FlxPo commented 5 months ago

I tried to use gtfs_transfer_table on a big GTFS dataset with 46,000 stops (from IDFM, for the Île-de-France region: https://eu.ftp.opendatasoft.com/stif/GTFS/IDFM-gtfs.zip). A transfer table is already provided, but I needed to recompute it in order to merge this feed with other feeds.

For now, gtfs_transfer_table seems to compute all pairwise distances between stops with geodist, which results in an out-of-memory error given the high number of stops.
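
For a rough sense of scale (a back-of-the-envelope check only, not a diagnosis of where the allocation actually happens), a dense all-pairs distance matrix for a feed this size would not fit in memory on a typical machine:

n_stops <- 46000
n_stops^2 * 8 / 1e9  # ~17 GB for a dense double-precision distance matrix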

Here is an alternative approach using sf which is quite fast. It is adapted to my needs, but it could be made to return exactly the same thing as gtfs_transfer_table:

library(sf)
library(data.table)

transfer_table <- function(gtfs, d_limit = 200, crs = 2154) {

  # Convert the GTFS stops data.table to sf to be able to perform spatial operations efficiently
  stops_xy <- sfheaders::sf_point(gtfs$stops, x = "stop_lon", y = "stop_lat", keep = TRUE)
  stops_xy$stop_index <- 1:nrow(stops_xy)

  # Project the data according to the local CRS
  st_crs(stops_xy) <- 4326
  stops_xy <- st_transform(stops_xy, crs)

  # Find which stops are within d_limit meters of each stop
  stops_xy_buffer <- st_buffer(stops_xy, d_limit)

  intersects <- st_intersects(stops_xy, stops_xy_buffer)

  intersects <- lapply(seq_along(intersects), function(i) {
    data.table(stop_index = i, neighbor_stop_index = intersects[[i]])
  })

  intersects <- rbindlist(intersects)
  intersects <- intersects[stop_index != neighbor_stop_index]

  # Compute the crow fly distance and travel times between neighboring stops
  # (travel times formula is based on a linear model fitted on IDFM data)
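  # (i.e. a fixed 31 s penalty plus 1.125 s per metre, roughly a 0.89 m/s walking pace)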
  stops_xy_coords <- as.data.table(st_coordinates(stops_xy))
  stops_xy_coords[, stop_index := stops_xy$stop_index]

  transfers <- merge(intersects, stops_xy_coords, by = "stop_index")
  transfers <- merge(transfers, stops_xy_coords, by.x = "neighbor_stop_index", by.y = "stop_index", suffixes = c("_from", "_to"))

  transfers[, distance := sqrt((Y_to - Y_from)^2 + (X_to - X_from)^2)]
  transfers[, min_transfer_time := 31 + 1.125*distance]
  transfers[, transfer_type := 2]

  # Format the data as expected for the transfers table
  stops_index_id <- as.data.table(st_drop_geometry(stops_xy[, c("stop_id", "stop_index")]))

  transfers <- merge(transfers, stops_index_id, by = "stop_index")
  transfers <- merge(transfers, stops_index_id, by.x = "neighbor_stop_index", by.y = "stop_index", suffixes = c("_from", "_to"))

  setnames(transfers, c("stop_id_from", "stop_id_to"), c("from_stop_id", "to_stop_id"))

  transfers <- transfers[, list(from_stop_id, to_stop_id, min_transfer_time)]

  return(transfers)

}

It would require taking a dependency on sf.
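
A minimal usage sketch (the file path is illustrative, and it assumes the recomputed table is meant to replace the transfers element of the feed returned by extract_gtfs):

library(gtfsrouter)

gtfs <- extract_gtfs("IDFM-gtfs.zip")  # illustrative path to the downloaded feed
gtfs$transfers <- transfer_table(gtfs, d_limit = 200, crs = 2154)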

mpadge commented 5 months ago

Thanks @FlxPo. Strange that you're getting out-of-memory errors. That shouldn't happen, and definitely not from distance calculations, which are all internalised here in C++ code and don't use geodist at all. (It's only used in other functions, not in the transfer table calculation.) I routinely calculate transfer tables between over 30,000 stops and have no issues. I'll try it on the feed you've linked to and see what I get ...

FlxPo commented 5 months ago

I just realized I was using an old version of gtfsrouter, I will do some tests to see if I still get an error.

mpadge commented 5 months ago

No, the error is real. It comes from data.table. I've already fixed it and will push in a moment.

mpadge commented 5 months ago

Let me know if that's okay now. See Rdatatable/data.table issue 5676 for background: data.table has a pile of open issues, of which this is clearly one, but it should slowly start moving towards being better maintained soon.

mpadge commented 5 months ago

I'm always interested in opportunities for sf comparisons, and so here it is:

library (gtfsrouter)
library (sf) # needed by the transfer_table() function below
library(data.table)

transfer_table <- function(gtfs, d_limit = 200, crs = 2154) {
  # ... your function from above ...
}

path <- "./feeds/helsinki.zip"
gtfs <- extract_gtfs (path)
#> ✔ Unzipped GTFS archive
#> ✔ Extracted GTFS feed
#> ✔ Converted stop times to seconds
#> ✔ Converted transfer times to seconds

system.time (
    t1 <- gtfs_transfer_table (gtfs, d_limit = 200)
)
#>    user  system elapsed 
#>   2.131   0.004   2.055

system.time (
    t2 <- transfer_table(gtfs, d_limit = 200)
)
#>    user  system elapsed 
#>  95.689   0.449  95.685

Created on 2024-01-31 with reprex v2.1.0

So basically 50 times faster than sf.

FlxPo commented 5 months ago

I just tested the latest development version on the [Helsinki GTFS](https://www.hsl.fi/en/hsl/open-data) (is this the one you used?). This is weird: my function seems to be a little faster, and I cannot reproduce the 50x difference.

library(gtfsrouter)
library(sf)
#> Linking to GEOS 3.11.2, GDAL 3.7.2, PROJ 9.3.0; sf_use_s2() is TRUE
library(data.table)

transfer_table <- function(gtfs, d_limit = 200, crs = 2154) {
  # ... same function as above ...
}

gtfs_file_path <- "D:/data/mobility/data/gtfs/hsl.zip"

gtfs <- extract_gtfs(gtfs_file_path)
#> ✔ Unzipped GTFS archive
#> ✔ Extracted GTFS feed
#> ✔ Converted stop times to seconds
#> ✔ Converted transfer times to seconds

system.time (
  t1 <- gtfs_transfer_table (gtfs, d_limit = 200)
)
#>    user  system elapsed 
#>    4.62    0.03    4.66 

system.time (
  t2 <- transfer_table(gtfs, d_limit = 200)
)
#>    user  system elapsed 
#>    1.76    0.12    1.90 

Created on 2024-01-31 with reprex v2.1.0

mpadge commented 5 months ago

Oh, the difference is that I commented out the st_transform(stops_xy, 2154) line, because distances can and should be calculated directly on the ellipsoid in EPSG:4326 coordinates, which is what gtfsrouter does. Re-projecting onto a plane is obviously much faster, but there is no universal way to do that without decreasing the accuracy of the results. If you remove that line (and add a d_limit <- units::set_units(d_limit, m)), then you should see similar times to my example.
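
For concreteness, a sketch of that variant of the neighbour-finding step (assuming sf's s2 engine is enabled so that st_buffer accepts a metric distance on EPSG:4326 coordinates; the planar distance formula later in the function would then also need a geodesic replacement such as sf::st_distance):

library(sf)

# stay on the ellipsoid: no st_transform() to a projected CRS
stops_xy <- sfheaders::sf_point(gtfs$stops, x = "stop_lon", y = "stop_lat", keep = TRUE)
st_crs(stops_xy) <- 4326

d_limit <- units::set_units(200, m)
stops_xy_buffer <- st_buffer(stops_xy, d_limit)  # geodesic buffer computed by s2
intersects <- st_intersects(stops_xy, stops_xy_buffer)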

FlxPo commented 5 months ago

Thanks for the clarification. For the record, I'm not sure I see why distances need to be calculated from lat/lon coordinates (at least at a regional scale), and I wouldn't mind specifying the CRS for each of my projects.
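
One way to quantify that trade-off for a given region is to compare a geodesic distance with its projected counterpart; a quick sketch, using two arbitrary points in the Paris area rather than stops from the feed:

library(sf)

pts <- st_sfc(st_point(c(2.35, 48.85)), st_point(c(2.37, 48.86)), crs = 4326)

st_distance(pts[1], pts[2])                    # geodesic, on the ellipsoid
st_distance(st_transform(pts, 2154)[1],
            st_transform(pts, 2154)[2])        # planar, in Lambert-93 (EPSG:2154)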