ITSLeeds / od

An R package for working with origin-destination data
https://itsleeds.github.io/od/
GNU General Public License v3.0
33 stars 4 forks source link

maxdist argument for points_to_od #18

Closed mem48 closed 1 month ago

mem48 commented 4 years ago

Happy to contribute this if you think it would be useful. The ideas is that you have points and want to make an OD datasets but cap the maximum distance considered. e.g. how we capped the PCT at 30 km.

Would need to add geodists as a dependancy.

Robinlovelace commented 4 years ago

Definitely a good plan and I think geodists is worth adding as a dependency. :+1: from me.

mem48 commented 4 years ago

nngeo might be better, I'll have to benchmark

Robinlovelace commented 4 years ago

Sounds good @mem48. I've assigned you for now, I'm happy to take a look. Could ask the package author for his views, Michael and I have communicated on other matters.

@michaeldorman what are your thoughts on geodist vs nngeo for rapidly calculating distances between pairs of points?

Note I've used geodists to great effect to calculate sequential (cumulative) distances along traces but don't think we need that here.

michaeldorman commented 4 years ago

@Robinlovelace I think that geodist is more specialized on pairwise distances between lon-lat points (using several possible methods) and more highly optimized. In nngeo I tried to cover all possible cases of looking for nearest geometries, while using the most accurate method that's still fast:

  1. geographicLib C code, for lon-lat point (which is also the accurate option provided in geodist)
  2. RANN, for projected points
  3. sf::st_distance for all other cases

If you need all pairwise distances between lon-lat points perhaps geodist is more suitable because it's more specialized to do that and probably more optimized. If you need to search for nearest geometries and/or allow cases other than lon-lat points then nngeo may be suitable.

Here is a reproducible example showing that, for the lon-lat points case, results are identical and speed is comparable though a little faster in geodist. (benchmark and equality test can be added to check more thoroughly).

Also please note that nngeo returns ordered distances (and indices) from nearest to furthest; to get the pairwise distance matrix the result needs to be rearranged. geodist returns the pairwise matrix as-is.

If I can help with anything please let me know.

# Sample data
library(sf)
n = 1000
x = data.frame(
  lon = (runif(n) * 2 - 1) * 70,
  lat = (runif(n) * 2 - 1) * 70
)
x = st_as_sf(x, coords = c("lon", "lat"), crs = 4326)

# nngeo
library(nngeo)
start = Sys.time()
result1 = st_nn(x, x, k = 1000, returnDist = TRUE, progress = FALSE)
end = Sys.time()
end - start
## Time difference of 1.580894 secs
# geodist
library(geodist)
start = Sys.time()
result2 = geodist(st_coordinates(x), st_coordinates(x), measure = "geodesic")
end = Sys.time()
end - start
## Time difference of 1.315116 secs
# Comapre
result1$dist[[1]] %>% head
## [1]      0.0 107054.3 213056.5 316968.2 413970.4 432763.1
result2[1,] %>% sort %>% head
## [1]      0.0 107054.3 213056.5 316968.2 413970.4 432763.1
Robinlovelace commented 4 years ago

Many thanks @michaeldorman that is very useful. It sounds like geodist is more appropriate for this specific use case although I can think of applications of nngeo. In stplanr I've used the nabor package but it doesn't work with geographic projections so I may switch.

mem48 commented 4 years ago

What about when you want to set a max distance, in the context of this points_to_od function I think you would need to calcualte all distances with godists and then subset to the ones you wanted. Does nngeo do the same, or does the search stop over the maximum distance?

michaeldorman commented 4 years ago

nngeo does the same: all distances are calculated, those below the maxdist are retained. The exception is when both inputs are projected points, in which case the RANN:nn2 method is used. I believe this method does not calculate all distances, and in any case the calculation is usually instantaneous.

mem48 commented 4 years ago

Thanks for the clarification, I've used RANN to which is why I though you might implement something similar for lat/long.

Out of interest have you benchmarked projected vs geographic coordinates? I wonder if the max distance was low you might get a big performance gain by using RANN to get an approximate set, and then doing a more accurate second pass.

mem48 commented 4 years ago

Also I believe nabor is faster than RANN

michaeldorman commented 4 years ago

The help page for nngeo::st_nn has examples with projected and geographic coordinates; haven't thoroughly benchmarked it, but RANN is probably faster than any accurate calculation using geographic coordinates:

library(nngeo)

# Large example - Geo points
n = 1000
x = data.frame(
  lon = (runif(n) * 2 - 1) * 70,
  lat = (runif(n) * 2 - 1) * 70
)
x = st_as_sf(x, coords = c("lon", "lat"), crs = 4326)
start = Sys.time()
result1 = st_nn(x, x, k = 3)
end = Sys.time()
end - start
## Time difference of 2.405991 secs

# Large example - Proj points
n = 1000
x = data.frame(
  lon = (runif(n) * 2 - 1) * 70,
  lat = (runif(n) * 2 - 1) * 70
)
x = st_as_sf(x, coords = c("lon", "lat"), crs = 4326)
x = st_transform(x, 32630)
start = Sys.time()
result = st_nn(x, x, k = 3)
end = Sys.time()
end - start
## Time difference of 0.9784892 secs

There is also an option for parallel processing, which can help reduce the time of calculations using geographic coordinates:

# Large example - Geo points - Parallel processing
start = Sys.time()
result2 = st_nn(x, x, k = 3, parallel = 4)
end = Sys.time()
end - start
## Time difference of 1.429008 secs

Not sure I understand your idea to use RANN to get an approximate set, and then doing a more accurate second pass, will be happy to hear any details.

mem48 commented 4 years ago

I think it would be clearer if I explained the specific problem I'm trying to solve.

I want to create a travel time matrix for the UK; I have about 200,000 points so potentially 40 billion Origin-Destination pairs, which would be a very large data frame. For my use case, I'm only really interested in commuting, so I would like a 100 km distance cap, which should reduce the problem to a few 10's of millions, large but doable.

If I use geodist I would have to calculate all 40 billion distances and then subset. So I will probably run out of memory. But if I project and use RANN I can calculate far fewer values and not run out of memory.

So problem solved, as long as I have a good choice of National Grid. But suppose I now wanted to do the same thing globally, but with a similar short maximum distance. I'd have to use geographic coordinates.

But you could treat the lat/lng coordinates as simple x,y coordinates and pass them to RANN, the distances would be in degrees, not meters, and the accuracy would change around the world. But I thought that you might be able to take the bounding box for the input and do a simple calculation to convert the maximum distance in metres to a worst-case maximum distance in degrees. You could then pass that value to RANN and remove 90% of the possible measurements. Then use a more accurate method to do a second filter but on a much smaller set of values. Thus, preventing the problem from running out of memory.

This method would not work near the poles or 180-degree line of longitude, but I expect there are lots of real-world problems that would fit those restrictions. It also wouldn't be worth the effort except for the very largest datasets, but I think it would not be too hard to come up with some simple tests to decide whether to implement the method.

I tweaked your example datasets notice how big the time difference gets

library(nngeo)
library(sf)

# Large example - Geo points
n = 10000
x = data.frame(
  lon = (runif(n) * 2 - 1) * 70,
  lat = (runif(n) * 2 - 1) * 70
)
x = st_as_sf(x, coords = c("lon", "lat"), crs = 4326)
start = Sys.time()
result1 = st_nn(x, x, k = nrow(x), maxdist = 100000)
end = Sys.time()
end - start
## Time difference of 5.62486 mins

# Large example - Proj points
y = st_transform(x, 32630)
start = Sys.time()
result2 = st_nn(y, y, k = nrow(y), maxdist = 100000)
end = Sys.time()
end - start
## Time difference of 57.32071 secs
Robinlovelace commented 4 years ago

Great to see benchmarks, how does the second example compare with geodists?

michaeldorman commented 4 years ago

I understand, thank you for the clarification! Eventually whether it's worthwhile depends on cost-benefit of writing the function as opposed to waiting for the calculation to finish (and how often you need to do that calculation).

By the way, geodist (and perhaps other libraries) have less accurate and faster distance metrics for geographic coordinates. Perhaps these can be used for the initial filtering, instead of RANN, in order to solve the 180-degree line issue.

Robinlovelace commented 4 years ago

That sounds like a good plan @michaeldorman, accuracy isn't important in the first filter.

Robinlovelace commented 1 month ago

Reviving with reference to https://github.com/Robinlovelace/simodels/pull/33

Thanks @mem48 for nudge on this.