Closed mem48 closed 1 month ago
Definitely a good plan and I think geodists
is worth adding as a dependency. :+1: from me.
Sounds good @mem48. I've assigned you for now, I'm happy to take a look. Could ask the package author for his views, Michael and I have communicated on other matters.
@michaeldorman what are your thoughts on geodist
vs nngeo
for rapidly calculating distances between pairs of points?
Note I've used geodists
to great effect to calculate sequential (cumulative) distances along traces but don't think we need that here.
@Robinlovelace I think that geodist
is more specialized on pairwise distances between lon-lat points (using several possible methods) and more highly optimized. In nngeo
I tried to cover all possible cases of looking for nearest geometries, while using the most accurate method that's still fast:
geodist
)RANN
, for projected pointssf::st_distance
for all other casesIf you need all pairwise distances between lon-lat points perhaps geodist
is more suitable because it's more specialized to do that and probably more optimized. If you need to search for nearest geometries and/or allow cases other than lon-lat points then nngeo
may be suitable.
Here is a reproducible example showing that, for the lon-lat points case, results are identical and speed is comparable though a little faster in geodist
. (benchmark and equality test can be added to check more thoroughly).
Also please note that nngeo
returns ordered distances (and indices) from nearest to furthest; to get the pairwise distance matrix the result needs to be rearranged. geodist
returns the pairwise matrix as-is.
If I can help with anything please let me know.
# Sample data
library(sf)
n = 1000
x = data.frame(
lon = (runif(n) * 2 - 1) * 70,
lat = (runif(n) * 2 - 1) * 70
)
x = st_as_sf(x, coords = c("lon", "lat"), crs = 4326)
# nngeo
library(nngeo)
start = Sys.time()
result1 = st_nn(x, x, k = 1000, returnDist = TRUE, progress = FALSE)
end = Sys.time()
end - start
## Time difference of 1.580894 secs
# geodist
library(geodist)
start = Sys.time()
result2 = geodist(st_coordinates(x), st_coordinates(x), measure = "geodesic")
end = Sys.time()
end - start
## Time difference of 1.315116 secs
# Comapre
result1$dist[[1]] %>% head
## [1] 0.0 107054.3 213056.5 316968.2 413970.4 432763.1
result2[1,] %>% sort %>% head
## [1] 0.0 107054.3 213056.5 316968.2 413970.4 432763.1
Many thanks @michaeldorman that is very useful. It sounds like geodist
is more appropriate for this specific use case although I can think of applications of nngeo. In stplanr
I've used the nabor
package but it doesn't work with geographic projections so I may switch.
What about when you want to set a max distance, in the context of this points_to_od
function I think you would need to calcualte all distances with godists
and then subset to the ones you wanted. Does nngeo
do the same, or does the search stop over the maximum distance?
nngeo
does the same: all distances are calculated, those below the maxdist
are retained. The exception is when both inputs are projected points, in which case the RANN:nn2
method is used. I believe this method does not calculate all distances, and in any case the calculation is usually instantaneous.
Thanks for the clarification, I've used RANN to which is why I though you might implement something similar for lat/long.
Out of interest have you benchmarked projected vs geographic coordinates? I wonder if the max distance was low you might get a big performance gain by using RANN to get an approximate set, and then doing a more accurate second pass.
Also I believe nabor is faster than RANN
The help page for nngeo::st_nn
has examples with projected and geographic coordinates; haven't thoroughly benchmarked it, but RANN
is probably faster than any accurate calculation using geographic coordinates:
library(nngeo)
# Large example - Geo points
n = 1000
x = data.frame(
lon = (runif(n) * 2 - 1) * 70,
lat = (runif(n) * 2 - 1) * 70
)
x = st_as_sf(x, coords = c("lon", "lat"), crs = 4326)
start = Sys.time()
result1 = st_nn(x, x, k = 3)
end = Sys.time()
end - start
## Time difference of 2.405991 secs
# Large example - Proj points
n = 1000
x = data.frame(
lon = (runif(n) * 2 - 1) * 70,
lat = (runif(n) * 2 - 1) * 70
)
x = st_as_sf(x, coords = c("lon", "lat"), crs = 4326)
x = st_transform(x, 32630)
start = Sys.time()
result = st_nn(x, x, k = 3)
end = Sys.time()
end - start
## Time difference of 0.9784892 secs
There is also an option for parallel processing, which can help reduce the time of calculations using geographic coordinates:
# Large example - Geo points - Parallel processing
start = Sys.time()
result2 = st_nn(x, x, k = 3, parallel = 4)
end = Sys.time()
end - start
## Time difference of 1.429008 secs
Not sure I understand your idea to use RANN
to get an approximate set, and then doing a more accurate second pass, will be happy to hear any details.
I think it would be clearer if I explained the specific problem I'm trying to solve.
I want to create a travel time matrix for the UK; I have about 200,000 points so potentially 40 billion Origin-Destination pairs, which would be a very large data frame. For my use case, I'm only really interested in commuting, so I would like a 100 km distance cap, which should reduce the problem to a few 10's of millions, large but doable.
If I use geodist I would have to calculate all 40 billion distances and then subset. So I will probably run out of memory. But if I project and use RANN I can calculate far fewer values and not run out of memory.
So problem solved, as long as I have a good choice of National Grid. But suppose I now wanted to do the same thing globally, but with a similar short maximum distance. I'd have to use geographic coordinates.
But you could treat the lat/lng coordinates as simple x,y coordinates and pass them to RANN, the distances would be in degrees, not meters, and the accuracy would change around the world. But I thought that you might be able to take the bounding box for the input and do a simple calculation to convert the maximum distance in metres to a worst-case maximum distance in degrees. You could then pass that value to RANN and remove 90% of the possible measurements. Then use a more accurate method to do a second filter but on a much smaller set of values. Thus, preventing the problem from running out of memory.
This method would not work near the poles or 180-degree line of longitude, but I expect there are lots of real-world problems that would fit those restrictions. It also wouldn't be worth the effort except for the very largest datasets, but I think it would not be too hard to come up with some simple tests to decide whether to implement the method.
I tweaked your example datasets notice how big the time difference gets
library(nngeo)
library(sf)
# Large example - Geo points
n = 10000
x = data.frame(
lon = (runif(n) * 2 - 1) * 70,
lat = (runif(n) * 2 - 1) * 70
)
x = st_as_sf(x, coords = c("lon", "lat"), crs = 4326)
start = Sys.time()
result1 = st_nn(x, x, k = nrow(x), maxdist = 100000)
end = Sys.time()
end - start
## Time difference of 5.62486 mins
# Large example - Proj points
y = st_transform(x, 32630)
start = Sys.time()
result2 = st_nn(y, y, k = nrow(y), maxdist = 100000)
end = Sys.time()
end - start
## Time difference of 57.32071 secs
Great to see benchmarks, how does the second example compare with geodists
?
I understand, thank you for the clarification! Eventually whether it's worthwhile depends on cost-benefit of writing the function as opposed to waiting for the calculation to finish (and how often you need to do that calculation).
By the way, geodist
(and perhaps other libraries) have less accurate and faster distance metrics for geographic coordinates. Perhaps these can be used for the initial filtering, instead of RANN
, in order to solve the 180-degree line issue.
That sounds like a good plan @michaeldorman, accuracy isn't important in the first filter.
Reviving with reference to https://github.com/Robinlovelace/simodels/pull/33
Thanks @mem48 for nudge on this.
Happy to contribute this if you think it would be useful. The ideas is that you have points and want to make an OD datasets but cap the maximum distance considered. e.g. how we capped the PCT at 30 km.
Would need to add geodists as a dependancy.