Closed: PhilJd closed this 1 year ago.
Thanks for some very promising suggestions.
Did your concern about timing arise from a real analysis? The practical effect is (mostly) limited to 'transect' data for which I have no real-life examples (though I'm aware of one user with extensive bobcat data who might benefit).
The first commit concerns spacing.traps(), but that is not called from make.capthist as indicated; rather, it is called from the functions that create traps objects (read.traps, make.grid, etc.).
I'm hesitant about adding yet another dependency (dbscan), but can probably swallow that. The existing function nearestcpp could also be employed (with a new argument to optionally block distance-to-self); that would solve the memory issue without requiring dbscan, but without the full speed gain. Worth testing?
spacing.mask() is even more in need of a speed-up (masks may easily have 10000 points), so the same code should be applied there. I can do that later.
The second commit looks easy to accept: it replaces my C++ code with a more efficient version.
I'm treading carefully because, as you can see, I have been the sole developer so far and I lack confidence with GitHub. I would add you as a contributor (ctb).
Note also that commit 1 fails in kNN when there are only one or two detectors, so those cases need to be handled separately. Certainly faster than nearestcpp.
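To make the two points above concrete, here is a minimal C++ sketch (a hypothetical stand-in, not the actual nearestcpp code) of nearest-neighbour distances with index-based blocking of the distance-to-self, where fewer than two points yields an empty result rather than a failure:

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical stand-in for nearestcpp: for each point, the distance to its
// nearest *other* point (non_zero = true skips the zero distance to self,
// by index). Returns an empty vector when fewer than two points exist,
// since there is then no non-self neighbour.
std::vector<double> nearest_distances(const std::vector<std::pair<double, double>>& pts,
                                      bool non_zero = true) {
    std::vector<double> out;
    if (pts.size() < 2) return out;           // 0 or 1 detector: empty result
    for (size_t i = 0; i < pts.size(); ++i) {
        double best = std::numeric_limits<double>::infinity();
        for (size_t j = 0; j < pts.size(); ++j) {
            if (non_zero && j == i) continue; // block distance-to-self
            double dx = pts[i].first - pts[j].first;
            double dy = pts[i].second - pts[j].second;
            best = std::min(best, std::hypot(dx, dy));
        }
        out.push_back(best);
    }
    return out;
}
```

Note this still does O(n^2) distance computations, but only O(n) memory, since no full distance matrix is ever stored.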
Thanks a lot for the review and apologies for the late reply!
The concern about timing came from a friend's transect example with ~70k points. Before using knn, I tried a mapply call with dist and a min reduction per line, which resolved the memory issue but took about 4 minutes. However, using nearestcpp instead is indeed only marginally slower than knn: 16 seconds compared to 9 seconds for the KD tree. (I'm not really an R user, so I wasn't aware how much slower it is compared to C++ ;))
Changes in the latest commit:
- added a non_zero argument to the nearestcpp call
- replaced knn with nearestcpp, removed the dbscan dependency
- NumericMatrix arguments in nearestcpp

These changes should also address your comments about failures with only one or two detectors. If there is no point with non-zero distance, the result should be numeric(0); is that the correct behavior?
Despite the appeal of kd-trees and their speed with large data (below), I think we should go with nearestcpp for long-term robustness, so I'll accept your PR. I may make some post-hoc tweaks for consistency etc. and add you (Philipp Jund) as ctb.
library(secr)
library(microbenchmark)
knnfn <- function(object) {
  # Make points unique to ensure the second neighbor is not identity.
  points <- matrix(unlist(object), ncol = 2)
  unique_points <- mgcv::uniquecombs(points)
  # Retrieve two neighbors as the first result will be the query itself.
  knn_results <- dbscan::kNN(unique_points, 2, points)
  # Choose the distance to the second neighbor for each point.
  sp <- knn_results$dist[, 2]
  median(sp)
}
# exporting cpp in working version of secr
spacing2 <- function (object) {
  if (nrow(object) > 1) {
    sp <- secr::nearestcpp(as.matrix(object), as.matrix(object), TRUE)$distance
    median(sp)
  }
  else
    numeric(0) # NA
}
trp <- make.grid(10,10)
nrow(trp)
[1] 100
msk <- make.mask(trp, spacing = 2)
nrow(msk)
[1] 36100
microbenchmark(
spacing(trp, recalculate = TRUE),
spacing2(trp),
knnfn(trp),
times = 100,
unit = "milliseconds"
)
Unit: milliseconds
expr min lq mean median uq max neval
spacing(trp, recalculate = TRUE) 0.4643 0.58315 0.634861 0.62470 0.67200 1.0751 100
spacing2(trp) 0.0728 0.07790 0.099448 0.09345 0.10365 0.2331 100
knnfn(trp) 0.2563 0.28500 0.325351 0.30770 0.33455 0.8203 100
microbenchmark(
spacing2(msk),
knnfn(msk),
times = 10,
unit = "milliseconds"
)
Unit: milliseconds
expr min lq mean median uq max neval
spacing2(msk) 1352.0774 1354.3101 1359.25407 1356.7759 1360.6635 1380.0798 10
knnfn(msk) 62.9563 63.7364 65.33152 64.2828 65.5636 74.1578 10
This PR contains two changes to improve loading and fitting for transect files, tested with a file of 70 000 points on a 5-year-old laptop.
A smaller test file now loads in <1 second, compared to ~4 minutes before.
Note: the first commit adds a new dependency, dbscan, for the KD tree.
Description copied from the two commits:
Before this commit, spacing.traps materialized the full distance matrix between all points in order to find the nearest point. After this commit, the nearest neighbor is found directly using a KD tree, reducing memory from $O(n^2)$ to $O(n)$ and compute from $O(n^2)$ to $O(n \log n)$, where $n$ is the number of points in the transect.
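To illustrate how a KD tree achieves this, here is a toy 2-d k-d tree in C++ (a sketch of the data structure dbscan's kNN uses internally, not the package code): building costs $O(n \log n)$, each nearest-neighbour query is $O(\log n)$ on average, and no $n \times n$ distance matrix is ever materialised:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

struct Pt { double x[2]; };

// Toy 2-d k-d tree: the tree is stored implicitly in the reordered point
// array, with the median of each subrange acting as the splitting node.
struct KDTree {
    std::vector<Pt> pts;
    explicit KDTree(std::vector<Pt> p) : pts(std::move(p)) { build(0, pts.size(), 0); }

    void build(size_t lo, size_t hi, int axis) {
        if (hi - lo <= 1) return;
        size_t mid = (lo + hi) / 2;
        // Partition so the median on this axis sits at 'mid'.
        std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                         [axis](const Pt& a, const Pt& b) { return a.x[axis] < b.x[axis]; });
        build(lo, mid, 1 - axis);
        build(mid + 1, hi, 1 - axis);
    }

    // Distance from q to its nearest point at non-zero distance
    // (skipping the query point itself, as in the spacing calculation).
    double nearest(const Pt& q) const {
        double best = std::numeric_limits<double>::infinity();
        search(q, 0, pts.size(), 0, best);
        return best;
    }

    void search(const Pt& q, size_t lo, size_t hi, int axis, double& best) const {
        if (lo >= hi) return;
        size_t mid = (lo + hi) / 2;
        double d = std::hypot(q.x[0] - pts[mid].x[0], q.x[1] - pts[mid].x[1]);
        if (d > 0 && d < best) best = d;      // ignore zero distance to self
        double diff = q.x[axis] - pts[mid].x[axis];
        // Descend into the near half first; visit the far half only if the
        // splitting plane is closer than the best distance found so far.
        if (diff < 0) {
            search(q, lo, mid, 1 - axis, best);
            if (std::abs(diff) < best) search(q, mid + 1, hi, 1 - axis, best);
        } else {
            search(q, mid + 1, hi, 1 - axis, best);
            if (std::abs(diff) < best) search(q, lo, mid, 1 - axis, best);
        }
    }
};
```

The pruning step is what gives the speed-up: most subtrees are never visited, so the per-query work is logarithmic on average rather than linear.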
getxycpp is part of the inner loop of the fit function. For a relatively large number of points (~70 000), this reduces the runtime of the fit function from ~6 hours to ~20 minutes. (The runtime for finding the upper bound of the point a certain distance away drops from $O(n)$ to $O(\log n)$.)
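To illustrate that reduction (a sketch of the idea, not the actual getxycpp code): if cumulative distances along the transect are precomputed, they form a non-decreasing array, so the vertex just past a target distance can be found with binary search instead of a linear scan:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// cum[i] = cumulative distance along the transect up to vertex i
// (non-decreasing). Returns the index of the first vertex strictly beyond
// 'target': an upper bound found in O(log n) via binary search, replacing
// an O(n) linear scan over the vertices.
size_t vertex_past(const std::vector<double>& cum, double target) {
    auto it = std::upper_bound(cum.begin(), cum.end(), target);
    return static_cast<size_t>(it - cum.begin());
}
```

Because this lookup sits in the inner loop of the fit, the $O(n)$-to-$O(\log n)$ change compounds over every call, which is consistent with the hours-to-minutes runtime drop reported above.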