Reduce memory consumption when loading transect & reduce runtime for fit.

PhilJd commented 1 year ago

This PR contains two changes to improve loading and fit for transect files. For a test file with 70 000 points on a 5 year old laptop

loading 70k points is now possible (~10 seconds, before the script ran out of memory with 40GB RAM)
fit time reduced from ~6 hours to ~ 20 minutes.

A smaller test file now loads in <1 second, compared to ~4 minutes before.

Note: The first commit adds a new dependency dbscan for the KD tree.

Description copied from the two commits:

Before this commit, spacing.traps materialized the full distance matrix between all points to then compute the nearest point. After this commit, the nearest neighbor is found directly using a KD tree, reducing memory from $O(n^2)$ to $O(n)$ and compute from $O(n^2)$ to $O(n \log(n))$ *n is number of points in the transect.

getxycpp is part of the inner loop of the fit function. For a relatively large number of points (~ 70 000 points) this reduces the runtime of the fit function from ~6 hours to ~20 minutes. (Runtime reduction for finding the upper bound of the point a certain distance away from $O(n)$ to $O(\log(n))$ )

MurrayEfford commented 1 year ago

Thanks for some very promising suggestions.

Did your concern about timing arise from a real analysis? The practical effect is (mostly) limited to 'transect' data for which I have no real-life examples (though I'm aware of one user with extensive bobcat data who might benefit).

The first commit concerns spacing.traps(), but that is not called from make.capthist as indicated, rather from functions that create traps objects (read.traps, make.grid etc.).

I'm hesitant about adding yet another dependency (dbscan) but can probably swallow that. The existing function nearestcpp could also be employed (with new argument to optionally block distance-to-self) - that would solve the memory issue without requiring dbscan, but without the full speed gain. Worth testing?

spacing.mask() is even more in need of speed up (masks may easily have 10000 points), so should apply same code there. I can do that later.

The second commit looks easy to accept - replaces my C++ code with a more efficient version.

I'm treading carefully because, as you can see, I have been the sole developer so far and I lack confidence with GitHub. I would add you as a contributor (ctb).

MurrayEfford commented 1 year ago

Note also that commit 1 fails in kNN when there are only 1 or two detectors, so need to deal separately with those cases. Certainly faster than nearestcpp.

PhilJd commented 1 year ago

Thanks a lot for the review and apologies for the late reply!

The concern for timing came from a friend's transect example with ~70k points. Before using knn, I've tried a mapply call with dist and min reduction per line, which resolved the memory issue but took about 4 minutes. However, using nearestcpp instead is indeed only marginally slower than knn at 16 seconds compared to 9 seconds for the KD tree. (I'm not really an R user, so wasn't aware how much slower it is compared to cpp ;))

Changes in the latest commit:

added non_zero argument to the nearestcpp call, replaced knn with nearestcpp, removed dbscan dependency
Switching to pass by const reference instead of copying NumericMatrix arguments in nearestcpp
adapted spacing.mask call

These changes should also address your comments for failures with only one or two detectors. If there is no point with non-zero distance, the result should be Numeric(0), is that the correct behavior?

MurrayEfford commented 1 year ago

Despite the appeal of kd-trees and their speed with large data (below), I think we should go with nearestcpp for long-term robustness, so I'll accept your PR. I may make some post-hoc tweaks for consistency etc. and add you (Philipp Jund) as ctb.

library(secr)
library(microbenchmark)

knnfn <- function(object) {
    # Make points unique to ensure the second neighbor is not identity.
    points = matrix(unlist(object), ncol=2)
    unique_points = mgcv::uniquecombs(points)
    # Retrieve two neighbors as the first result will be the query itself.
    knn_results = dbscan::kNN(unique_points, 2, points)
    # Choose the distance to the second neighbor for each point.
    sp = knn_results$dist[,2]
    median(sp)
}
# exporting cpp in working version of secr
spacing2 <- function (object)    {
    if (nrow(object)>1) {
        sp <- secr::nearestcpp(as.matrix(object), as.matrix(object), TRUE)$distance
        median(sp)
    }
    else
        numeric(0) # NA
}

trp <- make.grid(10,10)
nrow(trp)
[1] 100
msk <- make.mask(tr, spacing = 2)
nrow(msk)
[1] 36100

microbenchmark(
    spacing(trp, recalculate = TRUE),
    spacing2(trp), 
    knnfn(trp), 
    times = 100,
    unit = "milliseconds"
)
Unit: milliseconds
                            expr    min      lq     mean  median      uq    max neval
spacing(trp, recalculate = TRUE) 0.4643 0.58315 0.634861 0.62470 0.67200 1.0751   100
                   spacing2(trp) 0.0728 0.07790 0.099448 0.09345 0.10365 0.2331   100
                      knnfn(trp) 0.2563 0.28500 0.325351 0.30770 0.33455 0.8203   100

microbenchmark(
    spacing2(msk), 
    knnfn(msk), 
    times = 10,
    unit = "milliseconds"
)
Unit: milliseconds
         expr       min        lq       mean    median        uq       max neval
spacing2(msk) 1352.0774 1354.3101 1359.25407 1356.7759 1360.6635 1380.0798    10
   knnfn(msk)   62.9563   63.7364   65.33152   64.2828   65.5636   74.1578    10

MurrayEfford / secr

Reduce memory consumption when loading transect & reduce runtime for fit. #2