DistanceDevelopment / Distance

Simple distance sampling analysis
GNU General Public License v3.0
9 stars 8 forks source link

bootdht fails when cores > 1 and model object depends on objects in global environment #168

Closed lenthomas closed 1 week ago

lenthomas commented 1 year ago

Distance 1.0.7, R4.2.3

bootdht does not seem to pass variables from the global environment into the environments used when cores > 1. This is likely expected behaviour, but it causes problems for the code below, so perhaps something code be done to either facilitate passing the required objects, or raising an error message.

The scenario is that one fitted ds object is used to provide starting values for a second ds object, and this second ds object is passed in to bootdht as the model:

#Import data
DuikerCameraTraps <- read.csv(file="https://datadryad.org/stash/downloads/file_stream/73221", 
                              header=TRUE, sep="\t")
DuikerCameraTraps$object <- NA
DuikerCameraTraps$object[!is.na(DuikerCameraTraps$distance)] <- 1:sum(!is.na(DuikerCameraTraps$distance))
DuikerCameraTraps$Area <- DuikerCameraTraps$Area / (1000*1000)

#Fit detection function
library(Distance)
trunc.list <- list(left=2, right=15)
mybreaks <- c(2, 3, 4, 5, 6, 7, 8, 10, 12, 15)
#Note uni3 uses uni2 as starting values
uni2 <- ds(DuikerCameraTraps, transect = "point", key="unif", adjustment = "cos",
           nadj=2, cutpoints = mybreaks, truncation = trunc.list)
uni3 <- ds(DuikerCameraTraps, transect = "point", key="unif", adjustment = "cos",
           nadj=3, cutpoints = mybreaks, truncation = trunc.list, 
           initial_values = list(adjustment = c(as.numeric(uni2$ddf$par), 0)))

#Get bootstrap estimate of variance
#Summary statistic
mysummary <- function(ests, fit){
  return(data.frame(Label = ests$individuals$D$Label,
                    Dhat = ests$individuals$D$Estimate))
}
n.boot <- 3
#Single thread - 3 successes
#Note you get warnings, which it would be better if were turned off or
# stored, but not a big deal
set.seed(1235561)
daytime.boot.uni3 <- bootdht(model = uni3, flatfile = DuikerCameraTraps,
                         resample_transects = TRUE, nboot = n.boot, cores = 1,
                         summary_fun = mysummary)
daytime.boot.uni3

#Multi thread - all reps fail
set.seed(1235561)
daytime.boot.uni3.par <- bootdht(model = uni3, flatfile = DuikerCameraTraps,
                             resample_transects = TRUE, nboot = n.boot, cores = 3,
                             summary_fun = mysummary)
daytime.boot.uni3.par

When a version is run where start values are not set, it runs fine:

#Do a multi-thread run without setting start values - 3 successes
set.seed(1235561)
uni3.ownstart <- ds(DuikerCameraTraps, transect = "point", key="unif", adjustment = "cos",
           nadj=3, cutpoints = mybreaks, truncation = trunc.list)
daytime.boot.uni3.ownstart <- bootdht(model = uni3.ownstart, flatfile = DuikerCameraTraps,
                                 resample_transects = TRUE, nboot = n.boot, cores = 3,
                                 summary_fun = mysummary)
daytime.boot.uni3.ownstart

So the problem appears to be that uni3 depends on uni2 for its start values, and this causes an error inside the bootstrap when cores >1.

In the code below, I do a further check that it's not a problem with setting start values per se, by providing some manual start values.

#The above run is quite slow, so start hard-coding the uni2 starting values
print (c(as.numeric(uni2$ddf$par), 0))
set.seed(1235561)
uni3.hardstart <- ds(DuikerCameraTraps, transect = "point", key="unif", adjustment = "cos",
                    nadj=3, cutpoints = mybreaks, truncation = trunc.list,
                    initial_values = list(adjustment = c(0.97178179, 0.03541634, 0)))
daytime.boot.uni3.hardstart <- bootdht(model = uni3.hardstart, flatfile = DuikerCameraTraps,
                                      resample_transects = TRUE, nboot = n.boot, cores = 3,
                                      summary_fun = mysummary)
daytime.boot.uni3.hardstart

#Even faster if you use the uni3 values as starting values
print (as.numeric(uni3$ddf$par))
set.seed(1235561)
uni3.hardstart2 <- ds(DuikerCameraTraps, transect = "point", key="unif", adjustment = "cos",
                     nadj=3, cutpoints = mybreaks, truncation = trunc.list,
                     initial_values = list(adjustment = c(0.93541257, -0.05304261, -0.08043369)))
daytime.boot.uni3.hardstart2 <- bootdht(model = uni3.hardstart2, flatfile = DuikerCameraTraps,
                                       resample_transects = TRUE, nboot = n.boot, cores = 3,
                                       summary_fun = mysummary)
daytime.boot.uni3.hardstart2
lenthomas commented 1 year ago

@LHMarshall to look into how parallelization is done in dht vs dims and potentially change it. Also look briefly into other methods for getting objects passed through. If not fixable relatively quickly then simply document this issue in the bootdht help and leave this as a future enhancement.

LHMarshall commented 11 months ago

Check to see if eval/parse functions can force these starting values to be created within bootdht and see if this is a way to ensure that they are passed in to the parallel processes.

lenthomas commented 11 months ago

Hi @LHMarshall, regarding an update to the documentation for bootdht, we currently say (under "Parallelization"):

It is also hard to debug any issues in summary_fun so it is best to run a small number of bootstraps first in parallel to check that things work. On Windows systems summary_fun does not have access to the global environment when running in parallel, so all computations must be made using only its ests and fit arguments (i.e., you can not use R objects from elsewhere in that function, even if they are available to you from the console).

This implies summary_fun does have access to the global environment when running in non-Windows OSs. Is that actually correct?

Anyway, here's an additional suggested paragraph, to go after the one above (i.e., still under parallelization):

Another consequence of the global environment being unavailable inside parallel bootstraps is that any starting values in the model object passed in to bootdht must be hard coded (otherwise you get back 0 successful bootstraps). For a worked example showing this, see the camera trap distance sampling online example.

lenthomas commented 1 week ago

I am tempted to close this issue since there seems no real need to resolve it, and it seems like a big job. Comments @LHMarshall / @erex?

erex commented 1 week ago

Came about as a result of fiddling with starting values as I recall. If @lenthomas is happy with bootstrap performance for CTDS without starting values, then it can be closed.

lenthomas commented 1 week ago

Thanks @erex.

It is possible to have starting values in the parallelized bootstrap - they just need to be hard coded in. The reason for dependence on objects in the global environment is a deep-level R issue and will be hard to solve, for relatively small gain. Bootstrap analysis of the same data in our R CTDS example in DistWin is order(s) of magnitude faster, wheter the R version is run with MCDS.exe or the new build-in optimizer. This makes me think there are other reasons why the bootstrap in R is relatively slow -- likely to do with the amount of overhead involved in passing data around. This is something that could be looked into at some point in the future, so I'll raise a low priority enhancement issue about it and reference here.