Closed erex closed 3 years ago
To see if bootstrap problems existed with other data sets besides minkes, I tried to bootstrap the amakihi. I wanted to use the vanilla bootstrap summary function; this requires adding a dummy value for "Area" so that an abundance estimate is produced as well as a density estimate.
library(Distance)
data("amakihi")
amakihi$Area <- 1
trunc <- 82.5
conv <- convert_units("meter", NULL, "hectare")
plain <- ds(data=amakihi, transect = "point", convert.units = conv, key="hr", truncation = trunc)
bootplain <- bootdht(model=plain, flatfile=amakihi, convert.units = conv, nboot=10)
I must be doing something fundamentally wrong with the above code because bootdht
runs instantaneously, but produces only this output.
> summary(bootplain)
Bootstrap results
Boostraps : 10
Successes : 0
Failures : 10
data frame with 0 columns and 0 rows
> str(bootplain)
List of 1
$ c.NA..NA..NA..NA..NA..NA..NA..NA..NA..NA.: logi [1:10] NA NA NA NA NA NA ...
- attr(*, "class")= chr "dht_bootstrap"
- attr(*, "row.names")= int [1:10] 1 2 3 4 5 6 7 8 9 10
- attr(*, "nboot")= num 10
- attr(*, "nbootfail")= num 10
The amakihi case is a different issue to the former. With the latter, the issue is that you have named the truncation distance trunc
and in looking through and assigning model components, bootdht
assigns the truncation
argument to be the function trunc()
rather than your object. Renaming to trunc1
will fix this for your code but I will find a work-around.
This second issue should be fixed in 952b2de
thanks Dave, spectacularly poor choice of variable name on my part. I couldn't spot why bootdht was executed without hesitation or complaint.
altering the variable name from trunc
to trunc1
does indeed result in the bootstrap being performed. It does however, appear the HR model is difficult to fit for the amakihi. Here is the quick result of 200 replicate bootstraps fitted to the HR model without covariates for the full amakihi data set (ignoring the 7 survey period "strata", providing estimates for total across surveys):
> summary(moreplain)
Bootstrap results
Boostraps : 200
Successes : 200
Failures : 0
median mean se lcl ucl cv
Nhat 7.36 12.47 14.55 4.14 55.18 1.98
Distribution of the 200 estimates:
Contrast with the results table from dht
:
> plain$dht$individuals$N
Label Estimate se cv lcl ucl df
1 1292 4.981627 0.5390201 0.10820163 4.019703 6.173741 83.90830
2 194 5.971009 0.5719030 0.09577996 4.941104 7.215584 112.84238
3 493 8.019204 0.6787220 0.08463708 6.787357 9.474622 169.90478
4 494 8.362016 0.6415304 0.07671958 7.188253 9.727443 165.73773
5 495 7.543606 0.8411892 0.11151022 6.046346 9.411635 79.55191
6 792 5.068415 0.4160937 0.08209544 4.311928 5.957619 193.89885
7 793 7.152193 0.6103186 0.08533307 6.044831 8.462414 160.42558
8 Total 47.098070 3.1363378 0.06659164 41.336648 53.662509 1439.72791
This second issue should be fixed in 952b2de
In general, is it possible to check for "reserved words" - e.g., key object names that already exist - in functions so they return a helpful error message in such cases? Perhaps that's the fix you implemented @dill?
Short bootstrap experiment with amakihi using "preferred" model of Marques et al. (2007): HR(OBs+MAS) (still ignoring the multi-strata nature of the survey):
auk <- ds(data=amakihi, transect = "point", convert.units = conv,
key="hr", formula=~OBs+MAS, truncation = trunc1)
bootauk <- bootdht(model=auk, flatfile=amakihi, convert.units = conv,
nboot=150, cores = 3)
> summary(bootauk)
Bootstrap results
Boostraps : 150
Successes : 142
Failures : 8
median mean se lcl ucl cv
Nhat 7.11 11.91 13.75 4.03 52.6 1.94
against the estimates from dht
> auk$dht$individuals$N
Label Estimate se cv lcl ucl df
1 1292 5.769804 0.7369180 0.12771977 4.486156 7.420748 130.8300
2 194 4.784727 0.4626806 0.09669947 3.952403 5.792328 114.5678
3 493 6.732407 0.5598072 0.08315112 5.714966 7.930984 172.1480
4 494 7.322283 0.6146885 0.08394766 6.206335 8.638886 180.9466
5 495 7.659768 0.7527264 0.09827014 6.307891 9.301371 115.3648
6 792 7.275320 1.0825288 0.14879467 5.439988 9.729851 522.2015
7 793 8.520232 1.0150499 0.11913408 6.739314 10.771772 156.8546
8 Total 48.064539 3.2555628 0.06773315 42.090882 54.885994 1448.8640
Bootstrap of amakihi, pulling out survey-specific replicate density estimates:
auk <- ds(data=amakihi, transect = "point", convert.units = conv,
key="hr", formula=~OBs+MAS, truncation = trunc1)
bob3 <- function(ests, fit) {
return(list(
data.frame(dec92=ests$individuals$D$Estimate[1]),
data.frame(jan94=ests$individuals$D$Estimate[2]),
data.frame(apr93=ests$individuals$D$Estimate[3]),
data.frame(apr94=ests$individuals$D$Estimate[4]),
data.frame(apr95=ests$individuals$D$Estimate[5]),
data.frame(jul92=ests$individuals$D$Estimate[6]),
data.frame(jul93=ests$individuals$D$Estimate[7]),
data.frame(total=ests$individuals$D$Estimate[8])))
}
bootauk.bysurvey <- bootdht(model=auk, flatfile=amakihi, convert.units = conv,
summary_fun = bob3, nboot=200, cores = 3)
dec92 <- unlist(bootauk.bysurvey[1:198])
jan94 <- unlist(bootauk.bysurvey[199:396])
apr93 <- unlist(bootauk.bysurvey[397:594])
apr94 <- unlist(bootauk.bysurvey[595:790])
apr95 <- unlist(bootauk.bysurvey[793:992])
jul92 <- unlist(bootauk.bysurvey[993:1188])
jul93 <- unlist(bootauk.bysurvey[1189:1386])
total <- unlist(bootauk.bysurvey[1387:1584])
par(mfrow=c(4,2))
hist(dec92, breaks=30)
hist(jan94, breaks=30)
hist(apr93, breaks=30)
hist(apr94, breaks=30)
hist(apr95, breaks=30)
hist(jul92, breaks=30)
hist(jul93, breaks=30)
hist(total, breaks=30)
Behaviour here seems much less strange than when examining distribution of abundance estimates not respecting strata-specific estimates.
Thanks for these additional details @erex. I'll try to set aside a few days in the near future to get to grips with this issue (which I think is two separate things: detection function fitting and stratification).
@lenthomas it's not really an issue of reserved words as such, it's the environment system/search tree in R and where it will look for given names to find the correct variable, given a name. The change ensures that bootdht
looks for variables in the global environment, not in the function/apply environments first. S
@lenthomas it's not really an issue of reserved words as such, it's the environment system/search tree in R and where it will look for given names to find the correct variable, given a name. The change ensures that
bootdht
looks for variables in the global environment, not in the function/apply environments first. S
Thanks @dill. You know far more about this than me -- but for my interest, does this mean that if the user has a variable in their global environment with the same name as one you have created in the function environment that there is the potential for them to break your code? I guess I should be less lazy and actually look at your code and relate to my knowledge of R scoping...
So that shouldn't happen in general but the issue is this block here where I am tracking down the arguments supplied to the original detection function by name. Ensuring that we're looking for the names in the right places is really important here and that's where the issue you highlight can be an issue. The addition of parent.frame(n=3)
bumps the search out of anonymous function called by lapply
(1), then again out of bootdht
(2) into the global environment (3).
I found this chapter of the Advanced R book really helpful in understanding this, but as you can tell it's still easy to make a mistake.
@erex changes in 0b2e074 and other CTDS branch commits may be relevant here for tracking what's going on. Including adding a bootstrap_ID
column to help keep track, revised default summary function that keeps stratum labels and a retunr that falls-back to data.frame
.
thanks Dave. I'll revisit the minke/amakihi data sets in light of the commit you reference later today.
Breaking up this report into 3 sections (1/1)
Amakihi bootstraps seem OK
library(Distance)
data("amakihi")
conv <- convert_units(amakihi_units$Units[2], NULL, amakihi_units$Units[1])
trunc1 <- 82.5
auk <- ds(data=amakihi, transect = "point", convert.units = conv,
key="hr", formula=~OBs+MAS, truncation = trunc1)
ex.D <- function(ests, fit) {
return(data.frame(Label = ests$individuals$D$Label,
Dhat = ests$individuals$D$Estimate))
}
bootauk.bysurvey <- bootdht(model=auk, flatfile=amakihi, convert.units = conv,
summary_fun = ex.D, nboot=200, cores = 3)
# library(vioplot)
# vioplot(Dhat~Label, subset=Label!="Total", data=bootauk.bysurvey)
library(FSA)
hist(Dhat~Label, data=subset(bootauk.bysurvey, Label!="Total"),
pre.main="Amakihi survey:", breaks=20)
Less good news regarding minkes using pooled detection function (2/3)
data("minke")
easy <- ds(data=minke, key="hr", truncation=1.5)
ex.N <- function(ests, fit) {
return(data.frame(Label = ests$individuals$N$Label,
Dhat = ests$individuals$N$Estimate))
}
easyboot <- bootdht(model=easy, flatfile=minke, nboot=500, cores = 3,
summary_fun = ex.N)
hist(Dhat~Label, data=subset(easyboot, Label!="Total"),
pre.main="Minke strata:", breaks=20)
tmp <- sort(easyboot$Dhat[easyboot$Label=="North"], decreasing = TRUE)
> head(tmp, n=15)
[1] 6133833.43 618157.55 570208.80 239256.59 199412.24 164542.96 124374.64 29935.73
[9] 29113.36 28602.89 28536.43 28325.86 27177.21 26271.82 25160.13
Seven of 498 replicates give exotic stratum-specific estimates of abundance (point estimate of North stratum abundance is 12181).
Least good news minkes using stratum-covariate detection function in bootstrap (3/3)
minke.strat <- ds(data=minke, key="hr", truncation = 1.5, formula=~Region.Label)
stratboot <- bootdht(model=minke.strat, flatfile=minke, nboot=50, cores = 1,
summary_fun = ex.N)
hist(Dhat~Label, data=subset(stratboot, Label!="Total"),
pre.main="Minke strata:", breaks=20)
Four separate attempts to generate 50 reps, each produce this message (at different bootstrap reps)
|======================================== | 44%
Error in fix.by(by.x, x) : 'by' must specify a uniquely valid column
One further request:
Now that the object created by bootdht
contains bootstrap_ID
as its last row, can summary.dht_bootstrap
not produce a summary for bootstrap_ID
, perhaps by removing the final row from object
here
Thanks for this info @erex. I think I have addressed 2 and 3 in the latest commits, but I'd be interested to know how you get along. I did some rudimentary testing both in parallel, which didn't immediately crash.
Trying out the stratified minke example (for 200 reps, not in parallel) following recent commits, I get:
> summary(subset(stratboot, Label=="North")$Dhat)
Min. 1st Qu. Median Mean 3rd Qu. Max.
5308 11561 15922 16883 20091 36681
> summary(subset(stratboot, Label=="South")$Dhat)
Min. 1st Qu. Median Mean 3rd Qu. Max.
3001 6534 8634 17923 13669 215605
Giving the following histograms:
(Note that these are abundance estimates, despite the Dhat
label, see code above.)
Results with minkes, pooled detection function with multiple cores:
Distance * 1.0.3.9002 2021-07-22 [1] Github (DistanceDevelopment/Distance@beb557d)
> data(minke)
> easy <- ds(data=minke, key="hr", truncation=1.5)
Starting AIC adjustment term selection.
Fitting hazard-rate key function
Key only model: not constraining for monotonicity.
AIC= 48.637
Fitting hazard-rate key function with cosine(2) adjustments
AIC= 50.386
Hazard-rate key function selected.
> ex.N <- function(ests, fit) {
+ return(data.frame(Label = ests$individuals$N$Label,
+ Dhat = ests$individuals$N$Estimate))
+ }
> easyboot <- bootdht(model=easy, flatfile=minke, nboot=500, cores = 3,
+ summary_fun = ex.N)
Performing 500 bootstraps
Progress bars cannot be shown when using cores>1
Error in { : task 387 failed - "$ operator is invalid for atomic vectors"
about to check with non-parallel simulation
Similar results for pooled minke boostrap when running with single core (lots of warning messages for nearly each iteration), eventually failing
Warning in mrds::check.mono(model, n.pts = 20) :
Detection function is not strictly monotonic!
|=============================================================================== | 85%Error :
gosolnp-->Could not find a feasible starting point...exiting
In addition: Warning messages:
1: In p0 * vscale[(neq + 2):(nc + np + 1)] :
longer object length is not a multiple of shorter object length
2: In p0 * vscale[(neq + 2):(nc + np + 1)] :
longer object length is not a multiple of shorter object length
3: In p0 * vscale[(neq + 2):(nc + np + 1)] :
longer object length is not a multiple of shorter object length
Error : $ operator is invalid for atomic vectors
|================================================================================ | 86%Warning in check.mono(result, n.pts = control$mono.points) :
Detection function is not strictly monotonic!
Warning in check.mono(result, n.pts = control$mono.points) :
Detection function is not strictly monotonic!
Warning in check.mono(result, n.pts = control$mono.points) :
Detection function is not strictly monotonic!
Warning in mrds::check.mono(model, n.pts = 20) :
Detection function is not strictly monotonic!
|================================================================================= | 88%Error :
gosolnp-->Could not find a feasible starting point...exiting
In addition: Warning messages:
1: In p0 * vscale[(neq + 2):(nc + np + 1)] :
longer object length is not a multiple of shorter object length
2: In p0 * vscale[(neq + 2):(nc + np + 1)] :
longer object length is not a multiple of shorter object length
3: In p0 * vscale[(neq + 2):(nc + np + 1)] :
longer object length is not a multiple of shorter object length
Error : $ operator is invalid for atomic vectors
|=================================================================================== | 90%Error in integrate(dpdf, lower = x[i, 1], upper = x[i, 2], width = width[i], :
the integral is probably divergent
In addition: Warning messages:
1: In optimx.run(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower, :
Hessian not computable after method nlminb
2: In optimx.run(par, optcfg$ufn, optcfg$ugr, optcfg$uhess, lower, :
Hessian not computable after method nlminb
Error: $ operator is invalid for atomic vectors
Thought came to me that perhaps the minkes are simply a pathological data set (unable to be bootstrapped). Thought I'd see what DistWin thought about bootstrapping these data for the pooled hazard rate detection function. Here's the result:
Estimate %CV # df 95% Confidence Interval
--------------------------------------------------------
Stratum: 1. Ideal Habitat (South in Distance package)
DS 0.53078E-01 91.60 999 17.56 0.10267E-01 0.27440
0.20037E-01 0.15183
Stratum: 2. Marginal Habitat (North in Distance package)
DS 0.22515E-01 72.22 999 12.89 0.55475E-02 0.91377E-01
0.81129E-02 0.54586E-01
For comparison, point estimates of group density from DistWin are 0.043 (South, CV 25%), 0.019 (North, CV 39%). So the means of the bootstrap reps are inflated (probably because of long right tail caused by some replicates having hazard rate detection functions that fall away very rapidly).
Data set in DistWin contains groups sizes, that are missing in Distance package, so attention should be on DS (density of groups). Values on the second line are the percentile CI bounds.
DistWin log window claims there are 5 convergence failures in 999 replicates. Lessons from DistWin, not necessarily a pathological data set, but magnitude of uncertainty is quite large.
hopefully this is resolved in cc5fba2 and was down to too much error handling
On 22/07/2021 09:11, erex wrote:
Thought came to me that perhaps the minkes are simply a pathological data set (unable to be bootstrapped). Thought I'd see what DistWin thought about bootstrapping these data for the pooled hazard rate detection function. Here's the result:
|Estimate %CV # df 95% Confidence Interval -------------------------------------------------------- Stratum: 1. Ideal Habitat (South in Distance package) DS 0.53078E-01 91.60 999 17.56 0.10267E-01 0.27440 0.20037E-01 0.15183 Stratum: 2. Marginal Habitat (North in Distance package) DS 0.22515E-01 72.22 999 12.89 0.55475E-02 0.91377E-01 0.81129E-02 0.54586E-01 |
Data set in DistWin contains groups sizes, that are missing in Distance package, so attention should be on DS (density of groups). Values on the second line are the percentile CI bounds.
DistWin log window claims there are 5 convergence failures in 999 replicates. Lessons from DistWin, not necessarily a pathological data set, but magnitude of uncertainty is quite large.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/DistanceDevelopment/Distance/issues/98#issuecomment-884731272, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAAPIKCT2XUS5OZM77DFNTTY7HB5ANCNFSM47WTISXA.
[ { @.": "http://schema.org", @.": "EmailMessage", "potentialAction": { @.": "ViewAction", "target": "https://github.com/DistanceDevelopment/Distance/issues/98#issuecomment-884731272", "url": "https://github.com/DistanceDevelopment/Distance/issues/98#issuecomment-884731272", "name": "View Issue" }, "description": "View this Issue on GitHub", "publisher": { @.": "Organization", "name": "GitHub", "url": "https://github.com" } } ]
(I'm running ~500 resamples in serial in debug mode to check but that will take a while)
Results of 200 reps (multi-core) using Distance * 1.0.3.9002 2021-07-22 [1] Github (DistanceDevelopment/Distance@cc5fba2)
no falling over, nor generating lots of warnings
> summary(easyboot)
Bootstrap results
Boostraps : 200
Successes : 199
Failures : 1
> summary(stratboot)
Bootstrap results
Boostraps : 200
Successes : 195
Failures : 5
Thanks Eric. I think the issue here with uncertainty (in both Distance
and Distance for Windows) is that there is a tendency with the harzard rate to fit a spike to these data, which occasionally leads to a low p, and large Nhats?
I suspect you are right. Occupational hazard of working with bootstraps with fragile models or data; can give rise to bizarre results. Guess advice for folks is: estimates of precision will be quite huge; price you pay for having a spike near the transect.
I'm embarked on chasing difference in QAIC calculations between Howe et al. (2018) and practical 4 of CTDS workshop. Will make a report this afternoon
Glad we've come to the same conclusion here. In which case am I okay to close this issue?
On 22/07/2021 13:56, erex wrote:
I suspect you are right. Occupational hazard of working with bootstraps with fragile models or data; can give rise to bizarre results. Guess advice for folks is: estimates of precision will be quite huge; price you pay for having a spike near the transect.
I'm embarked on chasing difference in QAIC calculations between Howe et al. (2018) and practical 4 of CTDS workshop. Will make a report this afternoon
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/DistanceDevelopment/Distance/issues/98#issuecomment-884889717, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAAPIO4YK6WNH5RVYLHJODTZAIOXANCNFSM47WTISXA.
[ { @.": "http://schema.org", @.": "EmailMessage", "potentialAction": { @.": "ViewAction", "target": "https://github.com/DistanceDevelopment/Distance/issues/98#issuecomment-884889717", "url": "https://github.com/DistanceDevelopment/Distance/issues/98#issuecomment-884889717", "name": "View Issue" }, "description": "View this Issue on GitHub", "publisher": { @.": "Organization", "name": "GitHub", "url": "https://github.com" } } ]
yep
I think you're exactly right @erex regarding "price you pay for having a spike". Potential mitigation strategies given the data have already been collected are:
Minkes with hazard rate and pooled detection function
Using
bootdht
There must be an easier way to get stratum-specific replicate estimates, but I don't know how.
North
Four convergence failures, but also 5 replicate estimates > 10X the point estimate of 12181 out of 499 (roughly 2%). Mean is 3X the median.
South
Similarly, 4 failures along with 8 replicate estimates > 10X the point estimate of 3653. Mean is 4X median
Total
Four failures plus 7 replicate estimates > 10X point estimate of 15834. Mean is 3X median.
From Distance 7.3