Closed LHMarshall closed 1 year ago
@LHMarshall see related issue https://github.com/DistanceDevelopment/Distance/issues/151
about knock-on consequences of uniform detection function and Hessian for dht2
@lenthomas how should the expected cluster size be calculated in the case of a uniform detection function with no adjustments?
I need an alternative for this situation than the current code based on all sorts that are NULL in this situation. I guess there is no variability from the N_hat and Nc_hat values so is it just the variability observed in the E[cluster size] values? The standard deviation of the cluster sizes observed in each stratum divided by n-1?
# This computes the se(E(s)). It essentially uses 3.37 from Ads but in
# place of using 3.25, 3.34 and 3.38, it uses 3.27, 3.35 and an equivalent
# cov replacement term for 3.38. This uses line to line variability
# whereas the other formula measure the variance of E(s) within the lines
# and it goes to zero as p approaches 1.
if(se & options$varflag!=1){
numRegions <- length(unique(samples$Region.Label))
if(options$varflag==2){
cov.Nc.Ncs <- rep(0, numRegions)
scale <- clusters$summary$Area/clusters$summary$CoveredArea
for(i in 1:numRegions){
c.stratum.data <- clusters$Nhat.by.sample[
as.character(clusters$Nhat.by.sample$Region.Label) ==
as.character(region.table$Region.Label[i]), ]
i.stratum.data <- individuals$Nhat.by.sample[
as.character(individuals$Nhat.by.sample$Region.Label) ==
as.character(region.table$Region.Label[i]), ]
Li <- sum(c.stratum.data$Effort.x)
cov.Nc.Ncs[i] <- covn(c.stratum.data$Effort.x/(scale[i]*Li),
c.stratum.data$Nhat/scale[i],
i.stratum.data$Nhat/scale[i],
options$ervar)
}
}else{
cov.Nc.Ncs <- as.vector(by(obs$size*(1 - obs$pdot)/obs$pdot^2,
obs$Region.Label, sum))
cov.Nc.Ncs[is.na(cov.Nc.Ncs)] <- 0
}
cov.Nc.Ncs[is.nan(cov.Nc.Ncs)] <- 0
if(numRegions > 1){
cov.Nc.Ncs <- c(cov.Nc.Ncs, sum(cov.Nc.Ncs))
}
cov.Nc.Ncs <- cov.Nc.Ncs +
diag(t(clusters$vc$detection$partial)%*%
solvecov(model$hessian)$inv%*%
individuals$vc$detection$partial)
se.Expected.S <- as.vector(clusters$N$cv)^2 +
as.vector(individuals$N$cv)^2 -
2*cov.Nc.Ncs/
(as.vector(individuals$N$Estimate)*
as.vector(clusters$N$Estimate))
Expected.S[is.nan(Expected.S)] <- 0
se.Expected.S[se.Expected.S<=0 | is.nan(se.Expected.S)] <- 0
se.Expected.S <- as.vector(Expected.S)*sqrt(se.Expected.S)
Expected.S <- data.frame(Region = clusters$N$Label,
Expected.S = as.vector(Expected.S),
se.Expected.S = as.vector(se.Expected.S))
}else{
Expected.S[is.nan(Expected.S)] <- 0
Expected.S <- data.frame(Region = clusters$N$Label,
Expected.S = as.vector(Expected.S))
}
It no longer crashes... but it omits the se of the expected cluster size. I notice that there is still variability in the abundance estimates due to the encounter rate variability which will need to be incorporated into the se of the expected cluster size.
> dfm <- ds( dat ,
+ transect = "line",
+ truncation = 400,
+ key = "unif",
+ optimizer = "R",
+ dht_se = TRUE)
Starting AIC adjustment term selection.
Fitting uniform key function
AIC= 18142.155
Fitting uniform key function with cosine(1) adjustments
AIC= 18143.256
Uniform key function selected.
> summary(dfm)
Summary for distance analysis
Number of observations : 1514
Distance range : 0 - 400
Model : Uniform key function
AIC : 18142.15
Optimisation: mrds (nlminb)
Detection function parameters
Scale coefficient(s):
NULL
Estimate
Average p 1
N in covered region 1514
Summary for clusters
Summary statistics:
Region Area CoveredArea Effort n k ER se.ER cv.ER
1 20A 6536.115 12721112000 15901390 390 305 2.452616e-05 1.075680e-06 0.04385848
2 35A 3486.357 7406071200 9257589 321 134 3.467425e-05 5.024032e-06 0.14489228
3 35B 3433.566 10967019200 13708774 307 194 2.239442e-05 2.196862e-06 0.09809864
4 53A 3508.205 6576117600 8220147 209 171 2.542534e-05 1.146461e-06 0.04509128
5 53B 4125.793 7253688800 9067111 287 164 3.165286e-05 2.784125e-06 0.08795807
6 Total 21090.037 44924008800 56155011 1514 968 2.740042e-05 1.123309e-06 0.04099607
Abundance:
Label Estimate se cv lcl ucl df
1 20A 2.003822e-04 8.788461e-06 0.04385848 1.838212e-04 0.0002184353 304.0000
2 35A 1.511085e-04 2.189446e-05 0.14489228 1.136239e-04 0.0002009593 133.0000
3 35B 9.611589e-05 9.428838e-06 0.09809864 7.924424e-05 0.0001165796 193.0000
4 53A 1.114966e-04 5.027526e-06 0.04509128 1.020057e-04 0.0001218706 170.0000
5 53B 1.632414e-04 1.435840e-05 0.08795807 1.372605e-04 0.0001941401 163.0000
6 Total 7.223447e-04 2.961329e-05 0.04099607 6.664233e-04 0.0007829587 374.6173
Density:
Label Estimate se cv lcl ucl df
1 20A 3.065770e-08 1.344600e-09 0.04385848 2.812392e-08 3.341975e-08 304.0000
2 35A 4.334282e-08 6.280040e-09 0.14489228 3.259103e-08 5.764163e-08 133.0000
3 35B 2.799302e-08 2.746077e-09 0.09809864 2.307928e-08 3.395293e-08 193.0000
4 53A 3.178167e-08 1.433076e-09 0.04509128 2.907632e-08 3.473874e-08 170.0000
5 53B 3.956608e-08 3.480156e-09 0.08795807 3.326888e-08 4.705522e-08 163.0000
6 Total 3.425052e-08 1.404137e-09 0.04099607 3.159896e-08 3.712458e-08 374.6173
Summary for individuals
Summary statistics:
Region Area CoveredArea Effort n k ER se.ER cv.ER mean.size se.mean
1 20A 6536.115 12721112000 15901390 264 305 1.660232e-05 3.116001e-06 0.1876847 0.6769231 0.07176285
2 35A 3486.357 7406071200 9257589 647 134 6.988861e-05 1.730274e-05 0.2475760 2.0155763 0.14483992
3 35B 3433.566 10967019200 13708774 483 194 3.523291e-05 9.024984e-06 0.2561521 1.5732899 0.15410961
4 53A 3508.205 6576117600 8220147 238 171 2.895325e-05 5.958753e-06 0.2058060 1.1387560 0.14945122
5 53B 4125.793 7253688800 9067111 511 164 5.635753e-05 1.051900e-05 0.1866476 1.7804878 0.15575080
6 Total 21090.037 44924008800 56155011 2143 968 3.827586e-05 4.060756e-06 0.1060918 1.4154557 0.06098587
Abundance:
Label Estimate se cv lcl ucl df
1 20A 0.0001356434 2.545818e-05 0.1876847 9.405647e-05 0.0001956178 304.0000
2 35A 0.0003045708 7.540442e-05 0.2475760 1.880057e-04 0.0004934072 133.0000
3 35B 0.0001512182 3.873485e-05 0.2561521 9.197454e-05 0.0002486224 193.0000
4 53A 0.0001269674 2.613066e-05 0.2058060 8.493381e-05 0.0001898035 170.0000
5 53B 0.0002906494 5.424901e-05 0.1866476 2.016855e-04 0.0004188555 163.0000
6 Total 0.0010090492 1.070519e-04 0.1060918 8.195958e-04 0.0012422956 420.9479
Density:
Label Estimate se cv lcl ucl df
1 20A 2.075290e-08 3.895002e-09 0.1876847 1.439027e-08 2.992876e-08 304.0000
2 35A 8.736076e-08 2.162843e-08 0.2475760 5.392612e-08 1.415251e-07 133.0000
3 35B 4.404114e-08 1.128123e-08 0.2561521 2.678688e-08 7.240938e-08 193.0000
4 53A 3.619157e-08 7.448441e-09 0.2058060 2.421005e-08 5.410273e-08 170.0000
5 53B 7.044692e-08 1.314875e-08 0.1866476 4.888405e-08 1.015212e-07 163.0000
6 Total 4.784483e-08 5.075945e-09 0.1060918 3.886175e-08 5.890438e-08 420.9479
Expected cluster size
Region Expected.S
1 20A 0.6769231
2 35A 2.0155763
3 35B 1.5732899
4 53A 1.1387560
5 53B 1.7804878
6 Total 1.3969080
Hi @LHMarshall I'm afraid I'm not familiar with the code here at all, but I do think I know what should happen from the statistics perspective. When using a uniform+no adj then the expected cluster size is the mean cluster size, so the standard error of mean cluster size is the standard deviation of cluster size divided by sqrt (n). (Note, not n-1.) This formula can be used at both the stratum level and, presumably, at the global level. I say "presumably" because I assume that the "Total" expected cluster size is simply the mean cluster size taken over all observed individuals. It may not be that simple however -- for example it might be an area-weighted mean if the samples in each stratum are taken to be representative of the groups in that stratum and the strata differ in area. So it's important to know how the "Total" E(S) is arrived at in order to know exactly how to calculate the "Total" se(E(S)). Hope that makes sense.
@lenthomas Ok I've looked into how the E(S) values are calculated. The total E(S) value looks to be a weighted mean based on the estimated abundance / density which makes sense. What does that mean for calculating the se of the total E(S) as there will be variability associated with the values which the weighted mean is based on due to encounter rate?
See exploratory code below
> fit <- ds(dist.dat,
+ key = "unif",
+ nadj = 0,
+ truncation = 0.5)
Fitting uniform key function
AIC= -346.574
>
> fit.sum <- summary(fit)
>
> fit.sum$dht$individuals$N$Estimate
[1] 1024.239 285.000 2100.000 3409.239
> fit.sum$dht$clusters$N$Estimate
[1] 84.13043 84.64286 85.00000 253.77329
> # The Expected.S values are calculated as
> fit.sum$dht$individuals$N$Estimate/fit.sum$dht$clusters$N$Estimate
[1] 12.174419 3.367089 24.705882 13.434192
>
> fit.sum$dht$Expected.S$Expected.S
[1] 12.174419 3.367089 24.705882 13.434192
>
> tmp <- na.omit(dist.dat2)
>
> # This is simply the mean of the observed clusters in each stratum for the stratum estimates
> mean.size.by.strat <- by(tmp$size, tmp$Region.Label, mean)
> mean.size.by.strat
tmp$Region.Label: central
[1] 12.17442
-------------------------------------------------------
tmp$Region.Label: NE
[1] 3.367089
-------------------------------------------------------
tmp$Region.Label: SW
[1] 24.70588
>
> # But it is not quite just the overall mean for the global estimate
> mean(tmp$size)
[1] 13.652
> # But it's definately not a weighted mean in terms of area...
> # must be a weighted mean in terms of estimated density/abundance?
> sum(mean.size.by.strat)/3
[1] 13.4158
> area.props <- region@area/sum(region@area)
> sum(mean.size.by.strat*area.props)
[1] 7.407668
>
> # Yup this seems to work but the estimated density / abundance has some variability associated with it...
> Nhat.props <- fit.sum$dht$clusters$N$Estimate[1:3]/sum(fit.sum$dht$clusters$N$Estimate[1:3])
> sum(mean.size.by.strat*Nhat.props)
[1] 13.43419
Simulated data used different areas and different densities in each stratum - also different mean cluster sizes
Hmm... that means E(s) is estimated likely using hat(N) / hat(N)_s, i.e., ratio of estimated total population size to estimated cluster size (see formula 3.36 in Marques and Buckland 2004 - chapter 3 of the Advanced Distance Sampling book (although my notation is different)). Equation 3.37 has the variance calculation and 3.38 has the relevant covariate. Does that help?
@lenthomas and therein lies the problem... with the uniform model and no adjustments we now have no hessian matrix as required in 3.38. Is the covariance just 0 then if that is entirely associated with the detection function estimation? I assumed you meant covariance rather than covariates above? With the uniform and no adjustments you cannot have covariates right?
That's right no covariates with uniform.
@lenthomas ok so I just made the covariance 0 in this case (commit should be reference above).... my output matches my initial understanding of formula 3.37 (but see below as I realised that the values I can get at easily are not the ones in the formula in the book) and also these estimates bear no resemblance to the group sd(size)/sqrt(n)! See image below...
So to check those values (and they match) I calculated E[s]^2 * {var_hat(N_hat)/N_hat^2 + var_hat(Ns_hat)/Ns_hat^2) where N_hat is the total estimated abundance of individuals and Ns is the total estimated abundance of clusters. In the book it uses the estimates of individual and cluster abundance for the covered area not the whole region.
I couldn't easily get at the relevant values for cluster and individual estimates for the covered areas by stratum which are the values that 3.37 is based on. But I think the above is ok because of the rules for multiplication by a constant which is all that has happened to these values right?
> fit.sum
Summary for distance analysis
Number of observations : 248
Distance range : 0 - 0.5
Model : Uniform key function
AIC : -343.801
Optimisation: mrds (nlminb)
Detection function parameters
Scale coefficient(s):
NULL
Estimate
Average p 1
N in covered region 248
Summary for clusters
Summary statistics:
Region Area CoveredArea Effort n k ER se.ER
1 central 337.5 330 330 82 22 0.2484848 0.03012937
2 NE 37.5 40 40 82 8 2.0500000 0.36005952
3 SW 150.0 150 150 84 15 0.5600000 0.04760952
4 Total 525.0 520 520 248 45 0.4661688 0.03495183
cv.ER
1 0.12125235
2 0.17563879
3 0.08501701
4 0.07497677
Abundance:
Label Estimate se cv lcl ucl df
1 central 83.86364 10.168663 0.12125235 65.23222 107.8165 21.00000
2 NE 76.87500 13.502232 0.17563879 50.90757 116.0882 7.00000
3 SW 84.00000 7.141428 0.08501701 70.02145 100.7691 14.00000
4 Total 244.73864 18.349713 0.07497677 209.43403 285.9946 20.82922
Density:
Label Estimate se cv lcl ucl df
1 central 0.2484848 0.03012937 0.12125235 0.1932807 0.3194563 21.00000
2 NE 2.0500000 0.36005952 0.17563879 1.3575351 3.0956841 7.00000
3 SW 0.5600000 0.04760952 0.08501701 0.4668097 0.6717941 14.00000
4 Total 0.4661688 0.03495183 0.07497677 0.3989220 0.5447516 20.82922
Summary for individuals
Summary statistics:
Region Area CoveredArea Effort n k ER se.ER
1 central 337.5 330 330 982 22 2.975758 0.3960008
2 NE 37.5 40 40 268 8 6.700000 1.1038892
3 SW 150.0 150 150 2102 15 14.013333 1.1491971
4 Total 525.0 520 520 3352 45 6.395368 0.4228860
cv.ER mean.size se.mean
1 0.1330756 11.975610 0.4032298
2 0.1647596 3.268293 0.1691760
3 0.0820074 25.023810 0.5432991
4 0.0661238 13.516129 0.6161852
Abundance:
Label Estimate se cv lcl ucl df
1 central 1004.318 133.65028 0.1330756 762.4461 1322.9197 21.00000
2 NE 251.250 41.39585 0.1647596 170.6236 369.9757 7.00000
3 SW 2102.000 172.37956 0.0820074 1763.4913 2505.4867 14.00000
4 Total 3357.568 222.01515 0.0661238 2934.3388 3841.8413 30.87844
Density:
Label Estimate se cv lcl ucl df
1 central 2.975758 0.3960008 0.1330756 2.259100 3.919762 21.00000
2 NE 6.700000 1.1038892 0.1647596 4.549962 9.866017 7.00000
3 SW 14.013333 1.1491971 0.0820074 11.756609 16.703245 14.00000
4 Total 6.395368 0.4228860 0.0661238 5.589217 7.317793 30.87844
Expected cluster size
Region Expected.S se.Expected.S cv.Expected.S
1 central 11.975610 2.1559842 0.18003127
2 NE 3.268293 0.7870732 0.24082090
3 SW 25.023810 2.9558941 0.11812326
4 Total 13.718995 1.3714792 0.09996936