Open erex opened 2 years ago
Similar computations carried out with output (not raw data) provided by Arthur (user referenced in Distance list question). arthurs cvs.pdf
@lenthomas and I had a long chat about this. There are quite a few things going on here and we ended-up with more questions than answers... Some thoughts and reminders to self:
.Rmd
is not quite right as the default estimator in mrds
is the Innes et al (2002) so your use of CV(ER) to compare with the total CV shouldn't work (the missing component is given in equation 11 of that paper)size=1
) should give the same answers for Innes and Buckland et al (2001) variance estimators when there are no covariates in the detection functioner_method=1
) still leads to the wrong answer, so there's still something going wrong here (more on this later)I followed-up on your calculations @erex by configuring the same analysis in Distance for Windows. The pooled estimates are as follows:
Pooled Estimates:
Estimate %CV df 95% Confidence Interval
------------------------------------------------------
DS 0.21934E-01 31.02 14.17 0.11459E-01 0.41985E-01
D 0.51634E-01 33.75 17.95 0.25895E-01 0.10296
N 36935. 33.75 17.95 18523. 73647.
Note the CV here is very close to that reported by mrds
/Distance
when the Buckland et al (2001) method is used:
Density:
Label Estimate se cv lcl ucl df
1 North 0.01931297 0.007467003 0.3866315 0.008591546 0.04341371 12.44828
2 South 0.04141444 0.010105480 0.2440086 0.024909968 0.06885418 16.58302
3 Total 0.02193104 0.006801881 0.3101486 0.011457426 0.04197893 14.16840
(Comparing the Total
and DS
lines.)
I think the difference here comes from the fact that internally the R (maybe also Fortran) code doesn't simply add the CVs, but instead calculates the variances (actually covariance matrices) of the components on the abundance/density scale rather than on the their own scale (i.e., uncertainty about abundance/density with respect to p or n/L). These are then summed (rather than using the CV-squared trick).
Thanks @dill for continuing to work through this. I presume the results you present in comment uses the ClusterExercise
data set and a hazard rate key function without adjustments and er_method=1
. Am I right?
That results in agreement for DistWin and ds
when group size is not considered.
That gets us to point 2 in the comment above, correct? I'm assuming there is still detective work to do regarding uncertainty calculations with group size?
In the end, I don't yet know how to get back to Arthur with his question about how to perform variance components assessment for his situation. The last paragraph
R (maybe also Fortran) code doesn't simply add the CVs, but instead calculates the variances (actually covariance matrices) of the components on the abundance/density scale rather than on the their own scale (i.e., uncertainty about abundance/density with respect to p or n/L). These are then summed (rather than using the CV-squared trick).
is something I didn't know and I'm unsure how I would explain it to a user who wanted to get their hands dirty. Is this written down somewhere?
Another progress report (me and Dave are working on this at present):
We have worked through the calculations in the original question and have found an issue with the way that CV in total encounter rate is calculated in post-processing. The CV in total density is calculated correctly, but CV in ER (which is calculated separately, after CV density is calculated - i.e., not used in calculation of CV density) is not. This is why the manual calculations of overall CV were not working. Here's a demonstration that when we calculate CV in ER manually and correctly, we get the same answer as ds on CV in density:
data(ClusterExercise)
groups <- ds(ClusterExercise, key="hr", er_method = 1)
s.groups <- summary(groups)
s.groups2 <- s.groups$dht$clusters$summary
#Show CV on ER
s.groups2
#Manual calculation of Total cv.ER
#Just keep strata estimates
s.groups3 <- s.groups2[1:2, ]
#Total ER is area-weighted average of ER in each stratum
ER.tot <- with(s.groups3, sum(Area * ER) / sum(Area))
#Total var is area^2-weighted average of var in each stratum
var.ER.tot <- with(s.groups3, sum(Area ^ 2 * se.ER ^ 2) / sum(Area) ^ 2)
#CV is sqrt(var / ER^2)
cv.ER.tot <- sqrt(var.ER.tot / ER.tot ^ 2)
#Note this is different from what s.groups2 reports
print(cv.ER.tot)
#Work out total CV on density
cv.p <- s.groups$ds$average.p.se[1,1] / s.groups$ds$average.p
cv.tot <- sqrt(cv.p ^ 2 + cv.ER.tot ^ 2)
print(cv.tot)
#Note the above cv.tot *is* what s.groups$dht$clusters$D
s.groups$dht$clusters$D
#So, ds seems to be getting the total CV on encounter rate wrong, but
# the total CV on density right. This appears then to be "just" a
# post-processing issue in summary
Dave has isolated the bug - it occurs in mrds
in dht.R
line 310.
(Unfortunately, total CV in ER is not reported by DistWin which made it a bit harder to double check, but manual calculation did the trick.)
This bug also affected the calculated total ER, which was also reported incorrectly.
@erex could you please send (via email so private) Arthur's data to @dill and me, so we can check fixing this also resolves his issue, mentioned above.
BTW @erex, we are also discussing how to enhance the documentation around DistWin and the R
packages to make the methods for point and variance estimation with and without covariates more transparent. Will be happy to jointly discuss this at a DistDev meeting.
Another BTW for the record: the CV on density of clusters is not affected by whether we use the Buckland et al. (2001) method or the Innes et al. (2002) method:
data(ClusterExercise)
groups <- ds(ClusterExercise, key="hr", er_method = 1)
tmp <- summary(groups)
tmp$dht$clusters$D
groups <- ds(ClusterExercise, key="hr", er_method = 2)
tmp <- summary(groups)
tmp$dht$clusters$D
Both the same:
This is because things cancel out so the two variance estimators are the same mathematically.
However, the CV on density of individuals does depend on the variance estimation method.
data(ClusterExercise)
groups <- ds(ClusterExercise, key="hr", er_method = 1)
tmp <- summary(groups)
tmp$dht$individuals$D
groups <- ds(ClusterExercise, key="hr", er_method = 2)
tmp <- summary(groups)
tmp$dht$individuals$D
Gives
Different, with the CV calculated using Innes et al. (method = 2) smaller than Buckland et al (method = 1). This is because the former now includes an additional component for CV(E(s)) while the latter just uses the line-to-line CV in variance of estimated individuals, which is the same as the line-to-line CV in variance of estimated clusters since there are no covariates. In this sense, method = 2 is ignoring cluster size variance, and so appears to be underestimating overall variance. (This would not happen if there is a covariate in the df model that is correlated with cluster size, or even cluster size is in the model.)
This topic needs more thinking regarding what is the best default, and then some documentation!
@lenthomas Should you edit your previous comment about Innes vs Buckland variance sizes. The output suggests er_method=1
produces smaller SE and CV than output of er_method=2
?
Hmm, yes, clearly my understanding is flawed - Innes et al produces a larger variance. Not sure why. Thoughts, @dill?
One thought is that the Buckland estimator assumes group size and encounter rate are independent, while Innes et al does not.
While your fingers are deep inside this issue, I want to alert you to an oddity in precision measures for abundance/density of individuals in total study area results of differing er_method
argument values when analysing one of Arthur's replicates (that contains no detections in one of his strata):
runs below conducted using this version of Distance
Distance * 1.0.5.9001 2022-08-10 [1] Github (DistanceDevelopment/Distance@b386d14)
df <- read.csv(file='E:/troubleshoot/arthur/ds_camargue.csv',sep= ';')
df$Area <- df$Area / 1000^2
df$Effort <- df$Effort / 1000
df$distance <- df$distance / 1000
convu <- convert_units("kilometer", "kilometer", "square kilometer")
dist.bins <- c(0,60,160,360)/1000
groups <- ds(df, key="hn", cutpoints = dist.bins, er_method = 1)
Summary for individuals
Summary statistics:
Region Area CoveredArea Effort n k ER se.ER cv.ER mean.size se.mean
1 camargue 439.19964 252.49824 350.692 63401 22 180.78827 42.345959 0.2342296 960.6212 185.85088
2 gard 188.34741 108.64656 150.898 19178 15 127.09247 41.824256 0.3290852 767.1200 167.91072
3 rhone 13.23249 7.64712 10.621 0 3 0.00000 0.000000 0.0000000 0.0000 0.00000
4 vigueirat 18.55230 10.95480 15.215 332 2 21.82057 7.069714 0.3239930 110.6667 84.69816
5 Total 659.33184 379.74672 527.426 82911 42 157.19930 30.729351 0.1954802 882.0319 138.48758
Abundance:
Label Estimate se cv lcl ucl df
1 camargue 195650.6827 49143.6563 0.2511806 119881.07706 319309.69 117.699918
2 gard 58983.2393 21821.1954 0.3699559 28503.72831 122055.00 34.838134
3 rhone 0.0000 0.0000 0.0000000 NA NA NA
4 vigueirat 997.5001 798.1894 0.8001898 72.46542 13730.78 2.362988
5 Total 255631.4221 0.0000 0.0000000 0.00000 0.00 0.000000
Density:
Label Estimate se cv lcl ucl df
1 camargue 445.47095 111.89366 0.2511806 272.953493 727.0263 117.699918
2 gard 313.16194 115.85610 0.3699559 151.335919 648.0312 34.838134
3 rhone 0.00000 0.00000 0.0000000 NA NA NA
4 vigueirat 53.76693 43.02375 0.8001898 3.906008 740.1120 2.362988
5 Total 387.71284 0.00000 0.0000000 0.000000 0.0000 0.000000
groups <- ds(df, key="hn", cutpoints = dist.bins, er_method = 2)
Summary for individuals
Summary statistics:
Region Area CoveredArea Effort n k ER se.ER cv.ER mean.size se.mean
1 camargue 439.19964 252.49824 350.692 63401 22 180.78827 42.345959 0.2342296 960.6212 185.85088
2 gard 188.34741 108.64656 150.898 19178 15 127.09247 41.824256 0.3290852 767.1200 167.91072
3 rhone 13.23249 7.64712 10.621 0 3 0.00000 0.000000 0.0000000 0.0000 0.00000
4 vigueirat 18.55230 10.95480 15.215 332 2 21.82057 7.069714 0.3239930 110.6667 84.69816
5 Total 659.33184 379.74672 527.426 82911 42 157.19930 30.729351 0.1954802 882.0319 138.48758
Abundance:
Label Estimate se cv lcl ucl df
1 camargue 195650.6827 49978.7900 0.2554491 117034.46749 327076.2 29.469070
2 gard 58983.2393 20320.3754 0.3445110 29078.93897 119640.6 16.792117
3 rhone 0.0000 0.0000 0.0000000 0.00000 0.0 0.000000
4 vigueirat 997.5001 338.8011 0.3396502 59.21283 16803.9 1.207644
5 Total 255631.4221 56178.3864 0.2197632 165027.69422 395978.5 44.244253
Density:
Label Estimate se cv lcl ucl df
1 camargue 445.47095 113.79515 0.2554491 266.472137 744.7096 29.469070
2 gard 313.16194 107.88774 0.3445110 154.389907 635.2125 16.792117
3 rhone 0.00000 0.00000 0.0000000 0.000000 0.0000 0.000000
4 vigueirat 53.76693 18.26195 0.3396502 3.191671 905.7585 1.207644
5 Total 387.71284 85.20503 0.2197632 250.295352 600.5755 44.244253
Just a note to say that @dill is going to post a response to the original question that raised the issue, about variance decomposition, here.
I'll assume that @LHMarshall's work at #61 fixes the issue raised here.
Going back to the original issue, the variance contributions are hard to calculate from these tables.
Two rough recipes:
In either case, this is what we end up with in the variance of Nhat in the dht
tables. So one could take CV(Nhat) and subtract the other CVs (squared) to get the components. But this should be resolved by #70 (once I've done that).
Because of a question to the Distance list, I did some manual calculation of CVs for the
ClusterExercise
sample data set. I found disagreement in the manner in which CVs were reported.Zip file of .Rmd and .html below
CV comparison.zip