DistanceDevelopment / mrds

R package for mark-recapture-distance-sampling analysis
GNU General Public License v3.0

Computation of CVs for individual and group density #52

Open erex opened 2 years ago

erex commented 2 years ago

Prompted by a question on the Distance list, I did some manual calculations of CVs for the ClusterExercise sample data set and found a disagreement in the way CVs are reported.

Zip file of .Rmd and .html below

CV comparison.zip

erex commented 2 years ago

Similar computations were carried out with the output (not raw data) provided by Arthur (the user referenced in the Distance list question): arthurs cvs.pdf

dill commented 2 years ago

@lenthomas and I had a long chat about this. There are quite a few things going on here and we ended up with more questions than answers... Some thoughts and reminders to self:

  1. the comparison you made in the .Rmd is not quite right: the default estimator in mrds is Innes et al (2002), so your use of CV(ER) to compare with the total CV shouldn't work (the missing component is given in equation 11 of that paper; a rough sketch of the two encounter-rate variance flavours follows this list)
  2. we think that the cluster abundance (i.e., setting size=1) should give the same answers for Innes and Buckland et al (2001) variance estimators when there are no covariates in the detection function
  3. setting the example to use the Buckland et al (2001) estimator (setting er_method=1) still leads to the wrong answer, so there's still something going wrong here (more on this later)
  4. there is a weird rescaling of the variance which we currently don't understand for the Buckland et al (2001) case, which we assume is the source of the issue
  5. when cluster size is not a covariate (but there are other covariates in the detection function) then Distance for Windows uses Buckland et al (2001) not Innes et al (as you might assume from what's in the Advanced Distance Sampling book or Marques et al (2003)).
  6. Distance for Windows doesn't do analytic variance estimation when cluster size is included in the detection function (bootstrap only)
  7. a comment in the MCDS source (c. 2005) suggests we should look into this
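
For reference in the points above, here is a rough sketch (not the mrds internals) of the two encounter-rate variance flavours. It assumes the R2-style weighting of Fewster et al (2009); the per-transect counts, effort and phat below are all hypothetical.

# Rough sketch, not the mrds internals: Buckland-style (n/L) versus
# Innes-style (Nhat/L) encounter rate variance, assuming R2-type weighting
# (Fewster et al 2009).  All numbers below are hypothetical.
er_var_R2 <- function(x, l) {
  k <- length(x)            # number of transects
  L <- sum(l)               # total effort
  ER <- sum(x) / L          # overall rate
  k / (L^2 * (k - 1)) * sum(l^2 * (x / l - ER)^2)
}

n_i    <- c(4, 7, 0, 12)          # hypothetical per-transect counts
l_i    <- c(2.1, 3.4, 1.0, 4.2)   # hypothetical per-transect effort
Nhat_i <- n_i / 0.6               # hypothetical, assuming a common phat of 0.6

er_var_R2(n_i, l_i)     # component used by Buckland et al (2001), er_method=1
er_var_R2(Nhat_i, l_i)  # component used by Innes et al (2002), er_method=2

The Innes-style version pushes the detection function estimate through the same formula via the per-transect Nhat_i, which is why a CV(ER) based only on n/L cannot simply be combined with CV(p) to recover the total CV under the default er_method.
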
dill commented 2 years ago

I followed up on your calculations @erex by configuring the same analysis in Distance for Windows. The pooled estimates are as follows:

Pooled Estimates:
                         Estimate      %CV     df     95% Confidence Interval
                        ------------------------------------------------------
                 DS     0.21934E-01   31.02    14.17 0.11459E-01  0.41985E-01
                 D      0.51634E-01   33.75    17.95 0.25895E-01  0.10296    
                 N       36935.       33.75    17.95  18523.       73647.    

Note the CV here is very close to that reported by mrds/Distance when the Buckland et al (2001) method is used:

Density:
  Label   Estimate          se        cv         lcl        ucl       df
1 North 0.01931297 0.007467003 0.3866315 0.008591546 0.04341371 12.44828
2 South 0.04141444 0.010105480 0.2440086 0.024909968 0.06885418 16.58302
3 Total 0.02193104 0.006801881 0.3101486 0.011457426 0.04197893 14.16840

(Comparing the Total and DS lines.)

I think the difference here comes from the fact that internally the R (maybe also Fortran) code doesn't simply add the CVs, but instead calculates the variances (actually covariance matrices) of the components on the abundance/density scale rather than on their own scale (i.e., uncertainty about abundance/density with respect to p or n/L). These are then summed (rather than using the CV-squared trick).
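
To make that concrete, here is a toy illustration (not the mrds code) of the two ways of combining the n/L and phat components for a line-transect density D = n / (2 w L phat); all numbers are hypothetical and covariances are ignored, which is precisely where the two approaches can diverge in practice.

# Toy illustration, not the mrds code: combining n/L and phat uncertainty
# for D = n / (2 * w * L * phat).  All numbers are hypothetical.
n <- 120; L <- 50; w <- 0.3; phat <- 0.55
var_ER <- 1.2^2          # hypothetical var(n/L)
var_p  <- 0.04^2         # hypothetical var(phat)
D <- n / (2 * w * L * phat)

# (a) the "CV-squared trick": independent multiplicative components
cv_a <- sqrt(var_ER / (n / L)^2 + var_p / phat^2)

# (b) put each component on the density scale first, then sum the variances
var_b <- (D / (n / L))^2 * var_ER + (D / phat)^2 * var_p  # covariance ignored
cv_b  <- sqrt(var_b) / D

c(cv_a, cv_b)  # equal in this simple case; they differ once covariances and
               # stratum weighting enter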

erex commented 2 years ago

Thanks @dill for continuing to work through this. I presume the results you present in the comment above use the ClusterExercise data set and a hazard rate key function without adjustments and er_method=1. Am I right?

That results in agreement between DistWin and ds when group size is not considered.

That gets us to point 2 in the comment above, correct? I'm assuming there is still detective work to do regarding uncertainty calculations with group size?

In the end, I don't yet know how to get back to Arthur with his question about how to perform variance components assessment for his situation. The last paragraph

R (maybe also Fortran) code doesn't simply add the CVs, but instead calculates the variances (actually covariance matrices) of the components on the abundance/density scale rather than on their own scale (i.e., uncertainty about abundance/density with respect to p or n/L). These are then summed (rather than using the CV-squared trick).

is something I didn't know and I'm unsure how I would explain it to a user who wanted to get their hands dirty. Is this written down somewhere?

lenthomas commented 2 years ago

Another progress report (Dave and I are working on this at present):

We have worked through the calculations in the original question and found an issue with the way the CV of total encounter rate is calculated in post-processing. The CV of total density is calculated correctly, but the CV of ER (which is calculated separately, after the CV of density, and is not used in that calculation) is not. This is why the manual calculations of the overall CV were not working. Here's a demonstration that when we calculate the CV of ER manually and correctly, we get the same answer as ds for the CV of density:

library(Distance)
data(ClusterExercise)
groups <- ds(ClusterExercise, key="hr", er_method = 1)
s.groups <- summary(groups)
#Per-stratum and total summary for clusters
s.groups2 <- s.groups$dht$clusters$summary
#Show CV on ER
s.groups2

#Manual calculation of Total cv.ER

#Just keep strata estimates
s.groups3 <- s.groups2[1:2, ]
#Total ER is area-weighted average of ER in each stratum
ER.tot <- with(s.groups3, sum(Area * ER) / sum(Area))
#Total var is area^2-weighted average of var in each stratum
var.ER.tot <- with(s.groups3, sum(Area ^ 2 * se.ER ^ 2) / sum(Area) ^ 2)
#CV is sqrt(var / ER^2)
cv.ER.tot <- sqrt(var.ER.tot / ER.tot ^ 2)

#Note this is different from what s.groups2 reports
print(cv.ER.tot)

#Work out total CV on density
cv.p <- s.groups$ds$average.p.se[1,1] / s.groups$ds$average.p
cv.tot <- sqrt(cv.p ^ 2 + cv.ER.tot ^ 2)
print(cv.tot)

#Note the above cv.tot *is* the cv reported in s.groups$dht$clusters$D
s.groups$dht$clusters$D

#So, ds seems to be getting the total CV on encounter rate wrong, but
# the total CV on density right.  This appears then to be "just" a 
# post-processing issue in summary

Dave has isolated the bug - it occurs in mrds in dht.R line 310.

(Unfortunately, total CV in ER is not reported by DistWin, which made it a bit harder to double-check, but the manual calculation did the trick.)

lenthomas commented 2 years ago

This bug also affected the calculated total ER, which was reported incorrectly.

lenthomas commented 2 years ago

@erex could you please send Arthur's data (via email, to keep it private) to @dill and me, so we can check that fixing this also resolves his issue, mentioned above.

lenthomas commented 2 years ago

BTW @erex, we are also discussing how to enhance the documentation around DistWin and the R packages to make the methods for point and variance estimation, with and without covariates, more transparent. We will be happy to discuss this jointly at a DistDev meeting.

lenthomas commented 2 years ago

Another BTW for the record: the CV on density of clusters is not affected by whether we use the Buckland et al. (2001) method or the Innes et al. (2002) method:

data(ClusterExercise)
#Buckland et al (2001) variance estimator
groups <- ds(ClusterExercise, key="hr", er_method = 1)
tmp <- summary(groups)
tmp$dht$clusters$D

#Innes et al (2002) variance estimator
groups <- ds(ClusterExercise, key="hr", er_method = 2)
tmp <- summary(groups)
tmp$dht$clusters$D

Both the same:

[screenshot: identical density-of-clusters estimates for er_method = 1 and er_method = 2]

This is because, with no covariates in the detection function, terms cancel and the two variance estimators are mathematically the same for clusters (a small sketch of the cancellation is below).
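
A small sketch of the intuition, not the mrds code: with no covariates, phat is common to all detections, so the per-transect Nhat_i are a constant multiple of the n_i and the constant cancels in the CV. The numbers and the R2-style estimator below are purely illustrative assumptions.

# Sketch of the cancellation, not the mrds code.  With no covariates phat is
# constant, so Nhat_i is a constant multiple of n_i and the CV is unchanged.
er_var <- function(x, l) {                   # R2-style estimator, as above
  k <- length(x); L <- sum(l); ER <- sum(x) / L
  k / (L^2 * (k - 1)) * sum(l^2 * (x / l - ER)^2)
}
n_i  <- c(4, 7, 0, 12)                       # hypothetical cluster counts
l_i  <- c(2.1, 3.4, 1.0, 4.2)                # hypothetical effort
phat <- 0.6                                  # hypothetical common phat

cv_buckland <- sqrt(er_var(n_i, l_i)) / (sum(n_i) / sum(l_i))
cv_innes    <- sqrt(er_var(n_i / phat, l_i)) / (sum(n_i / phat) / sum(l_i))
all.equal(cv_buckland, cv_innes)             # TRUE: the 1/phat scaling cancels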

However, the CV on density of individuals does depend on the variance estimation method.

data(ClusterExercise)
#Buckland et al (2001) variance estimator
groups <- ds(ClusterExercise, key="hr", er_method = 1)
tmp <- summary(groups)
tmp$dht$individuals$D

#Innes et al (2002) variance estimator
groups <- ds(ClusterExercise, key="hr", er_method = 2)
tmp <- summary(groups)
tmp$dht$individuals$D

Gives

[screenshot: density-of-individuals estimates differing between er_method = 1 and er_method = 2]

Different, with the CV calculated using Innes et al. (method = 2) smaller than with Buckland et al. (method = 1). This is because the Buckland et al. estimate includes an additional component for CV(E(s)), while the Innes et al. estimate just uses the line-to-line variance of estimated individuals, which is the same as the line-to-line variance of estimated clusters since there are no covariates. In this sense, method = 2 is ignoring cluster size variance, and so appears to be underestimating overall variance. (This would not happen if there were a covariate in the detection function model that is correlated with cluster size, or even if cluster size were in the model.)

This topic needs more thinking regarding what is the best default, and then some documentation!

erex commented 2 years ago

@lenthomas Should you edit your previous comment about Innes vs Buckland variance sizes? The output suggests er_method=1 produces smaller SE and CV than er_method=2.

lenthomas commented 2 years ago

Hmm, yes, clearly my understanding is flawed - Innes et al produces a larger variance. Not sure why. Thoughts, @dill?

lenthomas commented 2 years ago

One thought is that the Buckland estimator assumes group size and encounter rate are independent, while Innes et al does not.
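
A purely hypothetical simulation of that point (not either estimator as implemented): when per-transect counts and mean group sizes are positively correlated, combining their CVs under an independence assumption understates the variability of per-transect individuals.

# Hypothetical illustration of the independence point above; this is not
# the Buckland or Innes estimator, just transect-level CVs.
set.seed(1)
k <- 200
n_i <- rpois(k, 10)                               # clusters per transect
s_i <- pmax(1, round(2 + 0.3 * n_i + rnorm(k)))   # mean size, correlated with n_i
ind_i <- n_i * s_i                                # individuals per transect

cv <- function(x) sd(x) / mean(x)
sqrt(cv(n_i)^2 + cv(s_i)^2)  # combined assuming independence
cv(ind_i)                    # direct CV of per-transect individuals;
                             # larger when counts and sizes are correlated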

erex commented 2 years ago

While your fingers are deep inside this issue, I want to alert you to an oddity in the precision measures for abundance/density of individuals in the total study area, arising from different er_method argument values, when analysing one of Arthur's replicates (which contains no detections in one of his strata):

Runs below were conducted using this version of Distance: Distance 1.0.5.9001 (2022-08-10), GitHub (DistanceDevelopment/Distance@b386d14).

er_method=1

library(Distance)
df <- read.csv(file='E:/troubleshoot/arthur/ds_camargue.csv', sep=';')
#Rescale Area, Effort and distance to kilometre-based units
df$Area <- df$Area / 1000^2
df$Effort <- df$Effort / 1000
df$distance <- df$distance / 1000
convu <- convert_units("kilometer", "kilometer", "square kilometer")
#(convu is defined but not passed to ds(); the manual rescaling above is used instead)
#Distance cutpoints, rescaled to match
dist.bins <- c(0, 60, 160, 360) / 1000
groups <- ds(df, key="hn", cutpoints = dist.bins, er_method = 1)
summary(groups)

Summary for individuals

Summary statistics:
     Region      Area CoveredArea  Effort     n  k        ER     se.ER     cv.ER mean.size   se.mean
1  camargue 439.19964   252.49824 350.692 63401 22 180.78827 42.345959 0.2342296  960.6212 185.85088
2      gard 188.34741   108.64656 150.898 19178 15 127.09247 41.824256 0.3290852  767.1200 167.91072
3     rhone  13.23249     7.64712  10.621     0  3   0.00000  0.000000 0.0000000    0.0000   0.00000
4 vigueirat  18.55230    10.95480  15.215   332  2  21.82057  7.069714 0.3239930  110.6667  84.69816
5     Total 659.33184   379.74672 527.426 82911 42 157.19930 30.729351 0.1954802  882.0319 138.48758

Abundance:
      Label    Estimate         se        cv          lcl       ucl         df
1  camargue 195650.6827 49143.6563 0.2511806 119881.07706 319309.69 117.699918
2      gard  58983.2393 21821.1954 0.3699559  28503.72831 122055.00  34.838134
3     rhone      0.0000     0.0000 0.0000000           NA        NA         NA
4 vigueirat    997.5001   798.1894 0.8001898     72.46542  13730.78   2.362988
5     Total 255631.4221     0.0000 0.0000000      0.00000      0.00   0.000000

Density:
      Label  Estimate        se        cv        lcl      ucl         df
1  camargue 445.47095 111.89366 0.2511806 272.953493 727.0263 117.699918
2      gard 313.16194 115.85610 0.3699559 151.335919 648.0312  34.838134
3     rhone   0.00000   0.00000 0.0000000         NA       NA         NA
4 vigueirat  53.76693  43.02375 0.8001898   3.906008 740.1120   2.362988
5     Total 387.71284   0.00000 0.0000000   0.000000   0.0000   0.000000

er_method=2

groups <- ds(df, key="hn", cutpoints = dist.bins, er_method = 2)
summary(groups)
Summary for individuals

Summary statistics:
     Region      Area CoveredArea  Effort     n  k        ER     se.ER     cv.ER mean.size   se.mean
1  camargue 439.19964   252.49824 350.692 63401 22 180.78827 42.345959 0.2342296  960.6212 185.85088
2      gard 188.34741   108.64656 150.898 19178 15 127.09247 41.824256 0.3290852  767.1200 167.91072
3     rhone  13.23249     7.64712  10.621     0  3   0.00000  0.000000 0.0000000    0.0000   0.00000
4 vigueirat  18.55230    10.95480  15.215   332  2  21.82057  7.069714 0.3239930  110.6667  84.69816
5     Total 659.33184   379.74672 527.426 82911 42 157.19930 30.729351 0.1954802  882.0319 138.48758

Abundance:
      Label    Estimate         se        cv          lcl      ucl        df
1  camargue 195650.6827 49978.7900 0.2554491 117034.46749 327076.2 29.469070
2      gard  58983.2393 20320.3754 0.3445110  29078.93897 119640.6 16.792117
3     rhone      0.0000     0.0000 0.0000000      0.00000      0.0  0.000000
4 vigueirat    997.5001   338.8011 0.3396502     59.21283  16803.9  1.207644
5     Total 255631.4221 56178.3864 0.2197632 165027.69422 395978.5 44.244253

Density:
      Label  Estimate        se        cv        lcl      ucl        df
1  camargue 445.47095 113.79515 0.2554491 266.472137 744.7096 29.469070
2      gard 313.16194 107.88774 0.3445110 154.389907 635.2125 16.792117
3     rhone   0.00000   0.00000 0.0000000   0.000000   0.0000  0.000000
4 vigueirat  53.76693  18.26195 0.3396502   3.191671 905.7585  1.207644
5     Total 387.71284  85.20503 0.2197632 250.295352 600.5755 44.244253
lenthomas commented 1 year ago

Just a note to say that @dill is going to post a response to the original question that raised the issue, about variance decomposition, here.

dill commented 1 year ago

I'll assume that @LHMarshall's work at #61 fixes the issue raised here.

Going back to the original issue, the variance contributions are hard to calculate from these tables.

Two rough recipes:

  1. Buckland et al (2001): take the CV of the p's, of the required encounter rate estimate (using n/L), and of the empirical group size estimate; add the squared CVs. If there is no group size, set that component to zero. (A rough sketch of this recipe is given below.)
  2. Innes et al (2002): take the CV of the p's and of the required encounter rate estimate (using Nhat/L). Group size variance is included in the encounter rate estimator (see Marques and Buckland 2003 and Innes et al 2002).

In either case, this is what we end up with in the variance of Nhat in the dht tables. So one could square CV(Nhat) and subtract the other squared CVs to get the remaining component. But this should be resolved by #70 (once I've done that).
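
As a concrete illustration of recipe 1, here is a rough, hedged sketch using the ClusterExercise example from earlier in the thread. The slot names (s$dht$clusters$summary, s$dht$individuals$summary, s$ds$average.p and so on) are taken from the code above; the exact columns may need checking against the current mrds/Distance versions, the Total row of cv.ER was affected by the post-processing bug discussed above, and this is not a drop-in replacement for the package's own variance code.

# Rough sketch of recipe 1 (Buckland et al 2001), not the package internals.
# Slot names follow the ClusterExercise code earlier in this thread.
library(Distance)
data(ClusterExercise)
fit <- ds(ClusterExercise, key = "hr", er_method = 1)
s <- summary(fit)

# detection function component
cv.p <- s$ds$average.p.se[1, 1] / s$ds$average.p

# encounter rate component (clusters, n/L), per stratum
cl <- s$dht$clusters$summary
cv.ER <- cl$se.ER / cl$ER

# group size component (empirical mean cluster size); zero if unclustered
ind <- s$dht$individuals$summary
cv.s <- ind$se.mean / ind$mean.size

# recipe 1: add the squared CVs to approximate CV(Nhat) for individuals
sqrt(cv.p^2 + cv.ER^2 + cv.s^2)

Under recipe 2 (Innes et al 2002) the group size term is dropped and the encounter rate component is computed from Nhat/L rather than n/L, so it cannot be read directly off these summary tables.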