Closed aspina7 closed 2 years ago
Clarification: it's not happening in all stratified surveys, but it is in some of them. The bug appears when the survey design is weighted with 1 - level cluster sampling design with strata and the table is stratified with the strata defined. I'll try to get a reprex working soon.
Update: as I'm going through this, I'm thinking more and more that it may have to do with the way the weights are constructed.
When you look at the population data, the total for each population is 10k, but when you tabulate the populations with the survey data, each one comes out between 160k and 170k.
Is the correct way to calculate the cluster sampling with survey to multiply the stratified weights with the cluster weights?
When you look at the population data, the total for each population is 10k, but when you tabulate the populations with the survey data, each one comes out between 160k and 170k.
Is the correct way to calculate the cluster sampling with survey to multiply the stratified weights with the cluster weights?
not sure i understand the question?
I WILL create a reprex soon enough, but for now, here's what I have:
Observed, when we attempt to count death by health district (a stratifier), we end up with this:
> tab_survey(survey_design, died,
+ strata = health_district,
+ deff = TRUE,
+ keep = "TRUE")
# A tibble: 1 x 8
variable value `District A n` `District A ci` `District A deff` `District B n` `District B ci` `District B deff`
<chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 died TRUE 9485. 6.2% (0.0--NaN) 0.102 8216. 4.4% (0.0--NaN) 0.102
Same thing happens if we use the cluster weighted data:
> tab_survey(survey_design_cluster, died,
+ strata = health_district,
+ deff = TRUE,
+ keep = "TRUE")
# A tibble: 1 x 8
variable value `District A n` `District A ci` `District A deff` `District B n` `District B ci` `District B deff`
<chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 died TRUE 229. 6.6% (0.0--NaN) 7.99 138. 3.5% (0.0--NaN) 7.99
BUT if we use the stratified weighted data, the CIs are back to normal:
> tab_survey(survey_design_strata, died,
+ strata = health_district,
+ deff = TRUE,
+ keep = "TRUE")
# A tibble: 1 x 8
variable value `District A n` `District A ci` `District A deff` `District B n` `District B ci` `District B deff`
<chr> <chr> <dbl> <chr> <dbl> <dbl> <chr> <dbl>
1 died TRUE 670. 6.7% (3.1--14.0) 2.58 518. 5.2% (1.7--14.4) 2.58
I'm not sure WHY this is happening, but it's clear that there is some weird conflict between the stratifying variable and cluster weighting: https://github.com/R4EPI/sitrep/blob/5ebdfc428a64968678547da8e9fd23627ab32d41/inst/rmarkdown/templates/mortality/skeleton/skeleton.Rmd#L831-L861
Note: that if you attempt to stratify by Village, both ends of the CI are NaN, I suspect that this may be the root cause of the error since the health districts and villages are correlated:
> survey_design_cluster %>% as_tibble() %>% select(village, health_district) %>% table()
health_district
village District A District B
Village A 125 0
Village B 117 0
Village C 0 133
Village D 0 96
>
FOUND A REPRODUCIBLE EXAMPLE:
Effectively, the problem is that, with only two clusters per district and two districts, you end up with a situation where the degrees of freedom is 1, which causes glm to choke (for reasons) and basically say that your confidence interval could really be anything, buddy!
The proposed solution is to adjust the degrees of freedom to infinity.
No one in the original SO post seems to know whether or not this is a "good idea"....
and I'm all (屮゚Д゚)屮
For what it's worth, the degrees of freedom is set to infinity for all methods except "beta", "likelihood", and "logit" in the svyciprop()
function with a caveat:
df - denominator degrees of freedom, for all methods except "beta". Use Inf for confidence intervals based on a Normal distribution, and for "likelihood" and "logit" use NULL for the default method in glms (currently degf(design)-1, but this may be improved in the future)
I have NO clue what "may be improved in the future means, but I think we can now comfortably rest knowing that we have found the source of the problem.
I will add a df parameter to the function and see what happens.
Update: It turns out that it's not as simple as adding a df parameter because srvyr does not pass it to the estimation itself and that is .... frustrating.
When I think about this more, it has more to do with the fact that it is giving us the correct result, for attempting to compare one of the two districts where we only have two villages. The degrees of freedom for the original survey design is 2. When we take away one of the health districts in order to calculate the proportion of those who died or the proportions for each disease, we end up with one degree of freedom, which means that we are basically trying to fit a line between two points.
I'm going to leave this as it is for now and work on cleaning up the templates for consistency
FWIW, if you take the proportions from the entire data set, you do get the confidence intervals returning.
mmm so this is also the case NaN thing is also the case in tab_univariate...
mmm so this is also the case NaN thing is also the case in tab_univariate...
Care to elaborate? I believe we addressed this issue as not a bug since the degrees of freedom for our examples were too low to produce confidence intervals.
If there is a bug in tab_univariate, create a reprex and open a new issue.
my bad see https://r4epis.netlify.com/training/walk-through/univariate/ under the stratified bit. You are right it only appears when cannot compute...
no longer relevant as being retired
stratified tab_surveys seem to not produce confidence intervals. All return 0--NAN (see outputs of mortality and vaccination for examples)