R4EPI / sitrep

Report templates and helper functions for applied epidemiology
https://r4epi.github.io/sitrep/
GNU General Public License v3.0
40 stars 14 forks source link

NaN in CI for tab_survey in mortality survey #203

Closed aspina7 closed 2 years ago

aspina7 commented 4 years ago

stratified tab_surveys seem to not produce confidence intervals. All return 0--NAN (see outputs of mortality and vaccination for examples)

zkamvar commented 4 years ago

Clarification: it's not happening in all stratified surveys, but it is in some of them. The bug appears when the survey design is weighted with 1 - level cluster sampling design with strata and the table is stratified with the strata defined. I'll try to get a reprex working soon.

zkamvar commented 4 years ago

Update: as I'm going through this, I'm thinking more and more that it may have to do with the way the weights are constructed.

zkamvar commented 4 years ago

When you look at the population data, the total for each population is 10k, but when you tabulate the populations with the survey data, each one comes out between 160k and 170k.

Is the correct way to calculate the cluster sampling with survey to multiply the stratified weights with the cluster weights?

aspina7 commented 4 years ago

When you look at the population data, the total for each population is 10k, but when you tabulate the populations with the survey data, each one comes out between 160k and 170k.

Is the correct way to calculate the cluster sampling with survey to multiply the stratified weights with the cluster weights?

not sure i understand the question?

zkamvar commented 4 years ago

I WILL create a reprex soon enough, but for now, here's what I have:

Observed, when we attempt to count death by health district (a stratifier), we end up with this:

> tab_survey(survey_design, died, 
+                                 strata = health_district,
+                                 deff = TRUE, 
+                                 keep = "TRUE")
# A tibble: 1 x 8
  variable value `District A n` `District A ci` `District A deff` `District B n` `District B ci` `District B deff`
  <chr>    <chr>          <dbl> <chr>                       <dbl>          <dbl> <chr>                       <dbl>
1 died     TRUE           9485. 6.2% (0.0--NaN)             0.102          8216. 4.4% (0.0--NaN)             0.102

Same thing happens if we use the cluster weighted data:

> tab_survey(survey_design_cluster, died, 
+                                 strata = health_district,
+                                 deff = TRUE, 
+                                 keep = "TRUE")
# A tibble: 1 x 8
  variable value `District A n` `District A ci` `District A deff` `District B n` `District B ci` `District B deff`
  <chr>    <chr>          <dbl> <chr>                       <dbl>          <dbl> <chr>                       <dbl>
1 died     TRUE            229. 6.6% (0.0--NaN)              7.99           138. 3.5% (0.0--NaN)              7.99

BUT if we use the stratified weighted data, the CIs are back to normal:

> tab_survey(survey_design_strata, died, 
+                                 strata = health_district,
+                                 deff = TRUE, 
+                                 keep = "TRUE")
# A tibble: 1 x 8
  variable value `District A n` `District A ci`  `District A deff` `District B n` `District B ci`  `District B deff`
  <chr>    <chr>          <dbl> <chr>                        <dbl>          <dbl> <chr>                        <dbl>
1 died     TRUE            670. 6.7% (3.1--14.0)              2.58           518. 5.2% (1.7--14.4)              2.58

I'm not sure WHY this is happening, but it's clear that there is some weird conflict between the stratifying variable and cluster weighting: https://github.com/R4EPI/sitrep/blob/5ebdfc428a64968678547da8e9fd23627ab32d41/inst/rmarkdown/templates/mortality/skeleton/skeleton.Rmd#L831-L861

zkamvar commented 4 years ago

Note: that if you attempt to stratify by Village, both ends of the CI are NaN, I suspect that this may be the root cause of the error since the health districts and villages are correlated:

> survey_design_cluster %>% as_tibble() %>% select(village, health_district) %>% table()
           health_district
village     District A District B
  Village A        125          0
  Village B        117          0
  Village C          0        133
  Village D          0         96
>  
zkamvar commented 4 years ago

FOUND A REPRODUCIBLE EXAMPLE:

https://stackoverflow.com/questions/51635867/r-svyciprop-not-working-with-just-two-groups-in-a-cluster

zkamvar commented 4 years ago

Effectively, the problem is that, with only two clusters per district and two districts, you end up with a situation where the degrees of freedom is 1, which causes glm to choke (for reasons) and basically say that your confidence interval could really be anything, buddy!

The proposed solution is to adjust the degrees of freedom to infinity.

No one in the original SO post seems to know whether or not this is a "good idea"....

and I'm all (屮゚Д゚)屮

For what it's worth, the degrees of freedom is set to infinity for all methods except "beta", "likelihood", and "logit" in the svyciprop() function with a caveat:

df - denominator degrees of freedom, for all methods except "beta". Use Inf for confidence intervals based on a Normal distribution, and for "likelihood" and "logit" use NULL for the default method in glms (currently degf(design)-1, but this may be improved in the future)

I have NO clue what "may be improved in the future means, but I think we can now comfortably rest knowing that we have found the source of the problem.

I will add a df parameter to the function and see what happens.

zkamvar commented 4 years ago

Update: It turns out that it's not as simple as adding a df parameter because srvyr does not pass it to the estimation itself and that is .... frustrating.

zkamvar commented 4 years ago

When I think about this more, it has more to do with the fact that it is giving us the correct result, for attempting to compare one of the two districts where we only have two villages. The degrees of freedom for the original survey design is 2. When we take away one of the health districts in order to calculate the proportion of those who died or the proportions for each disease, we end up with one degree of freedom, which means that we are basically trying to fit a line between two points.

I'm going to leave this as it is for now and work on cleaning up the templates for consistency

zkamvar commented 4 years ago

FWIW, if you take the proportions from the entire data set, you do get the confidence intervals returning.

aspina7 commented 4 years ago

mmm so this is also the case NaN thing is also the case in tab_univariate...

zkamvar commented 4 years ago

mmm so this is also the case NaN thing is also the case in tab_univariate...

Care to elaborate? I believe we addressed this issue as not a bug since the degrees of freedom for our examples were too low to produce confidence intervals.

If there is a bug in tab_univariate, create a reprex and open a new issue.

aspina7 commented 4 years ago

my bad see https://r4epis.netlify.com/training/walk-through/univariate/ under the stratified bit. You are right it only appears when cannot compute...

aspina7 commented 2 years ago

no longer relevant as being retired