gergness / srvyr

R package to add 'dplyr'-like Syntax for Summary Statistics of Survey Data

Unexpected output with survey_prop when dealing with NAs #156

Closed yannsay-impact closed 3 months ago

yannsay-impact commented 1 year ago

Hello, I am building some functions around srvyr, and it seems that survey_prop, depending on the proportion argument, yields different outcomes when dealing with missing data.

With only NAs and 2 groups, and the proportion argument set to TRUE, I get an error (traceback at the end of the message).

If I change the proportion argument, the function (survey_mean instead of survey_prop), or the dataset (all values missing, or only one group with all values missing), I at least get a result rather than an error. See below a summary table of the different cases, followed by the code.

Would you have any suggestion for how to handle case 1 so that the output is predictable (i.e. a data frame, even an empty one, rather than an error)?

| case | function | proportion argument | dataset | result |
|---|---|---|---|---|
| 1 | survey_prop | TRUE | only NAs | error |
| 2 | survey_prop | FALSE | only NAs | data frame |
| 3 | survey_mean | | only NAs | data frame |
| 4 | survey_prop | TRUE | 2 groups, 1 group has only NAs | data frame |
| 5 | survey_prop | FALSE | 2 groups, 1 group has only NAs | data frame |
| 6 | survey_mean | | 2 groups, 1 group has only NAs | data frame |

When using proportion argument set to FALSE, or survey_mean, or having at least 1 group with non-missing data (and

library(magrittr)  # provides the %>% pipe used in the cases below

somedata <- data.frame(
  groups = rep(c("a", "b"), 50),
  value = rep(NA_character_, 100)
)

srvyr_survey <- srvyr::as_survey(somedata, strata = groups)

#case 1
srvyr_survey %>%
  dplyr::group_by(dplyr::across(dplyr::any_of("groups"))) %>%
  dplyr::filter(!is.na(value), .preserve = TRUE) %>%
  srvyr::summarise(srvyr::survey_prop(vartype = "ci", proportion = TRUE),
                   n = dplyr::n(),
                   n_w = srvyr::survey_total(
                     vartype = "ci",
                     na.rm = TRUE
                   ))

#case 2
srvyr_survey %>%
  dplyr::group_by(dplyr::across(dplyr::any_of("groups"))) %>%
  dplyr::filter(!is.na(value), .preserve = TRUE) %>%
  srvyr::summarise(srvyr::survey_prop(vartype = "ci", proportion = FALSE),
                   n = dplyr::n(),
                   n_w = srvyr::survey_total(
                     vartype = "ci",
                     na.rm = TRUE
                   ))

#case 3
srvyr_survey %>%
  dplyr::group_by(dplyr::across(dplyr::any_of("groups"))) %>%
  dplyr::filter(!is.na(value), .preserve = TRUE) %>%
  srvyr::summarise(srvyr::survey_mean(vartype = "ci"),
                   n = dplyr::n(),
                   n_w = srvyr::survey_total(
                     vartype = "ci",
                     na.rm = TRUE
                   ))

somedata2 <- data.frame(
  groups = rep(c("a", "b"), 50),
  value = rep(c("aa", NA_character_), 50)
)

srvyr_survey2 <- srvyr::as_survey(somedata2, strata = groups)

#case 4
srvyr_survey2 %>%
  dplyr::group_by(dplyr::across(dplyr::any_of("groups"))) %>%
  dplyr::filter(!is.na(value), .preserve = TRUE) %>%
  srvyr::summarise(srvyr::survey_prop(vartype = "ci", proportion = TRUE),
                   n = dplyr::n(),
                   n_w = srvyr::survey_total(
                     vartype = "ci",
                     na.rm = TRUE
                   ))

#case 5
srvyr_survey2 %>%
  dplyr::group_by(dplyr::across(dplyr::any_of("groups"))) %>%
  dplyr::filter(!is.na(value), .preserve = TRUE) %>%
  srvyr::summarise(srvyr::survey_prop(vartype = "ci", proportion = FALSE),
                   n = dplyr::n(),
                   n_w = srvyr::survey_total(
                     vartype = "ci",
                     na.rm = TRUE
                   ))

#case 6
srvyr_survey2 %>%
  dplyr::group_by(dplyr::across(dplyr::any_of("groups"))) %>%
  dplyr::filter(!is.na(value), .preserve = TRUE) %>%
  srvyr::summarise(srvyr::survey_mean(vartype = "ci"),
                   n = dplyr::n(),
                   n_w = srvyr::survey_total(
                     vartype = "ci",
                     na.rm = TRUE
                   ))
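
Until the error itself is handled, one possible workaround for case 1 is to wrap the summary call so that a failure returns a fallback value. This is only a minimal sketch in base R: `summarise_or_fallback` is a hypothetical helper name, not a srvyr function, and the demo calls use a plain `stop()` in place of the failing pipeline.

```r
# Hypothetical wrapper (not part of srvyr): evaluate an expression and return
# `fallback` if it raises an error. `expr` is evaluated lazily inside
# tryCatch(), so errors raised while computing it are caught here.
summarise_or_fallback <- function(expr, fallback = data.frame()) {
  tryCatch(
    expr,
    error = function(e) {
      # Surface the original message so the failure is not silently hidden.
      warning("summary failed, returning fallback: ", conditionMessage(e),
              call. = FALSE)
      fallback
    }
  )
}

# Stand-in for the failing case 1 pipeline: any expression that errors.
failed <- summarise_or_fallback(stop("Argument mu must be a nonempty numeric vector"))

# An expression that succeeds is returned unchanged.
ok <- summarise_or_fallback(data.frame(groups = "a", coef = 1))
```

Passing the case 1 pipeline as `expr` should then produce an empty data frame plus a warning instead of an error, although this hides the failure rather than fixing it.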


gergness commented 1 year ago

Shoot, sorry this took so long, it slipped past my notice.

The root problem is that survey::svyciprop() chokes on 0 row data.frames:

somedata <- data.frame(
    groups = rep(c("a", "b"), 50),
    value = rep(NA_character_, 100)
)

srvyr_survey <- srvyr::as_survey(somedata, strata = groups)

survey::svyciprop(~groups=="a", subset(srvyr_survey, !is.na(value)))
#> Error in family$linkfun(mustart): Argument mu must be a nonempty numeric vector

Created on 2023-03-30 with reprex v2.0.2

I think we could check for an empty survey here and return NA? (We need to check what to do for the surveys that keep the data around but set the weight to 0 to remove rows.) https://github.com/gergness/srvyr/blob/main/R/survey_statistics.r#L175
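
As a rough illustration of the idea, a guard could return NA before the estimator is ever called. This is a sketch under stated assumptions, not the code at the linked line: `ciprop_or_na` is a hypothetical name, the design is assumed to expose its data frame as `design$variables`, and only the zero-row case is covered (the weight-0 case mentioned above is not).

```r
# Hypothetical guard, not srvyr's actual implementation.
# Assumption: `design$variables` holds the data frame of the survey design.
# `estimator` defaults to survey::svyciprop and is only looked up when the
# design is non-empty.
ciprop_or_na <- function(formula, design, ..., estimator = survey::svyciprop) {
  if (nrow(design$variables) == 0L) {
    # Nothing to estimate from: return NA instead of letting the estimator fail.
    return(NA_real_)
  }
  estimator(formula, design, ...)
}

# Minimal stand-in for an emptied design, used only to exercise the guard.
empty_mock <- list(variables = data.frame(value = character(0)))
guarded <- ciprop_or_na(~ value == "aa", empty_mock)
```

A real fix would also have to return NA in whatever shape the calling code expects (estimate plus interval bounds), which this sketch does not attempt.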

Any objections @bschneidr or @szimmer?

bschneidr commented 1 year ago

I think that makes sense. We just need to make sure that we check for either zero rows or zero records with finite values of design$prob. It might be worthwhile to add a helper function that does this check for every kind of design: survey.design2, svyrep, and twophase2.
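
A helper along these lines might look like the sketch below. It is hypothetical and rests on assumptions that would need checking against the survey package: that survey.design2 and twophase2 objects keep sampling probabilities in `prob` (with removed records marked as non-finite), and that svyrep designs keep sampling weights in `pweights` (with removed records either dropped or given weight 0).

```r
# Hypothetical helper, not part of srvyr or survey.
is_empty_design <- function(design) {
  if (inherits(design, "svyrep.design")) {
    # Assumption: replicate-weight designs store sampling weights in `pweights`.
    w <- design$pweights
    return(length(w) == 0L || !any(is.finite(w) & w > 0))
  }
  # Assumption: survey.design2 and twophase2 store sampling probabilities in
  # `prob`; records removed without dropping rows have a non-finite prob.
  p <- design$prob
  length(p) == 0L || !any(is.finite(p))
}

# Mock objects standing in for real designs, only to exercise each branch.
all_removed <- structure(list(prob = c(Inf, Inf)), class = "survey.design2")
some_kept   <- structure(list(prob = c(0.5, Inf)), class = "survey.design2")
no_rows_rep <- structure(list(pweights = numeric(0)), class = "svyrep.design")

is_empty_design(all_removed)
is_empty_design(some_kept)
is_empty_design(no_rows_rep)
```

Dispatching on class with `inherits()` keeps the check in one place; S3 methods per design class would be an alternative if the three cases diverge further.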