gergness / srvyr

R package to add 'dplyr'-like Syntax for Summary Statistics of Survey Data

Missing values in survey_prop and survey_mean #174

Closed: tharkanen closed this issue 4 months ago

tharkanen commented 4 months ago

I created some missing values in the example data:

> apistrat2 <- apistrat
> apistrat2$awards[sample(1:200, 20)] <- NA
> strat_design_srvyr <- apistrat2 %>% as_survey_design(1, strata = stype, fpc = fpc, weight = pw, variables = c(stype, awards, starts_with("api")))
> strat_design_srvyr |> group_by(awards) |> summarize(m= survey_mean())
# A tibble: 3 × 3
  awards      m   m_se
  <fct>   <dbl>  <dbl>
1 No     0.333  0.0338
2 Yes    0.572  0.0362
3 NA     0.0953 0.0221
> strat_design_srvyr |> group_by(awards) |> summarize(m= survey_mean(na.rm=TRUE))
# A tibble: 3 × 3
  awards      m   m_se
  <fct>   <dbl>  <dbl>
1 No     0.333  0.0338
2 Yes    0.572  0.0362
3 NA     0.0953 0.0221

I expected na.rm=TRUE to drop the NA row from the output, which is what I need: No + Yes should then sum to 100%, not 90.5%. Is there a way to get that?

bschneidr commented 4 months ago

Hi @tharkanen, you're in good company: this is a recurring point of confusion for users (#161).

The answer, though, is that na.rm does not remove missing values of a grouping variable used in group_by(): if you want to drop them, you have to filter them out before calling group_by() or summarize(). The version of the package on GitHub has documentation that explains this, but the CRAN version doesn't have that documentation yet.

Here's what that documentation looks like: https://github.com/gergness/srvyr/pull/161/files
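
For readers who want to try the filter-first approach from a clean session, here is a minimal self-contained sketch. It assumes the survey and srvyr packages are installed and R is 4.1 or later (for the native pipe). The seed and the object names `design` and `res` are illustrative choices, not part of the original report, so the exact estimates will differ from the ones posted in this thread.

```r
library(survey)  # ships the `api` example data, including apistrat
library(srvyr)   # dplyr-style verbs for survey designs

data(api, package = "survey")

set.seed(1)  # hypothetical seed, only to make this sketch repeatable
apistrat2 <- apistrat
apistrat2$awards[sample(nrow(apistrat2), 20)] <- NA  # inject 20 missing values

design <- as_survey_design(
  apistrat2,
  ids = 1,          # no clustering, as in the original post
  strata = stype,
  fpc = fpc,
  weights = pw
)

# Drop rows with a missing grouping variable *before* grouping,
# so the NA group never reaches summarize().
res <- design |>
  filter(!is.na(awards)) |>
  group_by(awards) |>
  summarize(m = survey_mean())

res
```

With the NA rows filtered out first, the result should contain only the No and Yes groups, and their proportions should sum to 1.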

tharkanen commented 4 months ago

Thanks a lot @bschneidr! These instructions give the same results without the NA category:

> strat_design_srvyr |> summarize(m=survey_mean(awards=="No", na.rm=TRUE))
# A tibble: 1 × 2
      m   m_se
  <dbl>  <dbl>
1 0.368 0.0365
> strat_design_srvyr |> summarize(m=survey_mean(awards=="Yes", na.rm=TRUE))
# A tibble: 1 × 2
      m   m_se
  <dbl>  <dbl>
1 0.632 0.0365
> strat_design_srvyr |> filter(!is.na(awards)) |> group_by(awards) |> summarize(m= survey_mean(na.rm=TRUE))
# A tibble: 2 × 3
  awards     m   m_se
  <fct>  <dbl>  <dbl>
1 No     0.368 0.0365
2 Yes    0.632 0.0365
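
As a sanity check on the point estimates, the filtered proportions appear to be the unfiltered ones renormalized over the non-missing groups. The sketch below uses only the rounded numbers printed earlier in this thread; the variable names are illustrative. It says nothing about the standard errors, which cannot be recovered this way and need the survey design.

```r
# Point estimates copied from the unfiltered output (rounded as printed)
p_no  <- 0.333
p_yes <- 0.572

# Share of the weighted population with a non-missing `awards` value
nonmissing <- p_no + p_yes   # 0.905, i.e. the 90.5% mentioned above

# Renormalize so that No + Yes sums to 1
renorm <- round(c(No = p_no, Yes = p_yes) / nonmissing, 3)
renorm
#>    No   Yes
#> 0.368 0.632
```

These values match the 0.368 and 0.632 reported after filtering.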