gergness / srvyr

R package to add 'dplyr'-like Syntax for Summary Statistics of Survey Data

survey_mean ignores "na.rm"? #176

Closed hpreysg closed 3 months ago

hpreysg commented 3 months ago

Dear srvyr-author

I am trying to calculate percentages that take only non-missing values into consideration, but the option "na.rm" does not seem to work. The following code shows two variants, one with "na.rm = TRUE" and the other with "na.rm = FALSE", but the result is the same. Did I misunderstand something? Any help is very much appreciated!

================

library(srvyr)
data(api, package = "survey")
table(apistrat$target, useNA = "ifany")

dstrata <- apistrat %>% as_survey_design(strata = stype, weights = pw)

d.1 <- dstrata |> group_by(target) |> summarize(estpct = survey_mean(na.rm = TRUE))

d.2 <- dstrata |> group_by(target) |> summarize(estpct = survey_mean(na.rm = FALSE))

gergness commented 3 months ago

The documentation has been improved in the development version, but this is a common confusion about srvyr.

To get what you want, you need to filter out the missing values before grouping:

d.1 <- dstrata |>
filter(!is.na(target)) |>
group_by(target) |>
summarize(estpct = survey_mean(na.rm = TRUE))

I think Ben does a good job describing here if you want to read more: https://github.com/gergness/srvyr/issues/149#issuecomment-1345464665

I doubt this will help, but the way I think of it is that survey_mean() isn't ignoring na.rm. The missing values are in the grouping variable, which is handled by group_by(), and group_by() matches dplyr's behavior of not dropping missing values: NA is kept as its own group. No variable is passed to survey_mean(), so it has nothing to check for missingness.
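For anyone who wants to see this on a small case, here is a minimal sketch. The `toy` data frame, its column names (`grp`, `w`), and the equal weights are made up for illustration; they are not from the `api` data used above.

```r
library(dplyr)
library(srvyr)

# Hypothetical toy data: one missing value in the grouping variable,
# made-up equal weights
toy <- tibble(
  grp = c("a", "a", "b", NA),
  w   = c(1, 1, 1, 1)
)

# Plain dplyr already keeps NA as a group of its own
na_counts <- toy |> count(grp)

svy <- toy |> as_survey_design(weights = w)

# na.rm has nothing to act on here: no variable is passed to survey_mean(),
# so the NA rows stay in the result as their own group
with_na <- svy |>
  group_by(grp) |>
  summarize(pct = survey_mean(na.rm = TRUE))

# Filtering first removes the NA rows before the groups are formed,
# so the proportions are taken over the non-missing rows only
without_na <- svy |>
  filter(!is.na(grp)) |>
  group_by(grp) |>
  summarize(pct = survey_mean())

print(with_na)
print(without_na)
```

In the second result the proportions sum to 1 over the non-missing groups, which is the "percentages of non-missing values" asked for in the original question.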

hpreysg commented 3 months ago

Thank you for your fast reply. This clears up the issue for me.
