gergness / srvyr

R package to add 'dplyr'-like Syntax for Summary Statistics of Survey Data

Opinion: survey_prop should default to proportion = TRUE #141

Open szimmer opened 2 years ago

szimmer commented 2 years ago

survey_mean and survey_prop are very similar. I feel, based on the function name, survey_prop should default to proportion = TRUE. Thoughts?

gergness commented 2 years ago

If I had a time machine and could set it this way from the start, I think I would agree. I'm less sure that I should do it now, since it would change the behavior of existing code, but maybe it's not so bad.

Using GitHub search, I don't think anyone has specified proportion explicitly, so the change would affect existing code, though possibly for the better. https://github.com/search?l=R&q=survey_prop&type=Code

@bschneidr (or anyone else following), do you have an opinion?

Maybe I could make the change but borrow the warning tools from the tidyverse, like the message summarize shows when no .groups is specified.

library(dplyr)
mtcars %>% group_by(cyl, am) %>% summarize(n = n())
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
#> argument.

bschneidr commented 2 years ago

I think this is a good suggestion, @szimmer.

My sense is that when someone chooses to use survey_prop() rather than survey_mean(), it's because they're trying to (a) write code whose intent is easier for readers to understand, and (b) use a function that's presumably more statistically appropriate for proportions. Changing the default value to proportion = TRUE would make survey_prop() more helpful for (b).

Making this update would change code, but I think it's generally for the better. The default "logit" method used by svyciprop() may not be the best default method, but it should be generally better than the simple Wald method used by svymean().
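To make the difference concrete, here is a minimal sketch comparing the two settings side by side. It assumes the `api` example data shipped with the survey package and passes `proportion` explicitly in both calls, so the result does not depend on which default the installed srvyr version uses. The design and the column names `wald` and `logit` are illustrative choices, not anything from this thread.

```r
library(survey)
library(srvyr)
library(dplyr)

# Example data from the survey package (stratified sample of California schools)
data(api, package = "survey")

strat_design <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

props <- strat_design %>%
  group_by(awards) %>%
  summarize(
    # proportion = FALSE: interval computed as for a mean (Wald-type)
    wald  = survey_prop(vartype = "ci", proportion = FALSE),
    # proportion = TRUE: interval from svyciprop() (logit method by default)
    logit = survey_prop(vartype = "ci", proportion = TRUE)
  )

props
```

The point estimates should match between the two columns; only the `_low` and `_upp` interval bounds differ.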

The Wald interval method has long been known to have coverage issues with complex surveys, and a recent simulation study had some pretty strong recommendations against its use:

We have seen that the Wald CI is badly flawed for estimating proportions in complex surveys due to its severe undercoverage in a variety of situations. Improving the estimation of sampling variance does not salvage the Wald interval, which performs poorly even when the true sampling variance is known... Even when our method cannot be used, a strong recommendation still emerges from our simulations: the Wald interval is not to be used and should be replaced by the preferred non-Wald method...

Carolina Franco et al. 2019 "Comparative Study of Confidence Intervals for Proportions in Complex Sample Surveys", Journal of Survey Statistics and Methodology https://doi.org/10.1093/jssam/smy019
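As a rough illustration of why the two interval types differ, the sketch below computes both by hand in base R. The estimate and standard error are made-up numbers, and it uses a normal quantile for simplicity, so it is not a reproduction of what svymean() or svyciprop() do internally (svyciprop() works from the design and its degrees of freedom).

```r
# Illustrative numbers only: an estimated proportion and its standard error
p_hat <- 0.03
se    <- 0.02
z     <- qnorm(0.975)

# Wald interval: symmetric around p_hat, so it can fall outside [0, 1]
wald <- c(p_hat - z * se, p_hat + z * se)

# Logit interval: build the interval on the log-odds scale, then back-transform.
# Delta method: se(logit(p)) is approximately se / (p * (1 - p)).
logit_p  <- qlogis(p_hat)
se_logit <- se / (p_hat * (1 - p_hat))
logit_ci <- plogis(c(logit_p - z * se_logit, logit_p + z * se_logit))

wald
logit_ci
```

With a small proportion like this one, the Wald lower bound is negative, while the back-transformed logit interval is asymmetric and stays inside (0, 1).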

I guess the only concern here is that users might be surprised if their old analysis results become harder to reproduce. I think a temporary warning in the next release could be good. Something like:

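survey_prop <- function(..., proportion = TRUE) {
  # Other arguments are elided here; only the new warning is shown.
  # missing() is TRUE when the caller did not supply `proportion`.
  if (missing(proportion)) {
    warning("When `proportion` is unspecified, `survey_prop()` now defaults to `proportion = TRUE`. This should improve confidence interval coverage.")
  }
  # ... rest of the function body ...
}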

But it would use the tidyverse warning tools to show this only once per session.
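One way to get once-per-session behavior is `rlang::warn()` with its `.frequency` and `.frequency_id` arguments. This is a minimal sketch under assumptions: `survey_prop_sketch()` is a hypothetical stand-in rather than the real srvyr function, and the id string is an arbitrary choice.

```r
library(rlang)

# Hypothetical stand-in for survey_prop(); only the warning logic is shown.
survey_prop_sketch <- function(proportion = TRUE) {
  if (missing(proportion)) {
    warn(
      paste(
        "When `proportion` is unspecified, `survey_prop()` now defaults to",
        "`proportion = TRUE`. This should improve confidence interval coverage."
      ),
      .frequency = "once",  # signal at most once per R session
      .frequency_id = "survey_prop_proportion_default"  # key used to track it
    )
  }
  invisible(proportion)
}

# Count how many warnings two default calls actually signal
n_warn <- 0
withCallingHandlers(
  {
    survey_prop_sketch()
    survey_prop_sketch()
  },
  warning = function(w) {
    n_warn <<- n_warn + 1
    invokeRestart("muffleWarning")
  }
)
```

Calls that pass `proportion` explicitly never reach the warning, so users who pin the old behavior with `proportion = FALSE` are not nagged.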

szimmer commented 2 years ago

I make an issue and start a discussion and then go on vacation! I agree with the warning and can implement later this week if no one else jumps on it first.

The type of interval to use as a default is a good question. FWIW, SUDAAN and SAS use xlogit as their default.