Open szimmer opened 2 years ago
If I had a time machine and could set it this way from the start I think I agree. I'm less sure that I should do it now that it would change existing code, but maybe it's not so bad.
Using GitHub search, I don't think anyone has specified prop, so it would change code, though possibly for the better.
https://github.com/search?l=R&q=survey_prop&type=Code
@bschneidr (or anyone else following), do you have an opinion?
Maybe I could change it but borrow the warning tools from tidyverse, like they do for summarize when no .groups is specified.
mtcars %>% group_by(cyl, am) %>% summarize(n = n())
#> `summarise()` has grouped output by 'cyl'. You can override using the `.groups`
#> argument.
I think this is a good suggestion, @szimmer.
My sense is that when someone chooses to use survey_prop() rather than survey_mean(), it's because they're trying to (a) write code whose intent is easier for readers to understand, and (b) use a function that's presumably more statistically appropriate for proportions. Changing the default value to proportion = TRUE would make survey_prop() more helpful for (b).
Making this update would change code, but I think it's generally for the better. The default "logit" method used by svyciprop() may not be the best default method, but it should be generally better than the simple Wald method used by svymean().
The Wald interval method has long been known to have coverage issues with complex surveys, and a recent simulation study had some pretty strong recommendations against its use:
We have seen that the Wald CI is badly flawed for estimating proportions in complex surveys due to its severe undercoverage in a variety of situations. Improving the estimation of sampling variance does not salvage the Wald interval, which performs poorly even when the true sampling variance is known... Even when our method cannot be used, a strong recommendation still emerges from our simulations: the Wald interval is not to be used and should be replaced by the preferred non-Wald method...
Carolina Franco et al. 2019 "Comparative Study of Confidence Intervals for Proportions in Complex Sample Surveys", Journal of Survey Statistics and Methodology https://doi.org/10.1093/jssam/smy019
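For anyone who wants to see the two interval types side by side, here is a rough sketch using the survey package directly. The design and the variable (schools with no English language learners in the bundled api data) are just my choice for illustration, not something from this thread. With a small proportion like this, the Wald lower bound can fall below zero, while the logit interval stays inside [0, 1].

```r
library(survey)

# Example data shipped with the survey package (illustrative choice)
data(api)
dclus1 <- svydesign(id = ~dnum, weights = ~pw, data = apiclus1, fpc = ~fpc)

# Wald interval: estimate +/- z * SE, computed from svymean()
wald_est <- svymean(~I(as.numeric(ell == 0)), dclus1)
wald_ci <- confint(wald_est)

# Logit interval: computed on the logit scale, then back-transformed
logit_est <- svyciprop(~I(ell == 0), dclus1, method = "logit")
logit_ci <- confint(logit_est)

print(wald_ci)
print(logit_ci)
```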
I guess the only concern here is users might be surprised if their old analysis results become harder to reproduce. I think a temporary warning to use for the next release could be good. Something like:
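survey_prop <- function(..., proportion = TRUE) {
  # missing() is TRUE only when the caller did not pass `proportion` explicitly
  if (missing(proportion)) {
    warning("When `proportion` is unspecified, `survey_prop()` now defaults to `proportion = TRUE`. This should improve confidence interval coverage.")
  }
  # rest of the function body unchanged
}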
But using the tidyverse warning tools to only show this once per session.
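A minimal sketch of the once-per-session version, assuming rlang::warn() with its .frequency and .frequency_id arguments is the tool meant here. The function name survey_prop_sketch, its x argument, and the id string are made up for illustration; the real survey_prop() has a different signature and body.

```r
library(rlang)

# Hypothetical stand-in for survey_prop(), only to show the warning logic
survey_prop_sketch <- function(x, proportion = TRUE) {
  if (missing(proportion)) {
    warn(
      paste0(
        "When `proportion` is unspecified, `survey_prop()` now defaults to ",
        "`proportion = TRUE`. This should improve confidence interval coverage."
      ),
      .frequency = "once",                               # signal at most once per session
      .frequency_id = "survey_prop_proportion_default"   # made-up id that keys the "once" check
    )
  }
  invisible(proportion)
}
```

Calling survey_prop_sketch(1) twice should warn only the first time, and passing proportion explicitly should never warn.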
I make an issue and start a discussion and then go on vacation! I agree with the warning and can implement later this week if no one else jumps on it first.
The type of interval to use as a default is a good question. FWIW, SUDAAN and SAS use xlogit as their default.
survey_mean and survey_prop are very similar. I feel, based on the function name, survey_prop should default to proportion=TRUE. Thoughts?
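To make the comparison concrete, here is a small sketch of the two calls side by side. The design (apistrat from the survey package, grouped by awards) is only an example I picked, and proportion = TRUE is written out explicitly so the result does not depend on which default is in place.

```r
library(srvyr)

# Example data from the survey package (illustrative choice)
data(api, package = "survey")
dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

res <- dstrata %>%
  group_by(awards) %>%
  summarize(
    # Group proportion with the interval computed as for a mean
    p_mean = survey_mean(vartype = "ci"),
    # Same point estimate, interval from the proportion-specific method
    p_prop = survey_prop(vartype = "ci", proportion = TRUE)
  )

print(res)
```

The point estimates should match; only the interval columns (p_mean_low/p_mean_upp versus p_prop_low/p_prop_upp) are expected to differ.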