Closed szimmer closed 1 year ago
Here's a basic implementation of a survey_corr() function, which is very similar in style to survey_var() or survey_sd().
library(srvyr)
data('api', package = 'survey')
apisrs |> as_survey_design(.ids = 1) |>
summarize(api_corr = survey_corr(x = api00, y = api99))
#> # A tibble: 1 × 2
#> api_corr api_corr_se
#> <dbl> <dbl>
#> 1 0.975 0.00461
The implementation idea is to use svyvar() under the hood to get point estimates for covariances and their sampling variance-covariance matrix, then use svycontrast() to get the sampling variance estimates of the correlation derived from the covariance point estimates. The same warnings about confidence intervals made in the current documentation for survey_var() and survey_sd() would apply to this new function, too.
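To make the two steps concrete, here is a rough sketch of the idea using only the 'survey' package. It is not the code from the proposed implementation: the object names (srs_design, point_est, stat, and so on) are made up for illustration, and picking positions 1, 2 and 4 assumes that svyvar() stores its 2 x 2 result column-wise with the matching 4 x 4 sampling variance matrix in the "var" attribute.

```r
library(survey)
data(api, package = "survey")

# Simple random sample; pw is the sampling weight column shipped with apisrs
srs_design <- svydesign(ids = ~1, weights = ~pw, data = apisrs)

# Step 1: variances and covariance of the two variables, plus the sampling
# variance-covariance matrix of those estimates (assumed to sit in "var")
var_est <- svyvar(~ api00 + api99, design = srs_design)

# Flattened column-wise, the 2 x 2 result is var(x), cov(x, y), cov(y, x),
# var(y), so positions 1, 2 and 4 are the unique entries
keep <- c(1, 2, 4)
point_est <- as.vector(unclass(var_est))[keep]
names(point_est) <- c("var_x", "cov_xy", "var_y")
vcov_est <- as.matrix(attr(var_est, "var"))[keep, keep]
dimnames(vcov_est) <- list(names(point_est), names(point_est))

# Repackage as a "svystat" object so svycontrast() can work with it
stat <- structure(point_est, var = vcov_est, statistic = "covariances",
                  class = "svystat")

# Step 2: delta-method variance of the correlation, a nonlinear function
# of the three estimates
corr_est <- svycontrast(stat, list(corr = quote(cov_xy / sqrt(var_x * var_y))))

corr_value <- unname(coef(corr_est))
corr_se <- unname(SE(corr_est))
```

For an equal-probability sample like this one, corr_value should match cor(apisrs$api00, apisrs$api99), which is a handy sanity check on the indexing.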
This is all pretty straightforward for the Pearson correlation. The Spearman correlation discussed in that "community.rstudio" discussion is trickier. I would think you could maybe get a reasonable estimate by using the Pearson correlation method implemented here, but applying it to the sample ranks rather than the raw values of the variables (since that's one way to calculate Spearman's rho). But I'd be uncomfortable implementing a Spearman correlation method without having a good reference to back up the use of this for complex samples; maybe there's some nuance here that doesn't occur to me.
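Purely to illustrate the rank-based idea, and not as a recommended method, a sketch might look like the following. It assumes a version of srvyr that includes survey_corr(); the column names rank_api00 and rank_api99 are made up. Note that the ranks here are plain unweighted sample ranks and the standard error treats them as fixed values, which is exactly the kind of nuance that would need a reference before relying on it for a complex design.

```r
library(dplyr)
library(srvyr)
data(api, package = "survey")

spearman_est <- apisrs |>
  # Unweighted sample ranks; ties get average ranks (the rank() default)
  mutate(rank_api00 = rank(api00), rank_api99 = rank(api99)) |>
  as_survey_design(ids = 1) |>
  # Pearson correlation of the ranks, i.e. Spearman's rho for this sample
  summarize(api_spearman = survey_corr(x = rank_api00, y = rank_api99))
```

For a simple random sample the point estimate should agree with cor(apisrs$api00, apisrs$api99, method = "spearman"); whether the accompanying standard error is appropriate is the open question.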
This proposed implementation is different from the corrr interface you mentioned. Personally, I'm not wild about using the 'corrr' interface in a complex survey context. But I get why others might like using it.
Perhaps a feature request could be made to the 'corrr' package to make colpair_map() an S3 generic. Then maybe it would be possible to do something like colpair_map(my_survey_design, survey_corr) to get a correlation matrix in the usual corrr style.
But in general, getting a full correlation matrix might just be something where it's preferable to use the 'survey' package. IMO, matrix outputs are just not something that really fits naturally in the dplyr / tidy framework. FWIW, here's the code I would use to get a Spearman correlation matrix from complex survey data.
Sorry to just leave this hanging! This was our first year in a school with a full 2 weeks off for winter break so it wasn't quite like the old days when I'd get a lot of projects done in that time.
I think survey_corr() fits better into existing srvyr functionality, but I'm also interested in seeing what it would look like to mimic the corrr API. Might be a way to address concerns in #75.
closed via #151
Correlation is a common descriptive statistic used in surveys. This is possible in survey but not srvyr. Perhaps output similar to corrr::correlate(): https://www.tidyverse.org/blog/2020/12/corrr-0-4-3/
Related discussion: https://community.rstudio.com/t/correlation-spearman-for-complex-survey-samples/150255