gergness / srvyr

R package to add 'dplyr'-like Syntax for Summary Statistics of Survey Data
209 stars 28 forks source link

Add a correlation function - feature request #150

Closed szimmer closed 1 year ago

szimmer commented 1 year ago

Correlation is a common descriptive statistic used in surveys. This is possible in survey but not srvyr.

Perhaps output similar to corrr::correlate(): https://www.tidyverse.org/blog/2020/12/corrr-0-4-3/

Related discussion: https://community.rstudio.com/t/correlation-spearman-for-complex-survey-samples/150255

bschneidr commented 1 year ago

Basic implementation example

Here's a basic implementation of a survey_corr() function, which is very similar in style to survey_var() or survey_sd().

library(srvyr)

data('api', package = 'survey')

apisrs |> as_survey_design(.ids = 1) |>
   summarize(api_corr = survey_corr(x = api00, y = api99))

#> # A tibble: 1 × 2
#>   api_corr api_corr_se
#>      <dbl>       <dbl>
#> 1    0.975     0.00461

Implementation details

The implementation idea is to use svyvar() under the hood to get point estimates for covariances and their sampling variance-covariance matrix, then use svycontrast() to get the sampling variance estimates of the correlation derived from the covariance point estimates. The same warnings about confidence intervals made in the current documentation for survey_var() and survey_sd() would apply to this new function, too.

Click to show full code ``` r library(srvyr) data('api', package = 'survey') # Define a 'srvyr' function for estimating correlation coefficient ---- survey_corr <- function( x, y, na.rm = FALSE, vartype = c("se", "ci", "var", "cv"), ... ) { if (missing(vartype)) vartype <- "se" vartype <- match.arg(vartype, several.ok = TRUE) # Add the necessary variables to the survey design .svy <- srvyr::set_survey_vars(srvyr::cur_svy(), x, name = "__SRVYR_TEMP_VAR_X__") .svy <- srvyr::set_survey_vars(.svy, y, name = "__SRVYR_TEMP_VAR_Y__", add = TRUE) # Get point estimates for population variances and covariance if (inherits(.svy, 'svyrep.design')) { var_est <- survey:::svyvar( x = ~ `__SRVYR_TEMP_VAR_X__` + `__SRVYR_TEMP_VAR_Y__`, na.rm = na.rm, design = .svy, return.replicates = TRUE) } else { var_est <- survey:::svyvar( x = ~ `__SRVYR_TEMP_VAR_X__` + `__SRVYR_TEMP_VAR_Y__`, na.rm = na.rm, design = .svy) } # Turn matrix of point estimates into a vector # (and deduplicate) point_estimates <- as.vector(coef(var_est)[c(1,2,4)]) names(point_estimates) <- c('var_x', 'cov_xy', 'var_y') # Obtain sampling variance-covariance matrix of the point estimates vcov_mat <- vcov(var_est)[c(1,2,4), c(1,2,4)] rownames(vcov_mat) <- names(point_estimates) colnames(vcov_mat) <- names(point_estimates) class(point_estimates) <- "svystat" attr(point_estimates, 'var') <- vcov_mat attr(point_estimates, 'statistic') <- "covariances" # Wrap it up into a 'svystat' object if (inherits(.svy, 'svyrep.design')) { svstat <- list( 'covariances' = point_estimates, 'replicates' = (var_est$replicates[,c(1,2,4)]) |> `colnames<-`(value = names(point_estimates)) ) attr(svstat$replicates, 'scale') <- attr(var_est$replicates, 'scale') attr(svstat$replicates, 'rscales') <- attr(var_est$replicates, 'rscales') attr(svstat$replicates, 'mse') <- attr(var_est$replicates, 'mse') class(svstat) <- "svrepstat" } else { svstat <- point_estimates } # Estimate correlation and use Delta Method to obtain standard error out <- survey::svycontrast( stat = svstat, contrasts = list( 'corr' = quote( cov_xy / sqrt(var_x * var_y) ) )) # Wrap up into a 'srvyr' object out <- srvyr::get_var_est(out, vartype) srvyr::as_srvyr_result_df(out) } # Try out the new function ---- apisrs |> as_survey_design(.ids = 1) |> summarize(api_corr = survey_corr(x = api00, y = api99, na.rm = TRUE)) #> # A tibble: 1 × 2 #> api_corr api_corr_se #> #> 1 0.975 0.00461 apisrs |> as_survey_design(.ids = 1) |> as_survey_rep(type = "JK1") |> summarize(api_corr = survey_corr(x = api00, y = api99, na.rm = TRUE)) #> # A tibble: 1 × 2 #> api_corr api_corr_se #> #> 1 0.975 0.00463 ```
Sanity check of results ``` r library(survey) library(srvyr) data('api', package = 'survey') jackknife_design <- apisrs |> as_survey_design(.ids = 1) |> as_survey_rep(type = "JK1") withReplicates( design = jackknife_design, theta = function(weights, data) { # Weighted covariance and standard deviations x_dev <- (data$api00 - weighted.mean(data$api00, weights)) y_dev <- (data$api99 - weighted.mean(data$api99, weights)) cov <- sum(weights * (x_dev * y_dev)) / (sum(weights) - 1) sd_x <- sqrt(sum(weights * (x_dev^2))/(sum(weights) - 1)) sd_y <- sqrt(sum(weights * (y_dev^2))/(sum(weights) - 1)) # Calculate correlation corr <- cov/(sd_x * sd_y) return(corr) } ) #> theta SE #> [1,] 0.97549 0.0046 ```

Some thoughts?

Correlation statistics other than Pearson's

This is all pretty straightforward for the Pearson correlation. The Spearman correlation discussed in that "community.rstudio" discussion is trickier. I would think you could maybe get a reasonable estimate by using the Pearson correlation method implemented here, but applying it to the sample ranks rather than the raw values of the variables (since that's one way to calculate Spearman's rho). But I'd be uncomfortable implementing a Spearman correlation method without having a good reference to back up the use of this for complex samples; maybe there's some nuance here that doesn't occur to me.

Interface

This proposed implementation is different from the corrr interface you mentioned. Personally, I'm not wild about using the 'corrr' interface in a complex survey context. But I get why others might like using it.

Perhaps a feature request could be made to the 'corrr' package to make colpair_map() an S3 generic. Then maybe it would be possible to do something ilke colpair_map(my_survey_design, survey_corr) to get a correlation matrix in the usual corrr style.

But in general, getting a full correlation matrix might just be something where it's preferable to use the 'survey' package. IMO, matrix outputs are just not something that really fits naturally in the dplyr / tidy framework. FWIW, here's the code I would use to get a Spearman correlation matrix from complex survey data.

Click to show R code ```r library(survey) library(srvyr) data('api', package = 'survey') # Create a survey design object (tbl_svy) cov_matrix <- apisrs |> as_survey_design(.ids = 1) |> # Convert variables to ranks suitable for Spearman's correlation mutate(across(c(api00, api99), rank, na.last = "keep", ties.method = "average")) |> # Estimate the population variance-covariance of the ranks svyvar(x = ~ api00 + api99) |> # Extract estimated variance-covariance matrix coef() # Convert to correlation matrix cov_matrix |> cov2cor() #> api00 api99 #> api00 1.0000000 0.9735847 #> api99 0.9735847 1.0000000 # Use `corrr` function to reformat to dataframe cov_matrix |> cov2cor() |> corrr::as_cordf() #> # A tibble: 2 × 3 #> term api00 api99 #> #> 1 api00 NA 0.974 #> 2 api99 0.974 NA # Sanity Check: apisrs[,c('api00', 'api99')] |> cor(method = "spearman") #> api00 api99 #> api00 1.0000000 0.9735847 #> api99 0.9735847 1.0000000 ```
gergness commented 1 year ago

Sorry to just leave this hanging! This was our first year in a school with a full 2 weeks off for winter break so it wasn't quite like the old days when I'd get a lot of projects done in that time.

I think the survey_corr() fits better into existing srvyr functionality, but am also interested in seeing what it would look like to mimic the corrr API. Might be a way to address concerns in #75

gergness commented 1 year ago

closed via #151