gergness / srvyr

R package to add 'dplyr'-like Syntax for Summary Statistics of Survey Data
209 stars 28 forks source link

svychisq in summarise after group_by #152

Open pholck opened 1 year ago

pholck commented 1 year ago

Possibly not an issue and I'm being stupid...but don't see a way to have some attribute of svychisq (e.g. p.value) returned for each group_by selection (presumming the group_by generates a series of r x c tables) in the summarise. I would envision something like: svyobject %>% group_by(Response.1, Response.2) %>% summarise( test=svychisq(.)$p.test, prop = survey_prop() )

Again, apologies if I'm missing something.

bschneidr commented 1 year ago

Hi @pholck, The various statistical testing functions such as svychisq(), svyttest(), or regTermTest() don't have a "srvyr" wrapper. But you can still use them inside the summarize() function. The cur_svy() function is especially helpful for this, as it lets you access the survey design object corresponding to a group of data created by the group_by() function.

Here's an example, where a Chi-Squared test is conducted separately for each group. The output of the test in a given group is turned into a summary data frame using the tidy() function from the broom package. The summary data frame is then stored in a list column.

library(survey)
library(srvyr)

# Examples from ?survey::svydesign
library(survey)
data(api)

# Create survey design object
dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

# Group by `stype` and summarize
dstrata %>%
  group_by(stype) %>%
  summarise(
    prop = survey_prop(),
    chisq_test = svychisq(formula = ~ awards + sch.wide,
                          design = cur_svy()) |> 
      broom::tidy()
  )
#> Multiple parameters; naming those columns ndf, ddf
#> Multiple parameters; naming those columns ndf, ddf
#> Multiple parameters; naming those columns ndf, ddf
#> # A tibble: 3 × 4
#>   stype  prop prop_se chisq_test$ndf  $ddf $statistic    $p.value $method       
#>   <fct> <dbl>   <dbl>          <dbl> <dbl>      <dbl>       <dbl> <chr>         
#> 1 E     0.714       0              1    99       28.0 0.000000722 Pearson's X^2…
#> 2 H     0.122       0              1    49       23.6 0.0000126   Pearson's X^2…
#> 3 M     0.164       0              1    49       21.5 0.0000261   Pearson's X^2…

Created on 2023-01-22 by the reprex package (v2.0.1)

bschneidr commented 1 year ago

This might be worth documenting in the overview vignette, as I can see this being useful for others.

My only reservation is that the tidyverse team has marked cur_data() (the model for cur_svy()) as superseded by a new function pick() they'll be releasing, and it looks like eventually cur_data() will be soft-deprecated. So I don't know yet how this will affect our ability to support cur_svy().

pholck commented 1 year ago

Thanks very much Ben for your quick and informative response. I had no idea about cur_svy() or that it could be utilized. That seems to be just the ticket, and I'll give it a try.

I wholeheartedly think putting some mention in the vignette about this could be useful to other folks like myself (the vignette is a very useful tool). But I do understand about the deprecation possibility being an issue as well.

Thank you - Peter

On Sun, Jan 22, 2023 at 9:09 PM Ben Schneider @.***> wrote:

Hi @pholck https://github.com/pholck, The various statistical testing functions such as svychisq(), svyttest(), or regTermTest() don't have a "srvyr" wrapper. But you can still use them inside the summarize() function. The cur_svy() function http://gdfe.co/srvyr/reference/cur_svy.html is especially helpful for this, as it lets you access the survey design object corresponding to a group of data created by the group_by() function.

Here's an example, where a Chi-Squared test is conducted separately for each group. The output of the test in a given group is turned into a summary data frame using the tidy() function from the broom package. The summary data frame is then stored in a list column.

library(survey)

library(srvyr)

Examples from ?survey::svydesign

library(survey)

data(api)

Create survey design object

dstrata <- apistrat %>%

as_survey_design(strata = stype, weights = pw)

Group by stype and summarize

dstrata %>%

group_by(stype) %>%

summarise(

prop = survey_prop(),

chisq_test = svychisq(formula = ~ awards + sch.wide,

                      design = cur_svy()) |>

  broom::tidy()

)

> Multiple parameters; naming those columns ndf, ddf

> Multiple parameters; naming those columns ndf, ddf

> Multiple parameters; naming those columns ndf, ddf

> # A tibble: 3 × 4

> stype prop prop_se chisq_test$ndf $ddf $statistic $p.value $method

>

> 1 E 0.714 0 1 99 28.0 0.000000722 Pearson's X^2…

> 2 H 0.122 0 1 49 23.6 0.0000126 Pearson's X^2…

> 3 M 0.164 0 1 49 21.5 0.0000261 Pearson's X^2…

Created on 2023-01-22 by the reprex package https://reprex.tidyverse.org (v2.0.1)

— Reply to this email directly, view it on GitHub https://github.com/gergness/srvyr/issues/152#issuecomment-1399650046, or unsubscribe https://github.com/notifications/unsubscribe-auth/AO7FU6NOCUAKXRL7RZI5HN3WTXD33ANCNFSM6AAAAAAUDDS55M . You are receiving this because you were mentioned.Message ID: @.***>

pholck commented 9 months ago

Hi Ben - you were kind enough in the past to answer a question for me about your excellent srvyr package. And so I've come back with a different question!

survey_prop is a wrapper I believe for survey:svyciprop(). As you know, that function supports several different methods for CI computation, including both logit and xlogit methods (among the several others). It seems to me however that survey_prop does not support the xlogit method (either giving the same results as the logit method, or not giving valid results). Is that correct, and if so, is it likely to change (be implemented) at some point?

Reason I ask is that the xlogit method seems to be the "standard" or default CI computation method that SUDAAN (and thus the CDC) uses for complex sample survey data (at least BRFSS/YRBS/PRAMs and CUBS surveys). And so when folks are trying to match results in SAS or SUDAAN to what I generate using survey_prop they wonder why the confidence interval estimates are slightly different.

Thanks for any insight!

Peter

Peter Holck, PhD, MPH Biostatistician @.*** 512 788 5393

On Mon, Jan 23, 2023 at 5:44 PM Peter Holck @.***> wrote:

Thanks very much Ben for your quick and informative response. I had no idea about cur_svy() or that it could be utilized. That seems to be just the ticket, and I'll give it a try.

I wholeheartedly think putting some mention in the vignette about this could be useful to other folks like myself (the vignette is a very useful tool). But I do understand about the deprecation possibility being an issue as well.

Thank you - Peter

On Sun, Jan 22, 2023 at 9:09 PM Ben Schneider @.***> wrote:

Hi @pholck https://github.com/pholck, The various statistical testing functions such as svychisq(), svyttest(), or regTermTest() don't have a "srvyr" wrapper. But you can still use them inside the summarize() function. The cur_svy() function http://gdfe.co/srvyr/reference/cur_svy.html is especially helpful for this, as it lets you access the survey design object corresponding to a group of data created by the group_by() function.

Here's an example, where a Chi-Squared test is conducted separately for each group. The output of the test in a given group is turned into a summary data frame using the tidy() function from the broom package. The summary data frame is then stored in a list column.

library(survey)

library(srvyr)

Examples from ?survey::svydesign

library(survey)

data(api)

Create survey design object

dstrata <- apistrat %>%

as_survey_design(strata = stype, weights = pw)

Group by stype and summarize

dstrata %>%

group_by(stype) %>%

summarise(

prop = survey_prop(),

chisq_test = svychisq(formula = ~ awards + sch.wide,

                      design = cur_svy()) |>

  broom::tidy()

)

> Multiple parameters; naming those columns ndf, ddf

> Multiple parameters; naming those columns ndf, ddf

> Multiple parameters; naming those columns ndf, ddf

> # A tibble: 3 × 4

> stype prop prop_se chisq_test$ndf $ddf $statistic $p.value $method

>

> 1 E 0.714 0 1 99 28.0 0.000000722 Pearson's X^2…

> 2 H 0.122 0 1 49 23.6 0.0000126 Pearson's X^2…

> 3 M 0.164 0 1 49 21.5 0.0000261 Pearson's X^2…

Created on 2023-01-22 by the reprex package https://reprex.tidyverse.org (v2.0.1)

— Reply to this email directly, view it on GitHub https://github.com/gergness/srvyr/issues/152#issuecomment-1399650046, or unsubscribe https://github.com/notifications/unsubscribe-auth/AO7FU6NOCUAKXRL7RZI5HN3WTXD33ANCNFSM6AAAAAAUDDS55M . You are receiving this because you were mentioned.Message ID: @.***>

bschneidr commented 2 days ago

Hi @pholck,

It's hard to say anything without a small reproducible example to illustrate any unexpected results you're seeing. I would recommend writing up an example of what you got from SUDAAN and what you're getting in R, using a small dataset. That way any problems can be quickly understood by a developer.

Here is an illustration of a reproducible example:

library(dplyr)
library(srvyr)

data(api, package = "survey")

dstrata <- apistrat %>%
  as_survey_design(strata = stype, weights = pw)

dstrata |>
  group_by(sch.wide) |>
  summarize(
    pct = survey_prop(prop_method = "xlogit")
  )
#> When `proportion` is unspecified, `survey_prop()` now defaults to `proportion = TRUE`.
#> ℹ This should improve confidence interval coverage.
#> This message is displayed once per session.
#> # A tibble: 2 × 3
#>   sch.wide   pct pct_se
#>   <fct>    <dbl>  <dbl>
#> 1 No       0.172 0.0248
#> 2 Yes      0.828 0.0248

You might find the following short YouTube clip helpful for illustrating how to get a reproducible example in RStudio and copy the output so you can paste it into GitHub or an email:

https://www.youtube.com/watch?v=3R79dGkU5rQ

If you find there are unexpected results from svyciprop(..., method = "xlogit"), you could reach out to Thomas Lumley with a reproducible example and he could potentially give you an explanation or look into a bug if there's one.

bschneidr commented 2 days ago

@gergness I think it's safe to close this issue, since there's not really anything actionable for 'srvyr' on it now.