gergness / srvyr

R package to add 'dplyr'-like Syntax for Summary Statistics of Survey Data
214 stars 28 forks source link

Add information about the different degrees of freedom default to the `srvyr` vs `survey` vignette #119

Open florisvdh opened 3 years ago

florisvdh commented 3 years ago

It appears that the 95% confidence limits from survey_mean() are slightly different from those obtained with survey::svymean() - is this a bug?

suppressPackageStartupMessages({
    library(survey)
    library(srvyr)
})
data(api)

srs_design <- apisrs %>% svydesign(id = ~1, fpc = ~fpc, data = .)
api_mean <- svymean(~enroll, srs_design)
(svy_ci <- confint(api_mean))
#>          2.5 %  97.5 %
#> enroll 530.969 638.251

srs_design_srvyr <- apisrs %>% as_survey_design(ids = 1, fpc = fpc)
(srvyr_ci <- 
    srs_design_srvyr %>% 
    summarise(enroll = survey_mean(enroll, vartype = "ci")))
#>   enroll enroll_low enroll_upp
#> 1 584.61   530.6408   638.5792

svy_ci[1,1] == srvyr_ci$enroll_low
#> [1] FALSE
svy_ci[1,2] == srvyr_ci$enroll_upp
#> [1] FALSE

svy_ci[1,1] == coef(api_mean) - qnorm(0.975) * SE(api_mean)
#>        enroll
#> enroll   TRUE
svy_ci[1,2] == coef(api_mean) + qnorm(0.975) * SE(api_mean)
#>        enroll
#> enroll   TRUE

Created on 2021-05-07 by the reprex package (v2.0.0)

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.0.5 (2021-03-31) #> os Linux Mint 20 #> system x86_64, linux-gnu #> ui X11 #> language nl_BE:nl #> collate nl_BE.UTF-8 #> ctype nl_BE.UTF-8 #> tz Europe/Brussels #> date 2021-05-07 #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.0.2) #> cli 2.4.0 2021-04-05 [1] CRAN (R 4.0.5) #> crayon 1.4.1 2021-02-08 [1] CRAN (R 4.0.3) #> DBI 1.1.1 2021-01-15 [1] CRAN (R 4.0.3) #> digest 0.6.27 2020-10-24 [1] CRAN (R 4.0.3) #> dplyr 1.0.5 2021-03-05 [1] CRAN (R 4.0.5) #> ellipsis 0.3.1 2020-05-15 [1] CRAN (R 4.0.2) #> evaluate 0.14 2019-05-28 [1] CRAN (R 4.0.2) #> fansi 0.4.2 2021-01-15 [1] CRAN (R 4.0.3) #> fs 1.5.0 2020-07-31 [1] CRAN (R 4.0.2) #> generics 0.1.0 2020-10-31 [1] CRAN (R 4.0.3) #> glue 1.4.2 2020-08-27 [1] CRAN (R 4.0.2) #> highr 0.8 2019-03-20 [1] CRAN (R 4.0.2) #> htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.0.3) #> knitr 1.31 2021-01-27 [1] CRAN (R 4.0.3) #> lattice 0.20-41 2020-04-02 [4] CRAN (R 4.0.0) #> lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.0.4) #> magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.0.3) #> Matrix * 1.3-2 2021-01-06 [4] CRAN (R 4.0.3) #> mitools 2.4 2019-04-26 [1] CRAN (R 4.0.5) #> pillar 1.5.1 2021-03-05 [1] CRAN (R 4.0.5) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.0.2) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.0.2) #> R6 2.5.0 2020-10-28 [1] CRAN (R 4.0.3) #> reprex 2.0.0 2021-04-02 [1] CRAN (R 4.0.5) #> rlang 0.4.10 2020-12-30 [1] CRAN (R 4.0.3) #> rmarkdown 2.7 2021-02-19 [1] CRAN (R 4.0.4) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.0.3) #> sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.0.2) #> srvyr * 1.0.1 2021-03-28 [1] CRAN (R 4.0.5) #> stringi 1.5.3 2020-09-09 [1] CRAN (R 4.0.2) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.0.2) #> survey * 4.0 2020-04-03 [1] CRAN (R 4.0.5) #> survival * 3.2-10 2021-03-16 [4] CRAN (R 4.0.4) #> tibble 3.1.0 2021-02-25 [1] CRAN (R 4.0.5) #> tidyr 1.1.3 2021-03-03 [1] CRAN (R 4.0.5) #> tidyselect 1.1.0 2020-05-11 [1] CRAN (R 4.0.2) #> utf8 1.2.1 2021-03-12 [1] CRAN (R 4.0.5) #> vctrs 0.3.7 2021-03-29 [1] CRAN (R 4.0.5) #> withr 2.4.1 2021-01-26 [1] CRAN (R 4.0.3) #> xfun 0.22 2021-03-11 [1] CRAN (R 4.0.4) #> yaml 2.2.1 2020-02-01 [1] CRAN (R 4.0.2) #> #> [1] /home/floris/lib/R/library #> [2] /usr/local/lib/R/site-library #> [3] /usr/lib/R/site-library #> [4] /usr/lib/R/library ```
gergness commented 3 years ago

Not a bug (though a little confusing, I admit). srvyr tries to be helpful by setting the df on confint for you.

If you prefer the results you got using the survey package, you can do this in srvyr:

(srvyr_ci_2 <- 
        srs_design_srvyr %>% 
        summarise(enroll = survey_mean(enroll, vartype = "ci", df = Inf)))
#>   enroll enroll_low enroll_upp
#> 1 584.61    530.969    638.251

Or to get the srvyr results in survey:

api_mean_2 <- svymean(~enroll, srs_design)
(svy_ci_2 <- confint(api_mean_2, df = degf(srs_design)))
#>           2.5 %   97.5 %
#> enroll 530.6408 638.5792
florisvdh commented 3 years ago

Thank you for explaining @gergness. And indeed the df argument is present in the documentation of survey_mean() :+1:.

I think it would be good to have this aspect added in the srvyr-vs-survey vignette. The difference can be seen under heading Summary statistics, but it's not discussed.

> svy_ci_2[1,1] == srvyr_ci$enroll_low
[1] TRUE
> svy_ci_2[1,2] == srvyr_ci$enroll_upp
[1] TRUE
> 
> srvyr_ci$enroll_low == coef(api_mean) - qt(0.975, df = degf(srs_design)) * SE(api_mean)
       enroll
enroll   TRUE
> srvyr_ci$enroll_upp == coef(api_mean) + qt(0.975, df = degf(srs_design)) * SE(api_mean)
       enroll
enroll   TRUE
> 
> qt(0.975, df=Inf) == qnorm(0.975)
[1] TRUE
gergness commented 3 years ago

yeah, makes sense