There are lots of different glm() options that could be added: Poisson, weights, propensity scores, as you rightly say. Rather than make things more complicated, ff_merge() allows any table to be created. Here are some examples going from a simple logistic regression model to a weighted model.
Standard approach:
explanatory = c("age.factor", "age", "sex.factor", "nodes", "obstruct.factor", "perfor.factor")
dependent = 'mort_5yr'
colon_s %>%
finalfit(dependent, explanatory)
Build each column separately approach:
colon_s %>%
  summary_factorlist(dependent, explanatory, fit_id = TRUE) %>%
  ff_merge(
    glmuni(colon_s, dependent, explanatory) %>%
      fit2df(estimate_suffix = " (univariable)")
  ) %>%
  ff_merge(
    glmmulti(colon_s, dependent, explanatory) %>%
      fit2df(estimate_suffix = " (multivariable)")
  ) %>%
  dplyr::select(-fit_id, -index) %>%
  dependent_label(colon_s, dependent)
Incorporate a more complex model:
colon_s %>%
  summary_factorlist(dependent, explanatory, fit_id = TRUE) %>%
  ff_merge(
    glmuni(colon_s, dependent, explanatory) %>%
      fit2df(estimate_suffix = " (univariable)")
  ) %>%
  ff_merge(
    glm(ff_formula(dependent, explanatory), data = colon_s,
        family = "binomial", weights = NULL) %>%
      fit2df(estimate_suffix = " (multivariable)")
  ) %>%
  dplyr::select(-fit_id, -index) %>%
  dependent_label(colon_s, dependent)
Let me know what you think.
That's great for multivariate models, thanks!
What do you think about weighted univariate stats? Would that require using different functions within the glmuni() function?
lmuni(), lmmulti(), glmuni() and glmmulti() now support weights.
explanatory = c("age.factor", "age", "sex.factor", "nodes", "obstruct.factor", "perfor.factor")
dependent = 'mort_5yr'
wgts = runif(dim(colon_s)[1], 0, 1)
library(finalfit)
colon_s %>%
  summary_factorlist(dependent, explanatory, fit_id = TRUE) %>%
  ff_merge(
    glmuni(colon_s, dependent, explanatory, weights = wgts) %>%
      fit2df(estimate_suffix = " (univariable)")
  ) %>%
  ff_merge(
    glm(ff_formula(dependent, explanatory), data = colon_s,
        family = "binomial", weights = wgts) %>%
      fit2df(estimate_suffix = " (multivariable)")
  ) %>%
  dplyr::select(-fit_id, -index) %>%
  dependent_label(colon_s, dependent)
That's incredible! Also supporting strata and clusters (as in the svydesign() function) would be amazing.
Thanks. Could you provide an example of what you suggest?
Certainly, an example is in here: https://www.hcup-us.ahrq.gov/reports/methods/2017-01.pdf
In the US, this is a commonly used set of databases. From page 10 of the PDF:
install.packages("survey"); library(survey)
strnrddsgn <- svydesign(ids = ~HOSP_NRD, weights = ~DISCWT, strata = ~NRD_STRATUM, data = core)
Hi,
I don't know too much about this area. I've added these functions and would appreciate it if you could let me know whether this is what you were thinking of. Also, are the examples correct?
Many thanks for your help with this.
Ewen
library(dplyr)
library(survey)
# Examples from survey::svyglm() help page
data(api)
# Label data frame
apistrat = apistrat %>%
  mutate(
    api00 = ff_label(api00, "API in 2000 (api00)"),
    ell = ff_label(ell, "English language learners (percent)(ell)"),
    meals = ff_label(meals, "Meals eligible (percent)(meals)"),
    mobility = ff_label(mobility, "First year at the school (percent)(mobility)"),
    sch.wide = ff_label(sch.wide, "School-wide target met (sch.wide)")
  )
# Linear example
dependent = "api00"
explanatory = c("ell", "meals", "mobility")
# Stratified design
dstrat = svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
# Univariable fit
fit_uni = dstrat %>%
  svyglmuni(dependent, explanatory) %>%
  fit2df(estimate_suffix = " (univariable)")
# Multivariable fit
fit_multi = dstrat %>%
  svyglmmulti(dependent, explanatory) %>%
  fit2df(estimate_suffix = " (multivariable)")
# Pipe together
apistrat %>%
  summary_factorlist(dependent, explanatory, fit_id = TRUE) %>%
  ff_merge(fit_uni) %>%
  ff_merge(fit_multi) %>%
  select(-fit_id, -index) %>%
  dependent_label(apistrat, dependent)
# Dependent: API in 2000 (api00) Mean (sd) Coefficient (univariable) Coefficient (multivariable)
# English language learners (percent)(ell) [0,84] 652.8 (121.0) -3.73 (-4.35--3.11, p<0.001) -0.48 (-1.25-0.29, p=0.222)
# Meals eligible (percent)(meals) [0,100] 652.8 (121.0) -3.38 (-3.71--3.05, p<0.001) -3.14 (-3.70--2.59, p<0.001)
# First year at the school (percent)(mobility) [1,99] 652.8 (121.0) -1.43 (-3.30-0.44, p=0.137) 0.23 (-0.54-1.00, p=0.567)
# Binomial example
## Note: the model family needs to be specified, and exponentiation applied if desired
dependent = "sch.wide"
explanatory = c("ell", "meals", "mobility")
# Univariable fit
fit_uni = dstrat %>%
  svyglmuni(dependent, explanatory, family = "quasibinomial") %>%
  fit2df(exp = TRUE, estimate_name = "OR", estimate_suffix = " (univariable)")
# Multivariable fit
fit_multi = dstrat %>%
  svyglmmulti(dependent, explanatory, family = "quasibinomial") %>%
  fit2df(exp = TRUE, estimate_name = "OR", estimate_suffix = " (multivariable)")
# Pipe together
apistrat %>%
  summary_factorlist(dependent, explanatory, fit_id = TRUE) %>%
  ff_merge(fit_uni) %>%
  ff_merge(fit_multi) %>%
  select(-fit_id, -index) %>%
  dependent_label(apistrat, dependent)
# Dependent: School-wide target met (sch.wide) No Yes OR (univariable) OR (multivariable)
# English language learners (percent)(ell) Mean (SD) 22.5 (19.3) 20.5 (20.0) 1.00 (0.98-1.01, p=0.715) 1.00 (0.97-1.02, p=0.851)
# Meals eligible (percent)(meals) Mean (SD) 46.0 (29.1) 44.7 (29.0) 1.00 (0.99-1.01, p=0.968) 1.00 (0.98-1.01, p=0.732)
# First year at the school (percent)(mobility) Mean (SD) 13.9 (8.6) 17.2 (13.0) 1.06 (1.00-1.12, p=0.049) 1.06 (1.00-1.13, p=0.058)
That's fantastic, I should have time in the next week to go through this and confirm. I will also test on my own data. I'll respond as soon as I do. Thank you for this great update!
Thanks. One issue is that the summary statistics from summary_factorlist() are not weighted; they cannot be weighted using this function. So to be useful it might be necessary to incorporate svymean() or similar.
What would be useful is to know how you would actually use this: what your final table would actually look like, what variations would be common in practice, etc. Best wishes.
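For illustration, a minimal sketch of the kind of weighted summary being suggested, reusing the dstrat design from the api example above (a sketch only, not an existing finalfit feature):
# Sketch: weighted summaries of the explanatory variables, using the earlier dstrat design
svymean(~ell + meals + mobility, dstrat)               # weighted means with standard errors
svyby(~ell, ~sch.wide, dstrat, svymean, na.rm = TRUE)  # weighted means by group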
Ewen, I just took a look and that looks correct, though as you mention, a key feature that would be really nice is weighted means and SDs in the table using svymean(). The final table would be just like you are demonstrating, except with weighted mean/SD.
This is a basic example of what those tables would look like in a paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5624560/pdf/cureus-0009-00000001536.pdf
Ideally, summary_factorlist() should test whether .data is a svydesign object. In that case, all stats could be computed using survey functions, i.e. svytable(), svymean(), svychisq(), etc.
The desired table is the same as with unweighted data; it is just that the stats should be computed using survey functions.
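For illustration, a minimal sketch of the class test being described (a hypothetical wrapper, not finalfit's actual implementation):
# Hypothetical sketch only: dispatch on the design class
summary_factorlist_any <- function(.data, dependent, explanatory, ...){
  if(inherits(.data, "survey.design")){
    # weighted path: survey::svytable(), survey::svymean(), survey::svychisq(), ...
    stop("survey.design method not implemented in this sketch")
  } else {
    finalfit::summary_factorlist(.data, dependent, explanatory, ...)
  }
}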
Thanks @larmarange. Is it worth a new function? Would people use it?
Dear @ewenharrison, thank you very much for your feedback. You have done a fantastic job on finalfit.
I just spent 3 days teaching R to master's students and teachers in population studies. They were very impressed by summary_factorlist(), finalfit() and or_plot(), as they are very easy to use.
In population studies, most of them are working with national surveys. Such surveys have complex sampling and/or post-calibration. Almost all of them require the survey package.
When I presented multivariate analysis using survey, they were very frustrated not to be able to use finalfit.
I would say it is really worth extending summary_factorlist() to be able to deal with a survey design object instead of a data.frame, and therefore to use survey functions (such as svytable(), svymean(), svychisq()...) to generate the table.
Similarly, if summary_factorlist() is adapted to manage the survey.design class (it is simple to test whether an object inherits from that class in order to call a specific function), then it should be quite easy to adapt finalfit() and or_plot() as well.
By the way, just so you know, survey also provides a svycoxph() function.
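For reference, a minimal sketch of what a svycoxph() call looks like; des, time, status and age here are hypothetical placeholders, not objects from this thread:
# Minimal sketch of svycoxph(); `des` is a hypothetical svydesign object whose
# data contain hypothetical time, status and age columns
library(survey)
library(survival)
fit <- svycoxph(Surv(time, status) ~ age, design = des)
summary(fit)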
Ok, well if you could point me to a suitable dataset and provide 2 or 3 example analyses using survey, together with the table output you would like to see, then I can have a look. Many of the examples in the library(survey) help seem very complex and difficult to generalise.
Please find below an example using a dataset provided by the questionr package.
library(questionr)
data(hdv2003)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
hdv2003$etud <- recode_factor(
hdv2003$nivetud,
"N'a jamais fait d'etudes" = "Primary",
"A arrete ses etudes, avant la derniere annee d'etudes primaires" = "Primary",
"Derniere annee d'etudes primaires" = "Primary",
"1er cycle" = "Secondary",
"2eme cycle" = "Secondary",
"Enseignement technique ou professionnel court" = "Professionnal",
"Enseignement technique ou professionnel long" = "Professionnal",
"Enseignement superieur y compris technique superieur" = "High"
)
library(survey)
#> Loading required package: grid
#> Loading required package: Matrix
#> Loading required package: survival
#>
#> Attaching package: 'survey'
#> The following object is masked from 'package:graphics':
#>
#> dotchart
dw <- svydesign(ids = ~ 1, weights = ~ poids, data = hdv2003)
# Note: summary.formula not compatible with svydesign objects
# dependent variable
freq(svytable(~ sport, dw))
#> n % val%
#> Non 6714760 60.7 60.7
#> Oui 4356466 39.3 39.3
# binary explanatory
tab <- svytable(~ sexe + sport, dw)
tab
#> sport
#> sexe Non Oui
#> Homme 2850210 2299173
#> Femme 3864550 2057293
rprop(tab)
#> sport
#> sexe Non Oui Total
#> Homme 55.4 44.6 100.0
#> Femme 65.3 34.7 100.0
#> Ensemble 60.7 39.3 100.0
svychisq(~ sexe + sport, dw)
#>
#> Pearson's X^2: Rao & Scott adjustment
#>
#> data: svychisq(~sexe + sport, dw)
#> F = 12.444, ndf = 1, ddf = 1999, p-value = 0.0004289
# explanatory with more levels
tab <- svytable(~ etud + sport, dw)
tab
#> sport
#> etud Non Oui
#> Primary 1978619.6 234477.9
#> Secondary 1428124.8 671021.2
#> Professionnal 2050834.3 1315064.1
#> High 1071009.1 1504802.3
rprop(tab)
#> sport
#> etud Non Oui Total
#> Primary 89.4 10.6 100.0
#> Secondary 68.0 32.0 100.0
#> Professionnal 60.9 39.1 100.0
#> High 41.6 58.4 100.0
#> Ensemble 63.7 36.3 100.0
svychisq(~ etud + sport, dw)
#>
#> Pearson's X^2: Rao & Scott adjustment
#>
#> data: svychisq(~etud + sport, dw)
#> F = 45.593, ndf = 2.9831, ddf = 5963.2106, p-value < 2.2e-16
# quantitative explanatory
svyby(~ heures.tv, ~ sport, dw, svymean, na.rm = TRUE) # mean
#> sport heures.tv se
#> Non Non 2.417582 0.06689688
#> Oui Oui 1.785640 0.06573058
svyttest(heures.tv ~ sport, dw)
#>
#> Design-based t-test
#>
#> data: heures.tv ~ sport
#> t = -6.7382, df = 1993, p-value = 2.092e-11
#> alternative hypothesis: true difference in mean is not equal to 0
#> 95 percent confidence interval:
#> -0.8157581 -0.4481261
#> sample estimates:
#> difference in mean
#> -0.6319421
svyby(~ heures.tv, ~ sport, dw, svyquantile, quantiles = 1:3/4, ci = TRUE, na.rm = TRUE) # quartiles
#> Warning in vcov.svyquantile(X[[i]], ...): Only diagonal of vcov() available
#> Warning in vcov.svyquantile(X[[i]], ...): Only diagonal of vcov() available
#> sport 0.25 0.5 0.75 se.0.25 se.0.5 se.0.75
#> Non Non 1 2 3 0.0000000 0.0000000 0.2551067
#> Oui Oui 1 2 3 0.1020427 0.1371243 0.2551067
svyranktest(heures.tv ~ sport, dw)
#>
#> Design-based KruskalWallis test
#>
#> data: heures.tv ~ sport
#> t = -6.1587, df = 1993, p-value = 8.846e-10
#> alternative hypothesis: true difference in mean rank score is not equal to 0
#> sample estimates:
#> difference in mean rank score
#> -0.09905157
# binary regression
reg <- svyglm(sport ~ sexe + etud + heures.tv, family = quasibinomial(), design = dw)
questionr::odds.ratio(reg)
#> OR 2.5 % 97.5 % p
#> (Intercept) 0.18564 0.11762 0.2930 6.902e-13 ***
#> sexeFemme 0.79913 0.61522 1.0380 0.093059 .
#> etudSecondary 3.73818 2.32314 6.0151 6.261e-08 ***
#> etudProfessionnal 4.94649 3.18799 7.6750 1.398e-12 ***
#> etudHigh 10.16897 6.43759 16.0632 < 2.2e-16 ***
#> heures.tv 0.88979 0.81820 0.9677 0.006427 **
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
broom::tidy(reg, exponentiate = TRUE, conf.int = TRUE)
#> # A tibble: 6 x 7
#> term estimate std.error statistic p.value conf.low conf.high
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 0.186 0.233 -7.23 6.90e-13 0.118 0.293
#> 2 sexeFemme 0.799 0.133 -1.68 9.31e- 2 0.615 1.04
#> 3 etudSecondary 3.74 0.243 5.43 6.26e- 8 2.32 6.02
#> 4 etudProfessionn~ 4.95 0.224 7.13 1.40e-12 3.19 7.67
#> 5 etudHigh 10.2 0.233 9.94 9.76e-23 6.44 16.1
#> 6 heures.tv 0.890 0.0428 -2.73 6.43e- 3 0.818 0.968
In terms of expected output, the idea would be to be able to do the following:
dep <- "sport"
expl <- c("sexe", "etud", "heures.tv")
library(finalfit)
dw %>% summary_factorlist(dep, expl, p = TRUE)
dw %>% finalfit(dep, expl)
The table would be exactly the same as with a classic data.frame, except that all stats would have been computed using survey functions, taking into account sample weights and sampling design.
Best regards
Just for information, if you prefer working with dplyr syntax, there is a package called srvyr extending dplyr verbs to survey objects.
https://cran.r-project.org/web/packages/srvyr/vignettes/srvyr-vs-survey.html
With this package you can do things like:
dw <- as_survey_design(dw) # to add tbl_svy class to dw
dw %>% group_by(sport) %>% summarise(mean = survey_mean(heures.tv, na.rm = TRUE))
Here, the result will be a tibble.
Thanks Joseph for the very useful example.
This probably gets us quite far. There are summary_factorlist() features not yet included, most notably labels, but that would be an easy addition. This would be tidied up and split into component functions prior to being deployed.
Interested in thoughts.
library(questionr)
data(hdv2003)
library(dplyr)
hdv2003$etud <- recode_factor(
hdv2003$nivetud,
"N'a jamais fait d'etudes" = "Primary",
"A arrete ses etudes, avant la derniere annee d'etudes primaires" = "Primary",
"Derniere annee d'etudes primaires" = "Primary",
"1er cycle" = "Secondary",
"2eme cycle" = "Secondary",
"Enseignement technique ou professionnel court" = "Professionnal",
"Enseignement technique ou professionnel long" = "Professionnal",
"Enseignement superieur y compris technique superieur" = "High"
)
library(survey)
dw <- svydesign(ids = ~ 1, weights = ~ poids, data = hdv2003)
summary_survey <- function(.data, dependent, explanatory, cont = "mean", cont_range = FALSE,
                           p = FALSE, column = FALSE, digits = c(1, 1, 3, 1), fit_id = FALSE){
  # Define variable type
  explanatory_type = .data$variables %>%
    select(explanatory) %>%
    purrr::map(is.numeric)
  # Hypothesis test
  if(p){
    p_tests = explanatory %>%
      purrr::map2(explanatory_type,
                  ~if(!.y){
                    survey::svychisq(as.formula(paste0("~", ., "+", dependent)), .data)$p.value %>%
                      p_tidy(digits[3])
                  } else if (.y & cont == "mean"){
                    survey::svyttest(as.formula(paste0(.x, "~", dependent)), .data)$p.value %>%
                      p_tidy(digits[3])
                  } else if (.y & cont == "median"){
                    survey::svyranktest(as.formula(paste0(.x, "~", dependent)), .data)$p.value %>%
                      p_tidy(digits[3])
                  }
      )
  }
  # Output table
  explanatory %>%
    purrr::map2(explanatory_type,
                ~ if(!.y){
                  survey::svytable(as.formula(paste0("~", .x, "+", dependent)), .data) %>%
                    as.data.frame(stringsAsFactors = FALSE) %>%
                    { if(column) {
                      dplyr::group_by(., !! sym(dependent)) %>%
                        dplyr::mutate(
                          total = sum(Freq),
                          prop = 100 * Freq / total
                        )
                    } else {
                      dplyr::group_by_at(., 1) %>%
                        dplyr::mutate(
                          total = sum(Freq),
                          prop = 100 * Freq / total
                        )
                    }
                    } %>%
                    dplyr::mutate(
                      value = paste0(Freq %>% round_tidy(digits[4]), " (",
                                     prop %>% round_tidy(digits[4]), ")")
                    ) %>%
                    dplyr::select(-total, -prop, -Freq) %>%
                    tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>%
                    dplyr::mutate(
                      label = names(.)[1]
                    ) %>%
                    dplyr::rename(levels = 1)
                } else {
                  { if(cont == "mean") {
                    survey::svyby(as.formula(paste0("~", .x)), as.formula(paste0("~", dependent)), .data, svymean, na.rm = TRUE) %>%
                      dplyr::mutate(
                        value = paste0(!! sym(.x) %>% round_tidy(digits[1]), " (",
                                       se %>% round_tidy(digits[2]), ")")
                      ) %>%
                      dplyr::select(-c(.x, se)) %>%
                      tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>%
                      dplyr::mutate(
                        label = .x,
                        levels = "Mean (SE)"
                      )
                  } else if(cont == "median"){
                    survey::svyby(as.formula(paste0("~", .x)), as.formula(paste0("~", dependent)), .data,
                                  svyquantile, quantiles = 1:3/4, ci = TRUE, na.rm = TRUE) %>%
                      dplyr::rename(Q1 = 2,
                                    Q2 = 3,
                                    Q3 = 4) %>%
                      dplyr::mutate(
                        IQR = Q3 - Q1) %>%
                      { if(cont_range){
                        dplyr::mutate(.,
                                      value = paste0(Q2 %>% round_tidy(digits[1]), " (",
                                                     Q1 %>% round_tidy(digits[2]), " to ",
                                                     Q3 %>% round_tidy(digits[2]), ")")
                        )
                      } else {
                        dplyr::mutate(.,
                                      value = paste0(Q2 %>% round_tidy(digits[1]), " (",
                                                     IQR %>% round_tidy(digits[2]), ")")
                        )
                      }} %>%
                      dplyr::select(-c(2:8)) %>%
                      tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>%
                      dplyr::mutate(
                        label = .x,
                        levels = "Median (IQR)"
                      )
                  }
                  }
                }) %>%
    # Add hypothesis test
    { if(p){
      purrr::map2_df(., p_tests,
                     ~ mutate(.x,
                              p = .y)
      )} else {
        dplyr::bind_rows(.)
      }} %>%
    dplyr::select(label, levels, dplyr::everything()) %>%
    as.data.frame() %>%
    { if(fit_id){
      levels_id = .$levels
      drop = levels_id %in% c("Mean (SE)", "Median (IQR)")
      levels_id[drop] = ""
      mutate(.,
             fit_id = paste0(label, levels_id),
             index = 1:dim(.)[1])
    } else {
      .
    }} %>%
    rm_duplicate_labels()
}
library(finalfit)
dependent = "sport"
# Choose different options to test
explanatory = c("sexe")
explanatory = c("heures.tv")
explanatory = c("sexe", "heures.tv", "etud")
dw %>%
  summary_survey(dependent, explanatory, cont = "mean", cont_range = FALSE,
                 p = FALSE, column = FALSE, digits = c(1, 1, 3, 0))
ff_merge(
  summary_survey(dw, dependent, explanatory, fit_id = TRUE),
  svyglmmulti(dw, dependent, explanatory, family = "quasibinomial") %>%
    fit2df(exp = TRUE, estimate_name = "OR (multivariable)"),
  last_merge = TRUE
)
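As a possible way of adding the missing labels mentioned above (an assumption about how it could be done, not the eventual implementation), the label column in summary_survey() could be filled from ff_label()-style label attributes:
# Assumption / sketch only: pull a "label" attribute if one has been set
# (e.g. with ff_label()), otherwise fall back to the variable name
get_label <- function(x, default){
  lab <- attr(x, "label", exact = TRUE)
  if(is.null(lab)) default else lab
}
# e.g. inside summary_survey(), `label = .x` could become
#   label = get_label(.data$variables[[.x]], .x)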
Thanks a lot. That's great!!!!
Just a small comment: for the digits argument, it would be nice to have a 5th value corresponding to the number of digits for the number of observations (for example, to be able to use 0 for the number of observations while keeping 1 for percentages).
Otherwise, everything seems to behave as expected.
Thanks again
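To illustrate the suggestion (a proposed call only; this 5th digits value is not supported by the summary_survey() sketch above):
# Proposed, not yet implemented: a separate digits value for the weighted counts,
# e.g. 0 decimal places for counts while keeping 1 for percentages
dw %>%
  summary_survey(dependent, explanatory, digits = c(1, 1, 3, 1, 0))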
Thank you both, I had also noted the usefulness of having weighted functions. I am glad this is a feature now!
Tyler Cole
Thank you very much for this excellent development. I also needed exactly this functionality.
I noticed that currently Mean (SE) is used instead of Mean (SD). Definitely, more often we need the SD for reporting together with means. Unfortunately, svymean() does not provide the SD, but it can be derived easily from svyvar():
svyvar(~x, design)[1] %>% sqrt()
or
svyby(~x, ~dependent, design, svyvar, na.rm = TRUE)[2] %>% sqrt()
It would be great to have Mean (SD), which would also ensure consistency with summary_factorlist().
Nobu Nishikiori
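As a concrete illustration of that derivation, using the dstrat design from the api example earlier in the thread (a sketch):
# Sketch: weighted SD derived from svyvar(), using the earlier dstrat design
svyvar(~ell, dstrat)[1] %>% sqrt()                                 # overall weighted SD
svyby(~ell, ~sch.wide, dstrat, svyvar, na.rm = TRUE)[2] %>% sqrt() # weighted SD by group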
I very much agree with this
Thanks both.
Just to confirm, there is not a library(survey) function that will give both the mean and the SD in the same call?
I have re-written the original summary_factorlist() to remove some old dependencies and incorporate the requested features. We'll get that and this out soon. Thanks again. Ewen
I only know of: https://rdrr.io/cran/jtools/man/svysd.html
I believe so.
> Just to confirm, there is not a library(survey) function that will give both the mean and the sd in the same function?

Thank you very much for incorporating the feature. Nobu
Hi @ewenharrison
Please, I am a newbie to R and I am trying to adapt your summary_survey(dw, dependent, explanatory, fit_id = TRUE) solution to my work with the parameters below:
surveydes = svydesign(id = ~psu, strata = ~strata, weights = ~pweight, data = data)
dependent = "knowledge"
explanatory = c("agegroup", "married", "children")
surveydes %>%
  summary_survey(dependent, explanatory, cont = "mean", cont_range = FALSE, p = FALSE, column = FALSE, digits = c(1, 1, 3, 0))
Each time I run my code, it returns: "Error in select(., explanatory) : unused argument (explanatory)"
I copied and pasted your summary_survey() source code into my R environment, but I am certain there is something I have either misapplied or am not doing right. Could you please advise?
Thank you, Elhakim
Hi, not sure why it is not working for you. Do you have a reproducible example?
Many thanks for summary_survey(); it is a very useful function. Please, I would like to have the row/column totals in the output table.
Many thanks for writing. Yes, this is not yet implemented. I guess you would need an observed total as well as a weighted total? Best wishes, Ewen
@ewenharrison thanks for the reply. Yes, it would be great to have both totals if possible.
Do you have published examples of how these tables are normally presented?
label | levels   | X1            | X2              | Total
----- | -------- | ------------- | --------------- | ---------------
Sexe  | Féminin  | 761.6 (77.1)  | 23242.2 (85.7)  | 24003.8 (81.4)
      | Masculin | 226.6 (22.9)  | 3881.9 (14.3)   | 4108.5 (18.6)
      | Total    | 988.2 (100.0) | 27124.1 (100.0) | 28112.3 (100.0)
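For context, a sketch of where both kinds of total could come from for a single explanatory variable, using the dw design defined earlier in the thread:
# Sketch: weighted vs observed totals for one explanatory variable
svytable(~sexe, dw)        # weighted totals
table(dw$variables$sexe)   # observed (unweighted) counts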
Would it be difficult to incorporate weighting into this package? This would be incredibly helpful both for administrative datasets and for propensity score matching reporting. Thanks!