tylcole commented 5 years ago

Would it be difficult to incorporating weighting into this package? This would be incredibly helpful for both administrative datasets as well as propensity score matching reporting. Thanks!

ewenharrison commented 5 years ago

There are lots of different glm() options that could be added. Poisson/weights/propensity score as you rightly say. Rather than make things more complicated, ff_merge() allows any table to be created. Here are some examples of going from a simple logistic regression model, to a weighted model.

Standard approach:

explanatory = c("age.factor", "age", "sex.factor", "nodes", "obstruct.factor", "perfor.factor")
dependent = 'mort_5yr'
colon_s %>%
    finalfit(dependent, explanatory)

Build each column separately approach:

colon_s %>% 
    summary_factorlist(dependent, explanatory, fit_id=TRUE) %>% 
    ff_merge(
        glmuni(colon_s, dependent, explanatory) %>% 
            fit2df(estimate_suffix = " (univariable)")
    ) %>% 
    ff_merge(
        glmmulti(colon_s, dependent, explanatory) %>%
            fit2df(estimate_suffix = " (multivariable)")
    ) %>% 
    dplyr::select(-fit_id, -index) %>% 
    dependent_label(colon_s, dependent)

Incorporate more complex model:

colon_s %>% 
    summary_factorlist(dependent, explanatory, fit_id=TRUE) %>% 
    ff_merge(
        glmuni(colon_s, dependent, explanatory) %>% 
            fit2df(estimate_suffix = " (univariable)")
    ) %>% 
    ff_merge(
        glm(ff_formula(dependent, explanatory), data=colon_s, 
                family = "binomial", weights = NULL) %>% 
            fit2df(estimate_suffix = " (multivariable)")
    ) %>% 
    dplyr::select(-fit_id, -index) %>% 
    dependent_label(colon_s, dependent)

Let me know what you think.

tylcole commented 5 years ago

That's great for multivariate models, thanks!

What do you think about weighted univariate stats? Would that require using different functions within the glmuni() function?

ewenharrison commented 5 years ago

lmuni(), lmmulti(), glmuni() and glmmulti() now supported weights.

explanatory = c("age.factor", "age", "sex.factor", "nodes", "obstruct.factor", "perfor.factor")
dependent = 'mort_5yr'
wgts = runif(dim(colon_s)[1], 0, 1)

library(finalfit)
colon_s %>% 
    summary_factorlist(dependent, explanatory, fit_id=TRUE) %>% 
    ff_merge(
        glmuni(colon_s, dependent, explanatory, weights = wgts) %>% 
            fit2df(estimate_suffix = " (univariable)")
    ) %>% 
    ff_merge(
        glm(ff_formula(dependent, explanatory), data=colon_s, 
                family = "binomial", weights = wgts) %>% 
            fit2df(estimate_suffix = " (multivariable)")
    ) %>% 
    dplyr::select(-fit_id, -index) %>% 
    dependent_label(colon_s, dependent)

tylcole commented 5 years ago

That's incredible! Also supporting strata and clusters (as in function svydesign) would be amazing.

On Mon, Mar 25, 2019 at 07:40 ewenharrison notifications@github.com wrote:

Closed #13 https://github.com/ewenharrison/finalfit/issues/13.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ewenharrison/finalfit/issues/13#event-2226757364, or mute the thread https://github.com/notifications/unsubscribe-auth/AL7HSB7jMXLIy0Rt446LtbkjlnitSaQuks5vaN_fgaJpZM4aI4lV .

ewenharrison commented 5 years ago

Thanks. Could you provide an example of what you suggest.

tylcole commented 5 years ago

Certainly, an example is in here: https://www.hcup-us.ahrq.gov/reports/methods/2017-01.pdf In the US, this is a commonly used set of databases.

From page 10 of the pdf:

install.packages("survey"); library(survey)

Specify the sampling design with sampling weights DISCWT,

hospital clusters HOSP_NRD, and stratification NRD_STRATUM

strnrddsgn <- svydesign(ids = ~HOSP_NRD, weights = ~DISCWT, strata = ~NRD_STRATUM, data = core)

On Wed, Mar 27, 2019 at 4:26 AM ewenharrison notifications@github.com wrote:

Closed #13 https://github.com/ewenharrison/finalfit/issues/13.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ewenharrison/finalfit/issues/13#event-2232756977, or mute the thread https://github.com/notifications/unsubscribe-auth/AL7HSLg2SO3q1wCT9jnzAy27yTxqpPKiks5va1VJgaJpZM4aI4lV .

ewenharrison commented 5 years ago

Hi,

I don't know too much about this area. I've added these functions and would appreciate if you could let me know if this is what you were thinking of. Also, are the examples correct.

Many thanks for your help with this.

Ewen

library(dplyr)
library(survey)

# Examples from survey::svyglm() help page

data(api)

# Label data frame
apistrat = apistrat %>% 
  mutate(
    api00 = ff_label(api00, "API in 2000 (api00)"),
    ell = ff_label(ell, "English language learners (percent)(ell)"),
    meals = ff_label(meals, "Meals eligible (percent)(meals)"),
    mobility = ff_label(mobility, "First year at the school (percent)(mobility)"),
    sch.wide = ff_label(sch.wide, "School-wide target met (sch.wide)")
    )

# Linear example
dependent = "api00"
explanatory = c("ell", "meals", "mobility")

# Stratified design
dstrat = svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)

# Univariable fit
fit_uni = dstrat %>%
    svyglmuni(dependent, explanatory) %>%
    fit2df(estimate_suffix = " (univariable)")

# Multivariable fit
fit_multi = dstrat %>%
    svyglmmulti(dependent, explanatory) %>%
    fit2df(estimate_suffix = " (multivariable)")

# Pipe together
apistrat %>%
    summary_factorlist(dependent, explanatory, fit_id = TRUE) %>%
    ff_merge(fit_uni) %>% 
    ff_merge(fit_multi) %>% 
    select(-fit_id, -index) %>%
    dependent_label(apistrat, dependent)

#               Dependent: API in 2000 (api00)             Mean (sd)    Coefficient (univariable)  Coefficient (multivariable)
#     English language learners (percent)(ell)  [0,84] 652.8 (121.0) -3.73 (-4.35--3.11, p<0.001)  -0.48 (-1.25-0.29, p=0.222)
#              Meals eligible (percent)(meals) [0,100] 652.8 (121.0) -3.38 (-3.71--3.05, p<0.001) -3.14 (-3.70--2.59, p<0.001)
# First year at the school (percent)(mobility)  [1,99] 652.8 (121.0)  -1.43 (-3.30-0.44, p=0.137)   0.23 (-0.54-1.00, p=0.567)

# Binomial example
## Note model family needs specified and exponentiation if desired
dependent = "sch.wide"
explanatory = c("ell", "meals", "mobility")

# Univariable fit
fit_uni = dstrat %>%
    svyglmuni(dependent, explanatory, family = "quasibinomial") %>%
    fit2df(exp = TRUE, estimate_name = "OR", estimate_suffix = " (univariable)")

# Multivariable fit
fit_multi = dstrat %>%
    svyglmmulti(dependent, explanatory, family = "quasibinomial") %>%
    fit2df(exp = TRUE, estimate_name = "OR", estimate_suffix = " (multivariable)")

# Pipe together
apistrat %>%
    summary_factorlist(dependent, explanatory, fit_id = TRUE) %>%
    ff_merge(fit_uni) %>% 
    ff_merge(fit_multi) %>% 
    select(-fit_id, -index) %>%
    dependent_label(apistrat, dependent)

# Dependent: School-wide target met (sch.wide)                    No         Yes          OR (univariable)        OR (multivariable)
#     English language learners (percent)(ell) Mean (SD) 22.5 (19.3) 20.5 (20.0) 1.00 (0.98-1.01, p=0.715) 1.00 (0.97-1.02, p=0.851)
#              Meals eligible (percent)(meals) Mean (SD) 46.0 (29.1) 44.7 (29.0) 1.00 (0.99-1.01, p=0.968) 1.00 (0.98-1.01, p=0.732)
# First year at the school (percent)(mobility) Mean (SD)  13.9 (8.6) 17.2 (13.0) 1.06 (1.00-1.12, p=0.049) 1.06 (1.00-1.13, p=0.058)

tylcole commented 5 years ago

That's fantastic, I should have time in the next week to go through this and confirm. I will also test on my own data. I'll respond as soon as I do. Thank you for this great update!

ewenharrison commented 5 years ago

Thanks. One issue is that the summary statistics from summary_factorlist() are not weighted. They cannot be weighted using this function. So to be useful it might be necessary to incorporate svymean() or similar.

What would be useful is to know how you would actually use this. What your final table would actually look like. What variations would be common in practice etc. Best wishes.

tylcole commented 5 years ago

Ewen I just took a look and that looks correct, though like you mention a key feature that would be really nice is the weighted means and SD in the table using svymean(). The final table would be just like you are demonstrating, except weighted mean/SD.

This is a basic example of what those tables would look like in a paper: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5624560/pdf/cureus-0009-00000001536.pdf

larmarange commented 5 years ago

Ideally, summary_factorlist() should test if .data is a svydesign object. In such case, all stats could be computed using survey functions. i.e. svytable, svymean, svychisq, etc.

The desired table is the same as with unweighted data. It's just that the computation of them should used survey functions.

ewenharrison commented 5 years ago

Thanks @larmarange Is it worth a new function? Would people use it?

larmarange commented 5 years ago

Dear @ewenharrison thank you very much for your feedback. You have done a fantastic job on finalfit.

I just spent 3 days teaching R to master students and teachers in population studies. They were very impressed by summary_fatorlist(), finalfit() and or_plot() as they are very easy to use.

In population studies, most of them are working with national surveys. Such surveys have complex sampling and/or post-calibration. Almost all of them required to use the survey package.

When I presented multivariate analysis using survey, they were very frustrated not to be able to use finalfit.

I would say it's really worth to extend summary_facorlist() to be able to deal with a survey design object instead of a data.frame and to use therefore survey functions (such as svytable(), svymean(), svychisq()...) to generate the table.

Similarly, if summary_factorlist() is adaped to manage survey.design class (it's simple to test if an object inherits that class to load a specific function), then it should be quite easy to adapt finalfit() and or_fit() as well.

By the way, just for you to know, survey also provide a svycoxph() function.

ewenharrison commented 5 years ago

Ok, well if you could point me to a suitable dataset and provide 2 or 3 example analyses using survey, together with the table output you would like to to see then I can have a look. Many of the examples in library(survey) help seem very complex and difficult to generalise.

larmarange commented 5 years ago

Please find below an exemple using a dataset provided by questionr package.


library(questionr)
data(hdv2003)

library(dplyr)
#> 
#> Attachement du package : 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
hdv2003$etud <- recode_factor(
  hdv2003$nivetud,
  "N'a jamais fait d'etudes" = "Primary",
  "A arrete ses etudes, avant la derniere annee d'etudes primaires" = "Primary",
  "Derniere annee d'etudes primaires" = "Primary",
  "1er cycle" = "Secondary",
  "2eme cycle" = "Secondary",
  "Enseignement technique ou professionnel court" = "Professionnal",
  "Enseignement technique ou professionnel long" = "Professionnal",
  "Enseignement superieur y compris technique superieur" = "High"
)

library(survey)
#> Le chargement a nécessité le package : grid
#> Le chargement a nécessité le package : Matrix
#> Le chargement a nécessité le package : survival
#> 
#> Attachement du package : 'survey'
#> The following object is masked from 'package:graphics':
#> 
#>     dotchart
dw <- svydesign(ids = ~ 1, weights = ~ poids, data = hdv2003)

# Note: summary.formula not compatible with svydesign objects

# dependent variable
freq(svytable(~ sport, dw))
#>           n    % val%
#> Non 6714760 60.7 60.7
#> Oui 4356466 39.3 39.3

# binary explanatory
tab <- svytable(~ sexe + sport, dw)
tab
#>        sport
#> sexe        Non     Oui
#>   Homme 2850210 2299173
#>   Femme 3864550 2057293
rprop(tab)
#>           sport
#> sexe       Non   Oui   Total
#>   Homme     55.4  44.6 100.0
#>   Femme     65.3  34.7 100.0
#>   Ensemble  60.7  39.3 100.0
svychisq(~ sexe + sport, dw)
#> 
#>  Pearson's X^2: Rao & Scott adjustment
#> 
#> data:  svychisq(~sexe + sport, dw)
#> F = 12.444, ndf = 1, ddf = 1999, p-value = 0.0004289

# explanotory with more modalities
tab <- svytable(~ etud + sport, dw)
tab
#>                sport
#> etud                  Non       Oui
#>   Primary       1978619.6  234477.9
#>   Secondary     1428124.8  671021.2
#>   Professionnal 2050834.3 1315064.1
#>   High          1071009.1 1504802.3
rprop(tab)
#>                sport
#> etud            Non   Oui   Total
#>   Primary        89.4  10.6 100.0
#>   Secondary      68.0  32.0 100.0
#>   Professionnal  60.9  39.1 100.0
#>   High           41.6  58.4 100.0
#>   Ensemble       63.7  36.3 100.0
svychisq(~ etud + sport, dw)
#> 
#>  Pearson's X^2: Rao & Scott adjustment
#> 
#> data:  svychisq(~etud + sport, dw)
#> F = 45.593, ndf = 2.9831, ddf = 5963.2106, p-value < 2.2e-16

# quantitative explanatory
svyby(~ heures.tv, ~ sport, dw, svymean, na.rm = TRUE) # mean
#>     sport heures.tv         se
#> Non   Non  2.417582 0.06689688
#> Oui   Oui  1.785640 0.06573058
svyttest(heures.tv ~ sport, dw)
#> 
#>  Design-based t-test
#> 
#> data:  heures.tv ~ sport
#> t = -6.7382, df = 1993, p-value = 2.092e-11
#> alternative hypothesis: true difference in mean is not equal to 0
#> 95 percent confidence interval:
#>  -0.8157581 -0.4481261
#> sample estimates:
#> difference in mean 
#>         -0.6319421

svyby(~ heures.tv, ~ sport, dw, svyquantile, quantiles = 1:3/4, ci = TRUE, na.rm = TRUE) # quartiles
#> Warning in vcov.svyquantile(X[[i]], ...): Only diagonal of vcov() available

#> Warning in vcov.svyquantile(X[[i]], ...): Only diagonal of vcov() available
#>     sport 0.25 0.5 0.75   se.0.25    se.0.5   se.0.75
#> Non   Non    1   2    3 0.0000000 0.0000000 0.2551067
#> Oui   Oui    1   2    3 0.1020427 0.1371243 0.2551067
svyranktest(heures.tv ~ sport, dw)
#> 
#>  Design-based KruskalWallis test
#> 
#> data:  heures.tv ~ sport
#> t = -6.1587, df = 1993, p-value = 8.846e-10
#> alternative hypothesis: true difference in mean rank score is not equal to 0
#> sample estimates:
#> difference in mean rank score 
#>                   -0.09905157

# binary regression
reg <- svyglm(sport ~ sexe + etud + heures.tv, family = quasibinomial(), design = dw)
questionr::odds.ratio(reg)
#>                         OR    2.5 %  97.5 %         p    
#> (Intercept)        0.18564  0.11762  0.2930 6.902e-13 ***
#> sexeFemme          0.79913  0.61522  1.0380  0.093059 .  
#> etudSecondary      3.73818  2.32314  6.0151 6.261e-08 ***
#> etudProfessionnal  4.94649  3.18799  7.6750 1.398e-12 ***
#> etudHigh          10.16897  6.43759 16.0632 < 2.2e-16 ***
#> heures.tv          0.88979  0.81820  0.9677  0.006427 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
broom::tidy(reg, exponentiate = TRUE, conf.int = TRUE)
#> # A tibble: 6 x 7
#>   term             estimate std.error statistic  p.value conf.low conf.high
#>   <chr>               <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
#> 1 (Intercept)         0.186    0.233      -7.23 6.90e-13    0.118     0.293
#> 2 sexeFemme           0.799    0.133      -1.68 9.31e- 2    0.615     1.04 
#> 3 etudSecondary       3.74     0.243       5.43 6.26e- 8    2.32      6.02 
#> 4 etudProfessionn~    4.95     0.224       7.13 1.40e-12    3.19      7.67 
#> 5 etudHigh           10.2      0.233       9.94 9.76e-23    6.44     16.1  
#> 6 heures.tv           0.890    0.0428     -2.73 6.43e- 3    0.818     0.968

larmarange commented 5 years ago

In terms of expected output, the idea would be to be able to do the following :

dep <- "sport"
expl <- c("sexe", "etud", "heures.tv")

library(finalfit)
dw %>% summary_factorlist(dep, expl, p = TRUE) 
dw %>% finalfit(dep, expl)

The table will be exactly the same as with a classic data.frame, except that all stats would have been computed using survey functions, taking into account sample weights and sample design.

Best regards

larmarange commented 5 years ago

Just for information, if you prefer working with dplyr syntax, there is a package called srvyr extening dplyr verbs to survey objects.

https://cran.r-project.org/web/packages/srvyr/vignettes/srvyr-vs-survey.html

With this package you can do things like:

dw <- as_survey_design(dw) # to add tbl_svy class to dw
dw %>% group_by(sport) %>% summarise(mean = survey_mean(heures.tv, na.rm = TRUE))

Here, the result will be a tibble.

ewenharrison commented 5 years ago

Thanks Joseph for the very useful example.

This probably gets us quite far. There are summary_factorlist() features not yet included, most notably, labels. But that would be an easy addition.

Would be tidied up and split into component functions prior to being deployed.

Interested in thoughts.

library(questionr)
data(hdv2003)

library(dplyr)
hdv2003$etud <- recode_factor(
  hdv2003$nivetud,
  "N'a jamais fait d'etudes" = "Primary",
  "A arrete ses etudes, avant la derniere annee d'etudes primaires" = "Primary",
  "Derniere annee d'etudes primaires" = "Primary",
  "1er cycle" = "Secondary",
  "2eme cycle" = "Secondary",
  "Enseignement technique ou professionnel court" = "Professionnal",
  "Enseignement technique ou professionnel long" = "Professionnal",
  "Enseignement superieur y compris technique superieur" = "High"
)

library(survey)
dw <- svydesign(ids = ~ 1, weights = ~ poids, data = hdv2003)

summary_survey <- function(.data, dependent, explanatory, cont = "mean", cont_range = FALSE, 
                           p = FALSE, column = FALSE, digits = c(1, 1, 3, 1), fit_id = FALSE){
  # Define variable type
  explanatory_type = .data$variables %>% 
    select(explanatory) %>% 
    purrr::map(is.numeric)

  # Hypothesis test
  if(p){
    p_tests = explanatory %>% 
      purrr::map2(explanatory_type,
                  ~if(!.y){
                    survey::svychisq(as.formula(paste0("~", ., "+", dependent)), .data)$p.value %>% 
                      p_tidy(digits[3])
                  } else if (.y & cont == "mean"){
                    survey::svyttest(as.formula(paste0(.x, "~", dependent)), .data)$p.value %>% 
                      p_tidy(digits[3])
                  } else if (.y & cont == "median"){
                    survey::svyranktest(as.formula(paste0(.x, "~", dependent)), .data)$p.value %>% 
                      p_tidy(digits[3])
                  }
      )
  }

  # Output table
  explanatory %>% 
    purrr::map2(explanatory_type,
                ~ if(!.y){
                  survey::svytable(as.formula(paste0("~", .x, "+", dependent)), .data) %>% 
                    as.data.frame(stringsAsFactors = FALSE) %>% 
                    { if(column) {
                      dplyr::group_by(., !! sym(dependent)) %>% 
                        dplyr::mutate(
                          total = sum(Freq),
                          prop = 100 * Freq / total
                        )
                    } else { 
                      dplyr::group_by_at(., 1) %>% 
                        dplyr::mutate(
                          total = sum(Freq),
                          prop = 100 * Freq / total
                        )
                    }
                    } %>% 
                    dplyr::mutate(
                      value = paste0(Freq %>% round_tidy(digits[4]), " (", 
                                     prop %>% round_tidy(digits[4]), ")")
                    ) %>%
                    dplyr::select(-total, -prop, -Freq) %>% 
                    tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>% 
                    dplyr::mutate(
                      label = names(.)[1]
                    ) %>% 
                    dplyr::rename(levels = 1)
                } else {
                  { if(cont == "mean") {
                    survey::svyby(as.formula(paste0("~", .x)), as.formula(paste0("~", dependent)), .data, svymean, na.rm = TRUE) %>%
                      dplyr::mutate(
                        value = paste0(!! sym(.x) %>% round_tidy(digits[1]), " (", 
                                       se %>%  round_tidy(digits[2]), ")") 
                      ) %>% 
                      dplyr::select(-c(.x, se)) %>% 
                      tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>% 
                      dplyr::mutate(
                        label = .x,
                        levels = "Mean (SE)"
                      )
                  } else if(cont == "median"){
                    survey::svyby(as.formula(paste0("~", .x)), as.formula(paste0("~", dependent)), .data, 
                                  svyquantile, quantiles = 1:3/4, ci = TRUE, na.rm = TRUE) %>% 
                      dplyr::rename(Q1 = 2,
                                    Q2 = 3,
                                    Q3 = 4) %>% 
                      dplyr::mutate(
                        IQR = Q3 - Q1) %>% 
                      { if(cont_range){
                        dplyr::mutate(., 
                                      value = paste0(Q2 %>% round_tidy(digits[1]), " (", 
                                                     Q1 %>%  round_tidy(digits[2]), " to ",
                                                     Q3 %>% round_tidy(digits[2]), ")")
                        )                        
                      } else {
                        dplyr::mutate(.,
                                      value = paste0(Q2 %>% round_tidy(digits[1]), " (", 
                                                     IQR %>%  round_tidy(digits[2]), ")")
                        )
                      }} %>% 
                      dplyr::select(-c(2:8)) %>%                      
                      tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>% 
                      dplyr::mutate(
                        label = .x,
                        levels = "Median (IQR)"
                      )
                  }
                  }
                }) %>% 

    # Add hypothesis test
    { if(p){
      purrr::map2_df(., p_tests,
                     ~ mutate(.x,
                              p = .y)
      )} else {
        dplyr::bind_rows(.)
      }} %>%
    dplyr::select(label, levels, dplyr::everything()) %>% 
    as.data.frame() %>%
    { if(fit_id){
      levels_id = .$levels
      drop = levels_id %in% c("Mean (SE)", "Median (IQR)")
      levels_id[drop] = ""
      mutate(., 
             fit_id = paste0(label, levels_id),
             index = 1:dim(.)[1])
    } else {
      .
    }} %>% 
    rm_duplicate_labels()
}

library(finalfit)
dependent = "sport"

# Choose different options to test
explanatory = c("sexe")
explanatory = c("heures.tv")
explanatory = c("sexe", "heures.tv", "etud")

dw %>% 
  summary_survey(dependent, explanatory, cont = "mean", cont_range = FALSE, p = FALSE, column = FALSE, 
                 digits = c(1, 1, 3, 0))

ff_merge(
  summary_survey(dw, dependent, explanatory, fit_id = TRUE),
  svyglmmulti(dw, dependent, explanatory, family = "quasibinomial") %>%
    fit2df(exp = TRUE, estimate_name = "OR (multivariable)"),
  last_merge = TRUE
)

larmarange commented 5 years ago

Thanks a lot. That's great!!!!

Just a small comment: for digits argument, it would be nice to consider a 5th value correspond to the number of digits for the number of observations (for example, to be able to use 0 for the number of observations while keeping 1 for percentage).

Otherwise, everything seems to behave as expected.

Thanks again

tylcole commented 4 years ago

Thank you both, I had also noted the usefulness of having weighted functions. I am glad this is a feature now!

Tyler Cole

On Tue, Oct 29, 2019 at 2:17 AM Joseph notifications@github.com wrote:

Thanks a lot. That's great!!!!

Just a small comment: for digits argument, it would be nice to consider a 5th value correspond to the number of digits for the number of observations (for example, to be able to use 0 for the number of observations while keeping 1 for percentage).

Otherwise, everything seems to behave as expected.

Thanks again

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ewenharrison/finalfit/issues/13?email_source=notifications&email_token=AC7MOSCMNDVZDXNNFCKMGNLQQ75TLA5CNFSM4GRDRFK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOECPY2OQ#issuecomment-547327290, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC7MOSGYNEHGB7ZB4TJRTJTQQ75TLANCNFSM4GRDRFKQ .

nobu-nsk commented 4 years ago

Thank you very much for this excellent development. I also needed exactly this functionality.

I noticed that the currently Mean (SE) is used instead of (SD). Definitely, more often we need SD for reporting together with means. Unfortunately, svymean package does not provide SD but it can be derived easily from svyvar:

svyvar(~x,design)[1] %>% sqrt() or svyby(~x, ~dependent, design, svyvar, na.rm = TRUE)[2] %>% sqrt()

I would be great to have Mean (SD), which is also ensuring consistency with summary_factorlist().

Nobu Nishikiori

tylcole commented 4 years ago

I very much agree with this

On Sun, Nov 24, 2019 at 10:46 nobu-nsk notifications@github.com wrote:

Thank you very much for this excellent development. I also needed exactly this functionality.

I noticed that the currently Mean (SE) is used instead of (SD). Definitely, more often we need SD for reporting together with means. Unfortunately, svymean package does not provide SD but it can be derived easily from svyvar:

svyvar(~x,design)[1] %>% sqrt() or svyby(~x, ~dependent, design, svyvar, na.rm = TRUE)[2] %>% sqrt()

I would be great to have Mean (SD), which is also ensuring consistency with summary_factorlist().

Nobu Nishikiori

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ewenharrison/finalfit/issues/13?email_source=notifications&email_token=AC7MOSGM3QAV6R5IWXY2EG3QVKOWLA5CNFSM4GRDRFK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFAOICI#issuecomment-557900809, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC7MOSDKYRMTSN6OJBQ6XXLQVKOWLANCNFSM4GRDRFKQ .

-- Tyler Cole

ewenharrison commented 4 years ago

Thanks both. Just to confirm, there is not a library(survey) function that will give both the mean and the sd in the same function? I have re-written the original summary_factorlist() to remove some old dependencies and incorporate requested features. We'll get that and this out soon. Thanks again. Ewen

tylcole commented 4 years ago

I only know of: https://rdrr.io/cran/jtools/man/svysd.html

On Sun, Nov 24, 2019 at 15:15 ewenharrison notifications@github.com wrote:

Thanks both. Just to confirm, there is not a library(survey) function that will give both the mean and the sd in the same function? I have re-written the original summary_factorlist() to remove some old dependencies and incorporate requested features. We'll get that and this out soon. Thanks again. Ewen

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/ewenharrison/finalfit/issues/13?email_source=notifications&email_token=AC7MOSHHHCQWZMGAZNWQB3TQVLOHTA5CNFSM4GRDRFK2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEFAT2YY#issuecomment-557923683, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC7MOSBW34EBEXJ3EQADQBTQVLOHTANCNFSM4GRDRFKQ .

-- Tyler Cole

nobu-nsk commented 4 years ago

I believe so.

Just to confirm, there is not a library(survey) function that will give both the mean and the sd in the same function? Thank you very much for incorporating the feature. Nobu

e-ibrahim commented 4 years ago

Hi @ewenharrison

Please, I am a newbie to R and I am trying to adapt your "summary_survey(dw, dependent, explanatory, fit_id = TRUE)" solution to my work with below parameters:

surveydes=svydesign(id=~psu, strata=~strata, weights=~pweight, data=data) dependent="knowledge" explanatory=c("agegroup", "married", "children")

knowledge (factor var: knowledgeable vs. not knowledgeable)

agegroup (factor var: 15-24 vs. 25-34, 35-49)

married (factor var: never married vs. ever married)

children (integer: 0, 1, 2, ...,19

surveydes %>% summary_survey(dependent, explanatory, cont = "mean", cont_range = FALSE, p = FALSE, column = FALSE, digits = c(1, 1, 3, 0))

Each time I do run my code, it returns: "Error in select(., explanatory) : unused argument (explanatory)"

I copied and pasted your "summary_survey" source codes in my R environment. But, I am certain there is something I have either misapplied or am not doing right. Could you please advise?

Thank you, Elhakim

ewenharrison commented 4 years ago

Hi, not sure why it is not working for you. Do you have a reproducible example?

ABSANDIE commented 2 years ago

Thanks Joseph for the very useful example.

This probably gets us quite far. There are summary_factorlist() features not yet included, most notably, labels. But that would be an easy addition.

Would be tidied up and split into component functions prior to being deployed.

Interested in thoughts.

library(questionr)
data(hdv2003)

library(dplyr)
hdv2003$etud <- recode_factor(
  hdv2003$nivetud,
  "N'a jamais fait d'etudes" = "Primary",
  "A arrete ses etudes, avant la derniere annee d'etudes primaires" = "Primary",
  "Derniere annee d'etudes primaires" = "Primary",
  "1er cycle" = "Secondary",
  "2eme cycle" = "Secondary",
  "Enseignement technique ou professionnel court" = "Professionnal",
  "Enseignement technique ou professionnel long" = "Professionnal",
  "Enseignement superieur y compris technique superieur" = "High"
)

library(survey)
dw <- svydesign(ids = ~ 1, weights = ~ poids, data = hdv2003)

summary_survey <- function(.data, dependent, explanatory, cont = "mean", cont_range = FALSE, 
                           p = FALSE, column = FALSE, digits = c(1, 1, 3, 1), fit_id = FALSE){
  # Define variable type
  explanatory_type = .data$variables %>% 
    select(explanatory) %>% 
    purrr::map(is.numeric)

  # Hypothesis test
  if(p){
    p_tests = explanatory %>% 
      purrr::map2(explanatory_type,
                  ~if(!.y){
                    survey::svychisq(as.formula(paste0("~", ., "+", dependent)), .data)$p.value %>% 
                      p_tidy(digits[3])
                  } else if (.y & cont == "mean"){
                    survey::svyttest(as.formula(paste0(.x, "~", dependent)), .data)$p.value %>% 
                      p_tidy(digits[3])
                  } else if (.y & cont == "median"){
                    survey::svyranktest(as.formula(paste0(.x, "~", dependent)), .data)$p.value %>% 
                      p_tidy(digits[3])
                  }
      )
  }

  # Output table
  explanatory %>% 
    purrr::map2(explanatory_type,
                ~ if(!.y){
                  survey::svytable(as.formula(paste0("~", .x, "+", dependent)), .data) %>% 
                    as.data.frame(stringsAsFactors = FALSE) %>% 
                    { if(column) {
                      dplyr::group_by(., !! sym(dependent)) %>% 
                        dplyr::mutate(
                          total = sum(Freq),
                          prop = 100 * Freq / total
                        )
                    } else { 
                      dplyr::group_by_at(., 1) %>% 
                        dplyr::mutate(
                          total = sum(Freq),
                          prop = 100 * Freq / total
                        )
                    }
                    } %>% 
                    dplyr::mutate(
                      value = paste0(Freq %>% round_tidy(digits[4]), " (", 
                                     prop %>% round_tidy(digits[4]), ")")
                    ) %>%
                    dplyr::select(-total, -prop, -Freq) %>% 
                    tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>% 
                    dplyr::mutate(
                      label = names(.)[1]
                    ) %>% 
                    dplyr::rename(levels = 1)
                } else {
                  { if(cont == "mean") {
                    survey::svyby(as.formula(paste0("~", .x)), as.formula(paste0("~", dependent)), .data, svymean, na.rm = TRUE) %>%
                      dplyr::mutate(
                        value = paste0(!! sym(.x) %>% round_tidy(digits[1]), " (", 
                                       se %>%  round_tidy(digits[2]), ")") 
                      ) %>% 
                      dplyr::select(-c(.x, se)) %>% 
                      tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>% 
                      dplyr::mutate(
                        label = .x,
                        levels = "Mean (SE)"
                      )
                  } else if(cont == "median"){
                    survey::svyby(as.formula(paste0("~", .x)), as.formula(paste0("~", dependent)), .data, 
                                  svyquantile, quantiles = 1:3/4, ci = TRUE, na.rm = TRUE) %>% 
                      dplyr::rename(Q1 = 2,
                                    Q2 = 3,
                                    Q3 = 4) %>% 
                      dplyr::mutate(
                        IQR = Q3 - Q1) %>% 
                      { if(cont_range){
                        dplyr::mutate(., 
                                      value = paste0(Q2 %>% round_tidy(digits[1]), " (", 
                                                     Q1 %>%  round_tidy(digits[2]), " to ",
                                                     Q3 %>% round_tidy(digits[2]), ")")
                        )                        
                      } else {
                        dplyr::mutate(.,
                                      value = paste0(Q2 %>% round_tidy(digits[1]), " (", 
                                                     IQR %>%  round_tidy(digits[2]), ")")
                        )
                      }} %>% 
                      dplyr::select(-c(2:8)) %>%                      
                      tidyr::pivot_wider(names_from = !! dependent, values_from = value) %>% 
                      dplyr::mutate(
                        label = .x,
                        levels = "Median (IQR)"
                      )
                  }
                  }
                }) %>% 

    # Add hypothesis test
    { if(p){
      purrr::map2_df(., p_tests,
                     ~ mutate(.x,
                              p = .y)
      )} else {
        dplyr::bind_rows(.)
      }} %>%
    dplyr::select(label, levels, dplyr::everything()) %>% 
    as.data.frame() %>%
    { if(fit_id){
      levels_id = .$levels
      drop = levels_id %in% c("Mean (SE)", "Median (IQR)")
      levels_id[drop] = ""
      mutate(., 
             fit_id = paste0(label, levels_id),
             index = 1:dim(.)[1])
    } else {
      .
    }} %>% 
    rm_duplicate_labels()
}

library(finalfit)
dependent = "sport"

# Choose different options to test
explanatory = c("sexe")
explanatory = c("heures.tv")
explanatory = c("sexe", "heures.tv", "etud")

dw %>% 
  summary_survey(dependent, explanatory, cont = "mean", cont_range = FALSE, p = FALSE, column = FALSE, 
                 digits = c(1, 1, 3, 0))

ff_merge(
  summary_survey(dw, dependent, explanatory, fit_id = TRUE),
  svyglmmulti(dw, dependent, explanatory, family = "quasibinomial") %>%
    fit2df(exp = TRUE, estimate_name = "OR (multivariable)"),
  last_merge = TRUE
)

Many thanks for "summary_survey", it is a very useful function. Please I would like to have the row/col total in the output table.

ABSANDIE commented 2 years ago

Many thanks for "summary_survey", it is a very useful function. Please I would like to have the row/col total in the output table.

ewenharrison commented 2 years ago

Many thanks for writing. Yes, this is not yet implemented. I guess you would need an observed total as well as a weighted total? Best wishes, Ewen

ABSANDIE commented 2 years ago

@ewenharrison thanks for the reply Yes, it would be great to have both totals if possible.

ewenharrison commented 2 years ago

Do you have published examples of how these tables are normally presented?

ABSANDIE commented 2 years ago

label | levels | X1 | X2 | Total -- | -- | -- | -- | -- Sexe | Féminin | 761.6 (77.1) | 23242.2 (85.7) | 24003.8(81.4) | Masculin | 226.6 (22.9) | 3881.9 (14.3) | 4108.5 (18.6) | Total | 988.2(100.0) | 27124.1(100.0) | 28112.3(100.0)

ABSANDIE commented 2 years ago

@ewenharrison please I think the weighted total will be enough. Since the observed total could be obtained from the summary_factorlist function

ewenharrison commented 2 years ago

Hi, thanks for emailing about this again. Sorry if I misled, there is no plan for me to develop this function at the moment. I guess there are a lot of people out there doing this sort of things - can't you find anything?

If you simply want the totals from the weighted analysis, could you not do this?

library(questionr)
library(tidyverse)
library(finalfit)
library(survey)
data(hdv2003)
dw <- svydesign(ids = ~ 1, weights = ~ poids, data = hdv2003)
dependent = "sport"
explanatory = c("sexe")

dw %>% 
    summary_survey(dependent, explanatory) %>% 
    bind_cols(Total = svytotal(as.formula(paste0("~", explanatory)), dw))
label levels              Non              Oui   Total
1  sexe  Homme 2850209.9 (55.4) 2299172.5 (44.6) 5149382
2        Femme 3864550.4 (65.3) 2057293.5 (34.7) 5921844

larmarange commented 2 years ago

A solution for weighted tables is tbl_svysummary() in gtsummary package

Although the implementation is different, the philosophy of gtsummary is similar and you can obtain publication ready tables.

Regards

Joseph Larmarange

Le mar. 10 mai 2022 à 11:31, ewenharrison @.***> a écrit :

Hi, thanks for emailing about this again. Sorry if I misled, there is no plan for me to develop this function at the moment. I guess there are a lot of people out there doing this sort of things - can't you find anything?

If you simply want the totals from the weighted analysis, could you not do this?

library(questionr) library(tidyverse) library(finalfit) library(survey) data(hdv2003) dw <- svydesign(ids = ~ 1, weights = ~ poids, data = hdv2003) dependent = "sport" explanatory = c("sexe")

dw %>% summary_survey(dependent, explanatory) %>% bind_cols(Total = svytotal(as.formula(paste0("~", explanatory)), dw)) label levels Non Oui Total 1 sexe Homme 2850209.9 (55.4) 2299172.5 (44.6) 5149382 2 Femme 3864550.4 (65.3) 2057293.5 (34.7) 5921844

— Reply to this email directly, view it on GitHub https://github.com/ewenharrison/finalfit/issues/13#issuecomment-1122268397, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAHL5I6EZ6CTV7VJYKK7CILVJJCHXANCNFSM4GRDRFKQ . You are receiving this because you were mentioned.Message ID: @.***>

ewenharrison / finalfit

Weighted tables? #13

Specify the sampling design with sampling weights DISCWT,

hospital clusters HOSP_NRD, and stratification NRD_STRATUM

knowledge (factor var: knowledgeable vs. not knowledgeable)

agegroup (factor var: 15-24 vs. 25-34, 35-49)

married (factor var: never married vs. ever married)

children (integer: 0, 1, 2, ...,19