gergness / srvyr

R package to add 'dplyr'-like Syntax for Summary Statistics of Survey Data
214 stars 28 forks source link

interact & cascade don't play nicely #133

Closed gergness closed 2 years ago

gergness commented 3 years ago

Reported by @szimmer as part of #131

library(srvyr)
data(api, package = "survey")

dstrata <- apistrat %>%
  as_survey(strata = stype, weights = pw)

dstrata %>%
  group_by(interact(stype, awards)) %>%
  cascade(x = survey_mean())
#> Warning in if (class(.data$variables[[ggg]]) == "factor" & !is.na(.fill)) {: the
#> condition has length > 1 and only the first element will be used
#> # A tibble: 7 × 5
#>   stype awards      x    x_se `interact(stype, awards)`
#>   <fct> <fct>   <dbl>   <dbl> <lgl>                    
#> 1 E     No     0.193  0.0318  NA                       
#> 2 E     Yes    0.521  0.0318  NA                       
#> 3 H     No     0.0829 0.00812 NA                       
#> 4 H     Yes    0.0390 0.00812 NA                       
#> 5 M     No     0.0855 0.0117  NA                       
#> 6 M     Yes    0.0789 0.0117  NA                       
#> 7 <NA>  <NA>   1      0       NA
szimmer commented 3 years ago

I thought I would share what some other software does with marginals as guidance for how cascade might handle interact.

SUDAAN:

------------------------------------------
STYPE_NUM                                 
   AWARDS_NUM          Tot        SE Tot  
                       Percent    Percent 
------------------------------------------
Total                                     
   Total                 100.00       0.00
   Yes                    63.89       3.49
   No                     36.11       3.49
E                                         
   Total                  71.38       0.00
   Yes                    52.10       3.18
   No                     19.27       3.18
M                                         
   Total                  16.44       0.00
   Yes                     7.89       1.17
   No                      8.55       1.17
H                                         
   Total                  12.19       0.00
   Yes                     3.90       0.81
   No                      8.29       0.81
------------------------------------------

SAS (proc surveyfreq):

stype_num awards_num Freq SE Freq Percent SE Percent
E Yes 3227 197.263 52.1041 3.1847
  No 1194 197.263 19.2714 3.1847
  Total 4421 0 71.3755 0
M Yes 488.64 72.6561 7.8889 1.173
  No 529.36 72.6561 8.5463 1.173
  Total 1018 0 16.4353 0
H Yes 241.6 50.3128 3.9005 0.8123
  No 513.4 50.3128 8.2887 0.8123
  Total 755 0 12.1892 0
Total Yes 3958 216.155 63.8936 3.4898
  No 2236 216.155 36.1064 3.4898
  Total 6194 0 100  
gergness commented 3 years ago

Thanks! I think that clarifies what the 2 variable case should be, but do SAS or SUDAAN let you make a 3 (or more) way table here? Do you have thoughts on how it should work?

library(srvyr)
library(dplyr)
data(api, package = "survey")

dstrata <- apistrat %>%
  as_survey(strata = stype, weights = pw)

# dstrata %>%
#   group_by(interact(stype, awards)) %>%
#   summarize(x = survey_mean())

# Seems like it should summarize by 4 separate groupings:
# - interact(stype, awards)
# - stype
# - awards
# - None

# Equal to:
bind_rows(
  dstrata %>% group_by(interact(stype, awards)) %>% summarize(x = survey_mean()),
  dstrata %>% group_by(stype) %>% summarize(x = survey_mean()),
  dstrata %>% group_by(awards) %>% summarize(x = survey_mean()),
  dstrata %>% summarize(x = survey_mean())
)
#> # A tibble: 12 × 4
#>    stype awards      x    x_se
#>    <fct> <fct>   <dbl>   <dbl>
#>  1 E     No     0.193  0.0318 
#>  2 E     Yes    0.521  0.0318 
#>  3 H     No     0.0829 0.00812
#>  4 H     Yes    0.0390 0.00812
#>  5 M     No     0.0855 0.0117 
#>  6 M     Yes    0.0789 0.0117 
#>  7 E     <NA>   0.714  0      
#>  8 H     <NA>   0.122  0      
#>  9 M     <NA>   0.164  0      
#> 10 <NA>  No     0.361  0.0349 
#> 11 <NA>  Yes    0.639  0.0349 
#> 12 <NA>  <NA>   1      0

Three variable case is harder to reason about though. If a user specifies 3 variables in an interact term, I kind of think it should give the 2 variable interactions as well as single variables and none.

# But what about when there are 3 variables?
# dstrata %>%
#   group_by(interact(stype, awards, yr.rnd)) %>%
#   summarize(x = survey_mean())

# Should there be all of these?
# - interact(stype, awards, yr.rnd)
# - interact(stype, awards)
# - interact(stype, yr.rnd)
# - interact(awards, yr.rnd)
# - stype
# - awards
# - yr.rnd
# - None

bind_rows(
  dstrata %>% group_by(interact(stype, awards, yr.rnd)) %>% summarize(x = survey_mean()),
  dstrata %>% group_by(interact(stype, awards)) %>% summarize(x = survey_mean()),
  dstrata %>% group_by(interact(stype, yr.rnd)) %>% summarize(x = survey_mean()),
  dstrata %>% group_by(interact(awards, yr.rnd)) %>% summarize(x = survey_mean()),
  dstrata %>% group_by(stype) %>% summarize(x = survey_mean()),
  dstrata %>% group_by(awards) %>% summarize(x = survey_mean()),
  dstrata %>% group_by(yr.rnd) %>% summarize(x = survey_mean()),
  dstrata %>% summarize(x = survey_mean())
)  %>% print(n = 35)
#> # A tibble: 35 × 5
#>    stype awards yr.rnd       x    x_se
#>    <fct> <fct>  <fct>    <dbl>   <dbl>
#>  1 E     No     No     0.171   0.0306 
#>  2 E     No     Yes    0.0214  0.0122 
#>  3 E     Yes    No     0.414   0.0354 
#>  4 E     Yes    Yes    0.107   0.0256 
#>  5 H     No     No     0.0804  0.00825
#>  6 H     No     Yes    0.00244 0.00244
#>  7 H     Yes    No     0.0390  0.00812
#>  8 M     No     No     0.0822  0.0117 
#>  9 M     No     Yes    0.00329 0.00329
#> 10 M     Yes    No     0.0756  0.0117 
#> 11 M     Yes    Yes    0.00329 0.00329
#> 12 E     No     <NA>   0.193   0.0318 
#> 13 E     Yes    <NA>   0.521   0.0318 
#> 14 H     No     <NA>   0.0829  0.00812
#> 15 H     Yes    <NA>   0.0390  0.00812
#> 16 M     No     <NA>   0.0855  0.0117 
#> 17 M     Yes    <NA>   0.0789  0.0117 
#> 18 E     <NA>   No     0.585   0.0276 
#> 19 E     <NA>   Yes    0.128   0.0276 
#> 20 H     <NA>   No     0.119   0.00244
#> 21 H     <NA>   Yes    0.00244 0.00244
#> 22 M     <NA>   No     0.158   0.00460
#> 23 M     <NA>   Yes    0.00657 0.00460
#> 24 <NA>  No     No     0.334   0.0338 
#> 25 <NA>  No     Yes    0.0271  0.0129 
#> 26 <NA>  Yes    No     0.529   0.0382 
#> 27 <NA>  Yes    Yes    0.110   0.0258 
#> 28 E     <NA>   <NA>   0.714   0      
#> 29 H     <NA>   <NA>   0.122   0      
#> 30 M     <NA>   <NA>   0.164   0      
#> 31 <NA>  No     <NA>   0.361   0.0349 
#> 32 <NA>  Yes    <NA>   0.639   0.0349 
#> 33 <NA>  <NA>   No     0.863   0.0280 
#> 34 <NA>  <NA>   Yes    0.137   0.0280 
#> 35 <NA>  <NA>   <NA>   1       0

And I think it makes sense to unpeel the ones outside of an interact like this:

# Or mix of interacted and not?
# dstrata %>%
#   group_by(interact(stype, awards), yr.rnd) %>%
#   summarize(x = survey_mean())

# Just these?
# - interact(stype, awards) yr.rnd
# - interact(stype, awards)
# - stype
# - awards
# - None

# dstrata %>%
#   group_by(yr.rnd, interact(stype, awards)) %>%
#   summarize(x = survey_mean())
# Just these?
# - yr.rnd interact(stype, awards)
# - yr.rnd stype
# - yr.rnd awards
# - yr.rnd
# - None

Possibly should add a .groupings argument to cascade so that users can override:

# dstrata %>%
#   cascade(x = survey_mean(), .groupings = list(interact(stype, awards), NULL))

# - interact(stype, awards)
# - None

# (should warn if used on a tbl_df that's already grouped, since existing groups will be ignored)
szimmer commented 3 years ago

In SAS and SUDAAN, the order matters and something written like var3var2var1 would have var3 as a grouping (stratification) variable as seen in the output below. This might be more what you would expect to see in srvyr with group_by(var3, interact(var2, var1)). SAS behaves the same way so not including that output.

SUDAAN code and abbreviated output:

proc crosstab data=apistrat design=strwr;
weight pw;
nest stype_num;
class stype_num awards_num yr_rnd_num;
table stype_num*awards_num*yr_rnd_num;
print totper setot/style=nchs;
run;
for: STYPE = Total.                          

------------------------------------------       
AWARDS                                       
   YR_RND          Tot        SE Tot         
                       Percent    Percent        
------------------------------------------       
Total                                            
   Total                 100.00       0.00       
   Yes                    13.75       2.80       
   No                     86.25       2.80       
Yes                                              
   Total                  63.89       3.49       
   Yes                    11.04       2.58       
   No                     52.86       3.82       
No                                               
   Total                  36.11       3.49       
   Yes                     2.71       1.29       
   No                     33.39       3.38       
------------------------------------------       

for: STYPE = E.                              

------------------------------------------       
AWARDS                                       
   YR_RND          Tot        SE Tot         
                       Percent    Percent        
------------------------------------------       
Total                                            
   Total                 100.00       0.00       
   Yes                    18.00       3.86       
   No                     82.00       3.86       
Yes                                              
   Total                  73.00       4.46       
   Yes                    15.00       3.59       
   No                     58.00       4.96       
No                                               
   Total                  27.00       4.46       
   Yes                     3.00       1.71       
   No                     24.00       4.29       
------------------------------------------       

for: STYPE = M.                              

------------------------------------------       
AWARDS                                       
   YR_RND          Tot        SE Tot         
                       Percent    Percent        
------------------------------------------       
Total                                            
   Total                 100.00       0.00       
   Yes                     4.00       2.80       
   No                     96.00       2.80       
Yes                                              
   Total                  48.00       7.14       
   Yes                     2.00       2.00       
   No                     46.00       7.12       
No                                               
   Total                  52.00       7.14       
   Yes                     2.00       2.00       
   No                     50.00       7.14       
------------------------------------------       

for: STYPE = H.                              

------------------------------------------       
AWARDS                                       
   YR_RND          Tot        SE Tot         
                       Percent    Percent        
------------------------------------------       
Total                                            
   Total                 100.00       0.00       
   Yes                     2.00       2.00       
   No                     98.00       2.00       
Yes                                              
   Total                  32.00       6.66       
   Yes                     0.00       0.00       
   No                     32.00       6.66       
No                                               
   Total                  68.00       6.66       
   Yes                     2.00       2.00       
   No                     66.00       6.77       
------------------------------------------  
gergness commented 3 years ago

Ha, yeah, I guess the 3 way interaction is a little overboard, but I think it makes sense within the grammar of srvyr. Without the concept of row/column/total percents, srvyr needs the different peeling behaviors to allow for the same calculations. Then as a consequence, srvyr allows for a "table-total" percent that total across tabs (which doesn't seem possible in SUDAAN).

Thanks for those examples!