Closed gergness closed 2 years ago
I thought I would share what some other software does with marginals as guidance for how cascade might handle interact.
SUDAAN:
------------------------------------------
STYPE_NUM
AWARDS_NUM Tot SE Tot
Percent Percent
------------------------------------------
Total
Total 100.00 0.00
Yes 63.89 3.49
No 36.11 3.49
E
Total 71.38 0.00
Yes 52.10 3.18
No 19.27 3.18
M
Total 16.44 0.00
Yes 7.89 1.17
No 8.55 1.17
H
Total 12.19 0.00
Yes 3.90 0.81
No 8.29 0.81
------------------------------------------
SAS (proc surveyfreq):
stype_num | awards_num | Freq | SE Freq | Percent | SE Percent |
---|---|---|---|---|---|
E | Yes | 3227 | 197.263 | 52.1041 | 3.1847 |
No | 1194 | 197.263 | 19.2714 | 3.1847 | |
Total | 4421 | 0 | 71.3755 | 0 | |
M | Yes | 488.64 | 72.6561 | 7.8889 | 1.173 |
No | 529.36 | 72.6561 | 8.5463 | 1.173 | |
Total | 1018 | 0 | 16.4353 | 0 | |
H | Yes | 241.6 | 50.3128 | 3.9005 | 0.8123 |
No | 513.4 | 50.3128 | 8.2887 | 0.8123 | |
Total | 755 | 0 | 12.1892 | 0 | |
Total | Yes | 3958 | 216.155 | 63.8936 | 3.4898 |
No | 2236 | 216.155 | 36.1064 | 3.4898 | |
Total | 6194 | 0 | 100 |
Thanks! I think that clarifies what the 2 variable case should be, but do SAS or SUDAAN let you make a 3 (or more) way table here? Do you have thoughts on how it should work?
library(srvyr)
library(dplyr)
data(api, package = "survey")
dstrata <- apistrat %>%
as_survey(strata = stype, weights = pw)
# dstrata %>%
# group_by(interact(stype, awards)) %>%
# summarize(x = survey_mean())
# Seems like it should summarize by 4 separate groupings:
# - interact(stype, awards)
# - stype
# - awards
# - None
# Equal to:
bind_rows(
dstrata %>% group_by(interact(stype, awards)) %>% summarize(x = survey_mean()),
dstrata %>% group_by(stype) %>% summarize(x = survey_mean()),
dstrata %>% group_by(awards) %>% summarize(x = survey_mean()),
dstrata %>% summarize(x = survey_mean())
)
#> # A tibble: 12 × 4
#> stype awards x x_se
#> <fct> <fct> <dbl> <dbl>
#> 1 E No 0.193 0.0318
#> 2 E Yes 0.521 0.0318
#> 3 H No 0.0829 0.00812
#> 4 H Yes 0.0390 0.00812
#> 5 M No 0.0855 0.0117
#> 6 M Yes 0.0789 0.0117
#> 7 E <NA> 0.714 0
#> 8 H <NA> 0.122 0
#> 9 M <NA> 0.164 0
#> 10 <NA> No 0.361 0.0349
#> 11 <NA> Yes 0.639 0.0349
#> 12 <NA> <NA> 1 0
Three variable case is harder to reason about though. If a user specifies 3 variables in an interact term, I kind of think it should give the 2 variable interactions as well as single variables and none.
# But what about when there are 3 variables?
# dstrata %>%
# group_by(interact(stype, awards, yr.rnd)) %>%
# summarize(x = survey_mean())
# Should there be all of these?
# - interact(stype, awards, yr.rnd)
# - interact(stype, awards)
# - interact(stype, yr.rnd)
# - interact(awards, yr.rnd)
# - stype
# - awards
# - yr.rnd
# - None
bind_rows(
dstrata %>% group_by(interact(stype, awards, yr.rnd)) %>% summarize(x = survey_mean()),
dstrata %>% group_by(interact(stype, awards)) %>% summarize(x = survey_mean()),
dstrata %>% group_by(interact(stype, yr.rnd)) %>% summarize(x = survey_mean()),
dstrata %>% group_by(interact(awards, yr.rnd)) %>% summarize(x = survey_mean()),
dstrata %>% group_by(stype) %>% summarize(x = survey_mean()),
dstrata %>% group_by(awards) %>% summarize(x = survey_mean()),
dstrata %>% group_by(yr.rnd) %>% summarize(x = survey_mean()),
dstrata %>% summarize(x = survey_mean())
) %>% print(n = 35)
#> # A tibble: 35 × 5
#> stype awards yr.rnd x x_se
#> <fct> <fct> <fct> <dbl> <dbl>
#> 1 E No No 0.171 0.0306
#> 2 E No Yes 0.0214 0.0122
#> 3 E Yes No 0.414 0.0354
#> 4 E Yes Yes 0.107 0.0256
#> 5 H No No 0.0804 0.00825
#> 6 H No Yes 0.00244 0.00244
#> 7 H Yes No 0.0390 0.00812
#> 8 M No No 0.0822 0.0117
#> 9 M No Yes 0.00329 0.00329
#> 10 M Yes No 0.0756 0.0117
#> 11 M Yes Yes 0.00329 0.00329
#> 12 E No <NA> 0.193 0.0318
#> 13 E Yes <NA> 0.521 0.0318
#> 14 H No <NA> 0.0829 0.00812
#> 15 H Yes <NA> 0.0390 0.00812
#> 16 M No <NA> 0.0855 0.0117
#> 17 M Yes <NA> 0.0789 0.0117
#> 18 E <NA> No 0.585 0.0276
#> 19 E <NA> Yes 0.128 0.0276
#> 20 H <NA> No 0.119 0.00244
#> 21 H <NA> Yes 0.00244 0.00244
#> 22 M <NA> No 0.158 0.00460
#> 23 M <NA> Yes 0.00657 0.00460
#> 24 <NA> No No 0.334 0.0338
#> 25 <NA> No Yes 0.0271 0.0129
#> 26 <NA> Yes No 0.529 0.0382
#> 27 <NA> Yes Yes 0.110 0.0258
#> 28 E <NA> <NA> 0.714 0
#> 29 H <NA> <NA> 0.122 0
#> 30 M <NA> <NA> 0.164 0
#> 31 <NA> No <NA> 0.361 0.0349
#> 32 <NA> Yes <NA> 0.639 0.0349
#> 33 <NA> <NA> No 0.863 0.0280
#> 34 <NA> <NA> Yes 0.137 0.0280
#> 35 <NA> <NA> <NA> 1 0
And I think it makes sense to unpeel the ones outside of an interact like this:
# Or mix of interacted and not?
# dstrata %>%
# group_by(interact(stype, awards), yr.rnd) %>%
# summarize(x = survey_mean())
# Just these?
# - interact(stype, awards) yr.rnd
# - interact(stype, awards)
# - stype
# - awards
# - None
# dstrata %>%
# group_by(yr.rnd, interact(stype, awards)) %>%
# summarize(x = survey_mean())
# Just these?
# - yr.rnd interact(stype, awards)
# - yr.rnd stype
# - yr.rnd awards
# - yr.rnd
# - None
Possibly should add a .groupings
argument to cascade so that users can override:
# dstrata %>%
# cascade(x = survey_mean(), .groupings = list(interact(stype, awards), NULL))
# - interact(stype, awards)
# - None
# (should warn if used on a tbl_df that's already grouped, since existing groups will be ignored)
In SAS and SUDAAN, the order matters and something written like var3var2var1 would have var3 as a grouping (stratification) variable as seen in the output below. This might be more what you would expect to see in srvyr with group_by(var3, interact(var2, var1))
. SAS behaves the same way so not including that output.
SUDAAN code and abbreviated output:
proc crosstab data=apistrat design=strwr;
weight pw;
nest stype_num;
class stype_num awards_num yr_rnd_num;
table stype_num*awards_num*yr_rnd_num;
print totper setot/style=nchs;
run;
for: STYPE = Total.
------------------------------------------
AWARDS
YR_RND Tot SE Tot
Percent Percent
------------------------------------------
Total
Total 100.00 0.00
Yes 13.75 2.80
No 86.25 2.80
Yes
Total 63.89 3.49
Yes 11.04 2.58
No 52.86 3.82
No
Total 36.11 3.49
Yes 2.71 1.29
No 33.39 3.38
------------------------------------------
for: STYPE = E.
------------------------------------------
AWARDS
YR_RND Tot SE Tot
Percent Percent
------------------------------------------
Total
Total 100.00 0.00
Yes 18.00 3.86
No 82.00 3.86
Yes
Total 73.00 4.46
Yes 15.00 3.59
No 58.00 4.96
No
Total 27.00 4.46
Yes 3.00 1.71
No 24.00 4.29
------------------------------------------
for: STYPE = M.
------------------------------------------
AWARDS
YR_RND Tot SE Tot
Percent Percent
------------------------------------------
Total
Total 100.00 0.00
Yes 4.00 2.80
No 96.00 2.80
Yes
Total 48.00 7.14
Yes 2.00 2.00
No 46.00 7.12
No
Total 52.00 7.14
Yes 2.00 2.00
No 50.00 7.14
------------------------------------------
for: STYPE = H.
------------------------------------------
AWARDS
YR_RND Tot SE Tot
Percent Percent
------------------------------------------
Total
Total 100.00 0.00
Yes 2.00 2.00
No 98.00 2.00
Yes
Total 32.00 6.66
Yes 0.00 0.00
No 32.00 6.66
No
Total 68.00 6.66
Yes 2.00 2.00
No 66.00 6.77
------------------------------------------
Ha, yeah, I guess the 3 way interaction is a little overboard, but I think it makes sense within the grammar of srvyr. Without the concept of row/column/total percents, srvyr needs the different peeling behaviors to allow for the same calculations. Then as a consequence, srvyr allows for a "table-total" percent that total across tabs (which doesn't seem possible in SUDAAN).
Thanks for those examples!
Reported by @szimmer as part of #131