ProjectMOSAIC / mosaic

Project MOSAIC R package
http://mosaic-web.org/

check documentation/behavior of tally with format = "proportion" #606

Closed rpruim closed 8 years ago

rpruim commented 8 years ago

Currently the following give different results:

tally(sex ~ substance, data=HELPrct, format="proportion", margins = TRUE)
##        substance
## sex         alcohol    cocaine     heroin      Total
##  female 0.07947020 0.09050773 0.06622517 0.23620309
##  male   0.31125828 0.24503311 0.20750552 0.76379691
##  Total  0.39072848 0.33554084 0.27373068 1.00000000

tally( ~ sex | substance, data=HELPrct, format="proportion", margins = TRUE)
##        substance
## sex        alcohol   cocaine    heroin
##  female 0.2033898 0.2697368 0.2419355
##  male   0.7966102 0.7302632 0.7580645
##  Total  1.0000000 1.0000000 1.0000000

But a user pointed out that this is a change from earlier versions and doesn't match the documentation.

luebby commented 8 years ago

On the other hand - compare Vignette LessVolume-MoreCreativity, section Numerical Summaries: Two Variables

All do the same thing.

mean(age ~ substance,data=HELPrct)
##  alcohol  cocaine   heroin 
## 38.19774 34.49342 33.44355
mean( ~ age | substance, data=HELPrct)
##  alcohol  cocaine   heroin 
## 38.19774 34.49342 33.44355

And this seems somehow inconsistent (?) with

bargraph( ~ sex | substance, type="proportion", data=HELPrct)

[Image: bar graph produced by the call above (unnamed-chunk-2-1)]

nicholasjhorton commented 8 years ago

I previously had assumed that

tally(sex ~ substance, data=HELPrct, format="proportion", margins = TRUE)

and

tally( ~ sex | substance, data=HELPrct, format="proportion", margins = TRUE)

yielded the same results.

I would have tried to generate the former behavior by:

tally(~ sex + substance, data=HELPrct, format="proportion")

rpruim commented 8 years ago

Regarding bargraph(), it might be time to refactor this anyway. It doesn't even use tally(). Rather it uses xtabs() and passes things along to barchart(). I'll have to look and see whether refactoring here makes sense and how easy it is to get barchart() to do what we want.

What is our desired output of

bargraph( ~ sex | substance, type="proportion", data=HELPrct)

rpruim commented 8 years ago

moving bargraph() part of this to its own issue (#607).

nicholasjhorton commented 8 years ago

I would think three sets of two proportions, with the two proportions summing to 1 within each of the substance groups.

(Email reply to Randall Pruim's comment of Jun 28, 2016, 10:01 PM, which asked about the desired output of bargraph().)

rpruim commented 8 years ago

I've done a bit of work on tally().

It now "promotes" formulas (like many of our other functions) so that a ~ b, ~ a | b and ~ a, groups = b all end up being the same thing internally at the point where the table is created. This means that a ~ b is now considered as "tally a conditional on b" and proportions will sum to 1 for each level of b.

Proportions are computed so they sum to one for each level of the condition.

Margins are added (when requested) for each non-conditional dimension.

For formulas of the form ~ rhs, no conditioning is done. (This avoids silly tables where all the proportions are 1, and the marginal totals get sillier from there.) In other words, the right hand side is a "condition" only if (a) there is a left hand side, and (b) there is not another condition (coming from | or groups).

This seems reasonably consistent and comprehensible.
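As a rough illustration of the rules just described, here is a minimal sketch in base R. The helper `tally_roles()` is hypothetical (it is not part of mosaic and is not how tally() is implemented); it only classifies the variables of a formula into "tallied" and "conditioning" roles according to the rules stated above, with `groups` passed as a character vector for simplicity.

```r
# Hypothetical helper, NOT mosaic's implementation: classify formula
# variables following the rules described in the comment above.
tally_roles <- function(formula, groups = NULL) {
  stopifnot(inherits(formula, "formula"))
  has_lhs <- length(formula) == 3
  lhs <- if (has_lhs) all.vars(formula[[2]]) else character(0)

  # The right hand side is the last element of the formula object.
  rhs <- formula[[length(formula)]]
  if (is.call(rhs) && identical(rhs[[1]], as.name("|"))) {
    main <- all.vars(rhs[[2]])   # left of the |
    cond <- all.vars(rhs[[3]])   # right of the |
  } else {
    main <- all.vars(rhs)
    cond <- character(0)
  }

  # groups is folded into the condition before roles are decided.
  cond <- c(cond, groups)

  if (has_lhs && length(cond) == 0) {
    # a ~ b: the right hand side acts as the condition.
    list(tallied = lhs, conditioning = main)
  } else {
    # ~ a, ~ a | b, a ~ b | c: only | and groups condition.
    list(tallied = c(lhs, main), conditioning = cond)
  }
}

str(tally_roles(sex ~ substance))
str(tally_roles(~ sex | substance))
str(tally_roles(~ sex, groups = "substance"))
str(tally_roles(~ sex + substance))
str(tally_roles(sex ~ substance | homeless))
```

Under these assumptions the first three calls report the same roles, which mirrors the "promotion" of a ~ b, ~ a | b and ~ a, groups = b to the same thing.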

Examples:

tally( ~ sex + substance, data = HELPrct, margins = TRUE, format = "proportion")
##         substance
## sex      alcohol cocaine heroin  Total
##   female  0.0795  0.0905 0.0662 0.2362
##   male    0.3113  0.2450 0.2075 0.7638
##   Total   0.3907  0.3355 0.2737 1.0000

tally( sex ~ substance, data = HELPrct, margins = TRUE, format = "proportion")
##         substance
## sex      alcohol cocaine heroin
##   female   0.203   0.270  0.242
##   male     0.797   0.730  0.758
##   Total    1.000   1.000  1.000

tally( sex ~ substance | homeless, data = HELPrct, margins = TRUE, format = "proportion")
## , , homeless = homeless
## 
##         substance
## sex      alcohol cocaine heroin  Total
##   female  0.0957  0.0670 0.0287 0.1914
##   male    0.3971  0.2153 0.1962 0.8086
##   Total   0.4928  0.2823 0.2249 1.0000
## 
## , , homeless = housed
## 
##         substance
## sex      alcohol cocaine heroin  Total
##   female  0.0656  0.1107 0.0984 0.2746
##   male    0.2377  0.2705 0.2172 0.7254
##   Total   0.3033  0.3811 0.3156 1.0000

tally( sex ~ substance + homeless, data = HELPrct, margins = TRUE, format = "proportion")
## , , homeless = homeless
## 
##         substance
## sex      alcohol cocaine heroin
##   female   0.194   0.237  0.128
##   male     0.806   0.763  0.872
##   Total    1.000   1.000  1.000
## 
## , , homeless = housed
## 
##         substance
## sex      alcohol cocaine heroin
##   female   0.216   0.290  0.312
##   male     0.784   0.710  0.688
##   Total    1.000   1.000  1.000

rpruim commented 8 years ago

The original examples above are now equivalent:

tally(sex ~ substance, data=HELPrct, format="proportion", margins = TRUE)
##         substance
## sex      alcohol cocaine heroin
##   female   0.203   0.270  0.242
##   male     0.797   0.730  0.758
##   Total    1.000   1.000  1.000
tally( ~ sex | substance, data=HELPrct, format="proportion", margins = TRUE)
##         substance
## sex      alcohol cocaine heroin
##   female   0.203   0.270  0.242
##   male     0.797   0.730  0.758
##   Total    1.000   1.000  1.000

rpruim commented 8 years ago

Also, some documentation has been added. In particular:

Details

The dplyr package also exports a tally function. If x inherits from class "tbl" or "data.frame", then dplyr's tally() is called. This makes it easier to have the two packages coexist.

Otherwise, tally() is designed as an alternative to table() and xtabs(). The primary use case is to describe a (possibly multi-dimensional) table using a formula. For a table of counts, each component of the formula becomes one of the dimensions of the cross table. For tables of proportions or percents, conditional proportions and percents are computed, conditioned on each level of all "secondary" (i.e., conditioning) variables, defined as everything other than the left hand side, if there is a left hand side to the formula; and everything except the right hand side if the left hand side of the formula is empty. Note that groups is folded into the formula prior to this determination and becomes part of the conditioning.

When marginal totals are added, they are added for all of the non-conditioning dimensions, and proportions should sum to 1 for each level of the conditioning variables. This can be useful to make it clear which conditional proportions are being computed.
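
The relationship between conditional proportions and marginal totals can be checked with base R alone. The following is a sketch, not mosaic's code: it assumes the built-in mtcars data as a stand-in for HELPrct, with am playing the role of sex and cyl the role of substance, and uses table(), prop.table() and addmargins() in place of tally().

```r
# Stand-in data: mtcars ships with R (am ~ "sex", cyl ~ "substance").
counts <- table(am = mtcars$am, cyl = mtcars$cyl)

# Joint proportions: the whole table sums to 1
# (the analogue of tally(~ sex + substance, format = "proportion")).
joint <- prop.table(counts)

# Conditional proportions: each level of cyl (each column) sums to 1
# (the analogue of tally(sex ~ substance, format = "proportion")).
conditional <- prop.table(counts, margin = 2)

# Add totals over the tallied dimension only, i.e. one extra "Sum" row.
conditional_with_total <- addmargins(conditional, margin = 1)

print(round(addmargins(joint), 3))
print(round(conditional_with_total, 3))
```

In the conditional table the "Sum" row is 1 in every column, which is what makes it visible that the proportions were computed within each level of the conditioning variable.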