Wondering about multiple Dependent variables

MichaelJMahometa commented 5 years ago

First, I love mosaic -- I've been transitioning to it for HS students using R.

I use it also with an undergraduate regression course. In the past I've used something like describe() from psych to get a quick look at the descriptives for multiple variables:

library(dplyr)   # provides select() and one_of()
library(psych)   # provides describe()
names(GoosePermits)
vars <- c("bid", "keep", "sell")
describe(select(GoosePermits, one_of(vars)))

But I'd really like to stick with mosaic as much as possible (and with the tidyverse and piping, if possible). Is it possible to get favstats() to produce a multiple-variable table (a summary for multiple variables at once)? Something like:

#This does NOT work:
favstats(bid + keep + sell ~ NULL, data=GoosePermits)

Any direction or advice is appreciated, Michael

rpruim commented 5 years ago

I'll have to give some thought to this. In principle, it should be possible: we just need to process the formula differently, loop over the LHS variables, and decorate the output so it is clear what's what.

It might be easier to implement in df_stats()
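
As a rough illustration of the "loop over LHS variables and decorate the output" idea, here is a minimal base R sketch. It is not the mosaicCore implementation: multi_lhs_stats() is a hypothetical helper, the column name response is just a placeholder, and the sketch assumes at least one grouping variable on the right-hand side.

```r
# Hypothetical helper, not mosaic's implementation: summarise several
# left-hand-side variables of a formula, one block of rows per variable.
multi_lhs_stats <- function(formula, data, stats = list(mean = mean, sd = sd)) {
  lhs_vars <- all.vars(formula[[2]])   # variables to summarise
  rhs_vars <- all.vars(formula[[3]])   # grouping variables (assumed non-empty)
  groups <- data[rhs_vars]
  pieces <- lapply(lhs_vars, function(v) {
    # the group keys, in the order aggregate() reports them
    keys <- aggregate(data[[v]], by = groups, FUN = length)[rhs_vars]
    # one column per statistic, computed within each group
    cols <- lapply(stats, function(f) aggregate(data[[v]], by = groups, FUN = f)$x)
    # "decorate" each block with the name of the variable it describes
    data.frame(response = v, keys, as.data.frame(cols))
  })
  do.call(rbind, pieces)
}

res <- multi_lhs_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)
print(res)
```

Stacking the per-variable blocks keeps the result a single data frame, which is part of why this seems more natural for df_stats() than for favstats().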

rpruim commented 5 years ago

Proof of concept:

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)
##       _target_    Species min    Q1 median    Q3 max  mean        sd  n missing
## 1 Sepal.Length     setosa 4.3 4.800    5.0 5.200 5.8 5.006 0.3524897 50       0
## 2 Sepal.Length versicolor 4.9 5.600    5.9 6.300 7.0 5.936 0.5161711 50       0
## 3 Sepal.Length  virginica 4.9 6.225    6.5 6.900 7.9 6.588 0.6358796 50       0
## 4  Sepal.Width     setosa 2.3 3.200    3.4 3.675 4.4 3.428 0.3790644 50       0
## 5  Sepal.Width versicolor 2.0 2.525    2.8 3.000 3.4 2.770 0.3137983 50       0
## 6  Sepal.Width  virginica 2.2 2.800    3.0 3.175 3.8 2.974 0.3224966 50       0

rpruim commented 5 years ago

To do list

Update documentation and examples.
Decide on name for column indicating LHS (currently _response_, see below).
Decide whether that column should be included even if only one LHS variable is processed.
Decide how to deal with custom statistics and naming. (Long names, the current default, don't work well here.)

Some options for the last item:

change default from long_names = TRUE to long_names = FALSE. (If we keep _response_ even when there is only one response, this doesn't really lose any information.)
Introduce long_names = "default" and handle the two cases differently.

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris, mean, sd, long_names = FALSE)
##     _response_    Species  mean        sd
## 1 Sepal.Length     setosa 5.006 0.3524897
## 2 Sepal.Length versicolor 5.936 0.5161711
## 3 Sepal.Length  virginica 6.588 0.6358796
## 4  Sepal.Width     setosa 3.428 0.3790644
## 5  Sepal.Width versicolor 2.770 0.3137983
## 6  Sepal.Width  virginica 2.974 0.3224966

rpruim commented 5 years ago

Regarding doing this for favstats()...

The code there is not as clean, so it would be harder to implement.
The output format is not a data frame, so it is less clear where to record the response variable.

I'm inclined to do this for df_stats() only at this point.

rpruim commented 5 years ago

@nicholasjhorton : Any thoughts about naming? We want to avoid using a name that might be among the names of the variables in the data set. Using underscore makes things harder to use downstream, however.

Perhaps we could use response as long as response is not in the names of the data and use _response_ otherwise.
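
The fallback rule proposed here could look something like the following sketch. response_col_name() is a hypothetical helper written only for illustration; it is not part of mosaic or mosaicCore.

```r
# Hypothetical helper: pick the name of the column that records the response.
# Prefer "response"; fall back to "_response_" only if the data already
# has a variable called "response".
response_col_name <- function(data, preferred = "response", backup = "_response_") {
  if (preferred %in% names(data)) backup else preferred
}

response_col_name(iris)                                 # no collision
response_col_name(data.frame(response = 1:3, x = 4:6))  # collision
```

The trade-off is that the output column name then depends on the input data, which downstream code would have to allow for.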

rpruim commented 5 years ago

When processing multiple response expressions, long_names will be set to FALSE. We could additionally make that the default for a single response, but that would result in a change in behavior for old code.

Some examples:

df_stats(Sepal.Width ~ Species, data = iris, mean, sd)
##      response    Species mean_Sepal.Width sd_Sepal.Width
## 1 Sepal.Width     setosa            3.428      0.3790644
## 2 Sepal.Width versicolor            2.770      0.3137983
## 3 Sepal.Width  virginica            2.974      0.3224966

df_stats(Sepal.Width ~ Species, data = iris, mean, sd, long_names = FALSE)
##      response    Species  mean        sd
## 1 Sepal.Width     setosa 3.428 0.3790644
## 2 Sepal.Width versicolor 2.770 0.3137983
## 3 Sepal.Width  virginica 2.974 0.3224966

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris, mean, sd)
##       response    Species  mean        sd
## 1 Sepal.Length     setosa 5.006 0.3524897
## 2 Sepal.Length versicolor 5.936 0.5161711
## 3 Sepal.Length  virginica 6.588 0.6358796
## 4  Sepal.Width     setosa 3.428 0.3790644
## 5  Sepal.Width versicolor 2.770 0.3137983
## 6  Sepal.Width  virginica 2.974 0.3224966

# long_names = TRUE is ignored in this situation
df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris, mean, sd, long_names = TRUE)
##       response    Species  mean        sd
## 1 Sepal.Length     setosa 5.006 0.3524897
## 2 Sepal.Length versicolor 5.936 0.5161711
## 3 Sepal.Length  virginica 6.588 0.6358796
## 4  Sepal.Width     setosa 3.428 0.3790644
## 5  Sepal.Width versicolor 2.770 0.3137983
## 6  Sepal.Width  virginica 2.974 0.3224966
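
The naming rule these examples illustrate could be sketched roughly as follows. stat_col_names() is a hypothetical helper, not a mosaicCore function, and it assumes that long names are built as <stat>_<response>, as in the mean_Sepal.Width column above.

```r
# Hypothetical helper: decide the names of the summary columns.
# Long names are only used when a single response is processed;
# with several responses, long_names is ignored.
stat_col_names <- function(stat_names, responses, long_names = TRUE) {
  if (long_names && length(responses) == 1) {
    paste(stat_names, responses, sep = "_")
  } else {
    stat_names
  }
}

stat_col_names(c("mean", "sd"), "Sepal.Width")
stat_col_names(c("mean", "sd"), "Sepal.Width", long_names = FALSE)
stat_col_names(c("mean", "sd"), c("Sepal.Length", "Sepal.Width"))
```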

rpruim commented 5 years ago

Updated to do list

[x] Documentation and examples
[ ] Should long_names = FALSE be the default? [Current code has TRUE as default]
[ ] Should response / _response_ be included in output when long_names = TRUE and there is only one response? [Current code includes.]

rpruim commented 5 years ago

Additional item: Need to consider what df_stats( ~ a + b, data = ... ) should do. Currently it is equivalent to a ~ b, but we could make it equivalent to a + b ~ 1.

rpruim commented 5 years ago

Here's POC for the change:

df_stats(~ Sepal.Length + Sepal.Width, data = iris)
##       response min  Q1 median  Q3 max     mean        sd   n missing
## 1 Sepal.Length 4.3 5.1    5.8 6.4 7.9 5.843333 0.8280661 150       0
## 2  Sepal.Width 2.0 2.8    3.0 3.3 4.4 3.057333 0.4358663 150       0

df_stats(~ Sepal.Length + Sepal.Width | Species, data = iris)
##       response    Species min    Q1 median    Q3 max  mean        sd  n missing
## 1 Sepal.Length     setosa 4.3 4.800    5.0 5.200 5.8 5.006 0.3524897 50       0
## 2 Sepal.Length versicolor 4.9 5.600    5.9 6.300 7.0 5.936 0.5161711 50       0
## 3 Sepal.Length  virginica 4.9 6.225    6.5 6.900 7.9 6.588 0.6358796 50       0
## 4  Sepal.Width     setosa 2.3 3.200    3.4 3.675 4.4 3.428 0.3790644 50       0
## 5  Sepal.Width versicolor 2.0 2.525    2.8 3.000 3.4 2.770 0.3137983 50       0
## 6  Sepal.Width  virginica 2.2 2.800    3.0 3.175 3.8 2.974 0.3224966 50       0

MichaelJMahometa commented 5 years ago

Would

df_stats(~ Sepal.Length + Sepal.Width | Species, data = iris)

have the equivalent:

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)

And, would

df_stats(~ Sepal.Length + Sepal.Width, data = iris)

have the equivalent:

df_stats(Sepal.Length + Sepal.Width ~ NULL, data = iris)

(thinking of equivalency with favstats() and mean() concepts in mosaic)

rpruim commented 5 years ago

Yes. Basically ~ rhs | cond gets converted into rhs ~ 1 | cond.

df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)
##       response    Species min    Q1 median    Q3 max  mean        sd  n missing
## 1 Sepal.Length     setosa 4.3 4.800    5.0 5.200 5.8 5.006 0.3524897 50       0
## 2 Sepal.Length versicolor 4.9 5.600    5.9 6.300 7.0 5.936 0.5161711 50       0
## 3 Sepal.Length  virginica 4.9 6.225    6.5 6.900 7.9 6.588 0.6358796 50       0
## 4  Sepal.Width     setosa 2.3 3.200    3.4 3.675 4.4 3.428 0.3790644 50       0
## 5  Sepal.Width versicolor 2.0 2.525    2.8 3.000 3.4 2.770 0.3137983 50       0
## 6  Sepal.Width  virginica 2.2 2.800    3.0 3.175 3.8 2.974 0.3224966 50       0

I'll need to do a bit more testing to make sure I didn't break anything, but this seems to be working as I intended.
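
For readers curious about the conversion of ~ rhs | cond into rhs ~ 1 | cond, here is a minimal base R sketch of that kind of formula rewriting. one_sided_to_response() is a hypothetical name, and this is only an illustration of the idea, not the code in mosaicCore.

```r
# Hypothetical helper: rewrite `~ rhs | cond` as `rhs ~ 1 | cond`
# and `~ rhs` as `rhs ~ 1`. Two-sided formulas are returned unchanged.
one_sided_to_response <- function(formula) {
  if (length(formula) == 3) return(formula)   # already has a left-hand side
  rhs <- formula[[2]]
  has_cond <- is.call(rhs) && identical(rhs[[1]], as.name("|"))
  new_call <- if (has_cond) {
    # rhs is the call `|`(variables, condition)
    bquote(.(rhs[[2]]) ~ 1 | .(rhs[[3]]))
  } else {
    bquote(.(rhs) ~ 1)
  }
  # evaluating the `~` call builds a formula in the original environment
  eval(new_call, environment(formula))
}

f1 <- one_sided_to_response(~ Sepal.Length + Sepal.Width | Species)
f2 <- one_sided_to_response(~ Sepal.Length + Sepal.Width)
print(f1)
print(f2)
```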

rpruim commented 5 years ago

@MichaelJMahometa, If you want to try it out:

devtools::install_github("ProjectMOSAIC/mosaicCore", ref = "beta")

dtkaplan commented 5 years ago

I'd recommend against starting a name with underscore since, as you know, it requires back-ticks in many settings. Also, I'm against having the names of the output columns (as opposed to their values) differ depending on the names of variables in the input data frame. I don't think there's any real need, since "response" will be duplicated in the output only if the user creates such a name in the ... of the call to df_stats().

Why "response" and not "variable" or "name" or "variable_name"?

Do you want to allow a formula like . ~ Species to handle all of the variables?

rpruim commented 5 years ago

naming the response variable column

I'm not sure what the best name is. variable is perhaps too generic (Species is also a variable in the example above.) I'd like something that makes it clear that this is the thing the mean/sd/etc are computed OF. Do we have a word for that? I chose response because it sits in the "response slot" of the formula if you are thinking about models. But this is easy to change if we come up with something we like better.

I just modified the "backup name" to be response_var_. That avoids needing to escape and is unlikely to collide with things. (But as you say, response is also not likely to collide with names in the output data frame, so this is just some extra caution and not likely to be a behavior many users see.)

long vs short names for summaries

Sounds like your vote is for long_names = FALSE. Especially if there is a column containing the response variable name, I think I'm happy with that (even though it will be a change from previous versions).

expanding .

I thought about handling . on the left side but I haven't decided if we should.

Currently y ~ . works because model.frame() takes care of the expansion for us, and . ~ x does not work, just as in model.frame().

One wrinkle if we allow . ~ x is that if . expands to include both quantitative and categorical variables, the summaries will likely not be meaningful for some of the variables.
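
One possible way around this wrinkle is to expand . only to the numeric variables that are not already on the right-hand side. The sketch below is an assumption about how that could be done, not a description of what mosaicCore does; expand_lhs_dot() is a hypothetical helper.

```r
# Hypothetical helper: expand `.` on the left of `. ~ x` to the numeric
# columns of `data` that do not appear on the right-hand side.
expand_lhs_dot <- function(formula, data) {
  if (!identical(formula[[2]], as.name("."))) return(formula)
  rhs_vars <- all.vars(formula[[3]])
  candidates <- setdiff(names(data), rhs_vars)
  numeric_vars <- candidates[vapply(data[candidates], is.numeric, logical(1))]
  if (length(numeric_vars) == 0) stop("no numeric variables to summarise")
  # build the call  v1 + v2 + ...  from the variable names
  lhs <- Reduce(function(a, b) call("+", a, b), lapply(numeric_vars, as.name))
  formula[[2]] <- lhs
  formula
}

expanded <- expand_lhs_dot(. ~ Species, iris)
print(expanded)
```

Dropping categorical variables silently is a design choice that some users might find surprising, so a warning listing the skipped variables could be worth considering.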

rpruim commented 5 years ago

Since it occurred to both of us, I decided to try implementing support for . ~ rhs. This can be abused with less than desirable results, but I guess there are legitimate use cases.

Example:

df_stats(. ~ Species, data = iris, mean, sd)

##        response    Species  mean        sd
## 1  Sepal.Length     setosa 5.006 0.3524897
## 2  Sepal.Length versicolor 5.936 0.5161711
## 3  Sepal.Length  virginica 6.588 0.6358796
## 4   Sepal.Width     setosa 3.428 0.3790644
## 5   Sepal.Width versicolor 2.770 0.3137983
## 6   Sepal.Width  virginica 2.974 0.3224966
## 7  Petal.Length     setosa 1.462 0.1736640
## 8  Petal.Length versicolor 4.260 0.4699110
## 9  Petal.Length  virginica 5.552 0.5518947
## 10  Petal.Width     setosa 0.246 0.1053856
## 11  Petal.Width versicolor 1.326 0.1977527
## 12  Petal.Width  virginica 2.026 0.2746501

nicholasjhorton commented 5 years ago

I really like the . addition: nicely done!

On Sep 28, 2019, at 9:45 PM, Randall Pruim notifications@github.com wrote:

Since it occurred to both of us, I decided to try implementing support for . ~ rhs. This can be abused with less than desirable results, but I guess there are legitimate use cases.

nicholasjhorton commented 5 years ago

I like this proposal.

On Sep 28, 2019, at 8:56 PM, Randall Pruim notifications@github.com wrote:

I just modified the "backup name" to be response_var_.

rpruim commented 4 years ago

Looks like this got left on a development branch and didn't get merged into master. I guess I should fix that ;-)

rpruim commented 4 years ago

Looks like I need to fix some tests that are written assuming the old behavior.

rpruim commented 4 years ago

Tests adjusted (in mosaicCore) to match new behavior.

ProjectMOSAIC / mosaic