Open MichaelJMahometa opened 5 years ago
I'll have to give some thought to this. In principle, it should be possible to do, we just need to process the formula differently, loop over LHS variables, and decorate the output so it is clear what's what.
It might be easier to implement in df_stats().
Proof of concept:
df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)
## _target_ Species min Q1 median Q3 max mean sd n missing
## 1 Sepal.Length setosa 4.3 4.800 5.0 5.200 5.8 5.006 0.3524897 50 0
## 2 Sepal.Length versicolor 4.9 5.600 5.9 6.300 7.0 5.936 0.5161711 50 0
## 3 Sepal.Length virginica 4.9 6.225 6.5 6.900 7.9 6.588 0.6358796 50 0
## 4 Sepal.Width setosa 2.3 3.200 3.4 3.675 4.4 3.428 0.3790644 50 0
## 5 Sepal.Width versicolor 2.0 2.525 2.8 3.000 3.4 2.770 0.3137983 50 0
## 6 Sepal.Width virginica 2.2 2.800 3.0 3.175 3.8 2.974 0.3224966 50 0
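For anyone following along, here is a rough base-R sketch of the idea (loop over the LHS variables, summarize each by group, and add a column saying which variable each row describes). It is not the mosaicCore implementation; multi_response_stats() is a hypothetical helper, and it assumes a two-sided formula with a single grouping variable on the right.

```r
# Hypothetical helper, NOT part of mosaicCore: summarize several
# left-hand-side variables by one grouping variable using base R only.
multi_response_stats <- function(formula, data,
                                 stats = list(mean = mean, sd = sd)) {
  stopifnot(inherits(formula, "formula"), length(formula) == 3)
  lhs_vars <- all.vars(formula[[2]])  # e.g. c("Sepal.Length", "Sepal.Width")
  rhs_var  <- all.vars(formula[[3]])  # assumed: exactly one grouping variable
  stopifnot(length(rhs_var) == 1)

  pieces <- lapply(lhs_vars, function(v) {
    groups <- split(data[[v]], data[[rhs_var]])
    res <- data.frame(response = v, group = names(groups),
                      stringsAsFactors = FALSE)
    names(res)[2] <- rhs_var
    # One output column per summary function
    for (s in names(stats)) {
      res[[s]] <- vapply(groups, stats[[s]], numeric(1), USE.NAMES = FALSE)
    }
    res
  })
  # Stack the per-variable summaries into one long data frame
  do.call(rbind, pieces)
}

multi_response_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)
```

The long layout (one row per variable and group) is what makes the extra labelling column necessary.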
To do list
_response_, see below).
Some options for the last item:
long_names = TRUE to long_names = FALSE. (If we keep response even when there is only one response, this doesn't really lose any information.)
long_names = "default" and handle the two cases differently.
df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris, mean, sd, long_names = FALSE)
## _response_ Species mean sd
## 1 Sepal.Length setosa 5.006 0.3524897
## 2 Sepal.Length versicolor 5.936 0.5161711
## 3 Sepal.Length virginica 6.588 0.6358796
## 4 Sepal.Width setosa 3.428 0.3790644
## 5 Sepal.Width versicolor 2.770 0.3137983
## 6 Sepal.Width virginica 2.974 0.3224966
Regarding doing this for favstats() ... I'm inclined to do this for df_stats() only at this point.
@nicholasjhorton : Any thoughts about naming? We want to avoid using a name that might be among the names of the variables in the data set. Using underscore makes things harder to use downstream, however.
Perhaps we could use response as long as response is not in the names of the data and use _response_ otherwise.
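A minimal sketch of that naming rule, just to make the proposal concrete. response_column_name() is a hypothetical helper, not something in mosaicCore, and the two candidate names are the ones suggested above.

```r
# Hypothetical helper: pick the name of the column that labels the response.
# Uses the preferred name unless the data already has a column with that name.
response_column_name <- function(data,
                                 preferred = "response",
                                 backup = "_response_") {
  if (preferred %in% names(data)) backup else preferred
}

response_column_name(iris)                                  # no collision
response_column_name(data.frame(response = 1:3, x = 4:6))   # collision
```

Note that under this rule the output column name depends on the input data, which is a trade-off worth weighing.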
When processing multiple response expressions, long_names will be set to FALSE. We could additionally make that the default for a single response, but that would result in a change in behavior for old code.
Some examples:
df_stats(Sepal.Width ~ Species, data = iris, mean, sd)
## response Species mean_Sepal.Width sd_Sepal.Width
## 1 Sepal.Width setosa 3.428 0.3790644
## 2 Sepal.Width versicolor 2.770 0.3137983
## 3 Sepal.Width virginica 2.974 0.3224966
df_stats(Sepal.Width ~ Species, data = iris, mean, sd, long_names = FALSE)
## response Species mean sd
## 1 Sepal.Width setosa 3.428 0.3790644
## 2 Sepal.Width versicolor 2.770 0.3137983
## 3 Sepal.Width virginica 2.974 0.3224966
df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris, mean, sd)
## response Species mean sd
## 1 Sepal.Length setosa 5.006 0.3524897
## 2 Sepal.Length versicolor 5.936 0.5161711
## 3 Sepal.Length virginica 6.588 0.6358796
## 4 Sepal.Width setosa 3.428 0.3790644
## 5 Sepal.Width versicolor 2.770 0.3137983
## 6 Sepal.Width virginica 2.974 0.3224966
# long_names = TRUE is ignored in this situation
df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris, mean, sd, long_names = TRUE)
## response Species mean sd
## 1 Sepal.Length setosa 5.006 0.3524897
## 2 Sepal.Length versicolor 5.936 0.5161711
## 3 Sepal.Length virginica 6.588 0.6358796
## 4 Sepal.Width setosa 3.428 0.3790644
## 5 Sepal.Width versicolor 2.770 0.3137983
## 6 Sepal.Width virginica 2.974 0.3224966
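The column naming in the examples above can be summarized with a small sketch. stat_column_names() is a hypothetical helper written only to illustrate the rule (long names are used only for a single response with long_names = TRUE); it is not how df_stats() is actually coded.

```r
# Hypothetical helper: decide the names of the summary columns.
# Long names (stat_variable) only make sense when there is a single response;
# with several responses the response column already carries that information.
stat_column_names <- function(stat_names, responses, long_names = TRUE) {
  if (long_names && length(responses) == 1) {
    paste(stat_names, responses, sep = "_")
  } else {
    stat_names
  }
}

stat_column_names(c("mean", "sd"), "Sepal.Width")
stat_column_names(c("mean", "sd"), "Sepal.Width", long_names = FALSE)
stat_column_names(c("mean", "sd"), c("Sepal.Length", "Sepal.Width"))
```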
Updated to do list
Should long_names = FALSE be the default? [Current code has TRUE as default]
Should response / _response_ be included in output when long_names = TRUE and there is only one response? [Current code includes it.]
Additional item: Need to consider what df_stats( ~ a + b, data = ... ) should do. Currently it is equivalent to a ~ b, but we could make it equivalent to a + b ~ 1.
Here's a POC for the change:
df_stats(~ Sepal.Length + Sepal.Width, data = iris)
## response min Q1 median Q3 max mean sd n missing
## 1 Sepal.Length 4.3 5.1 5.8 6.4 7.9 5.843333 0.8280661 150 0
## 2 Sepal.Width 2.0 2.8 3.0 3.3 4.4 3.057333 0.4358663 150 0
df_stats(~ Sepal.Length + Sepal.Width | Species, data = iris)
## response Species min Q1 median Q3 max mean sd n missing
## 1 Sepal.Length setosa 4.3 4.800 5.0 5.200 5.8 5.006 0.3524897 50 0
## 2 Sepal.Length versicolor 4.9 5.600 5.9 6.300 7.0 5.936 0.5161711 50 0
## 3 Sepal.Length virginica 4.9 6.225 6.5 6.900 7.9 6.588 0.6358796 50 0
## 4 Sepal.Width setosa 2.3 3.200 3.4 3.675 4.4 3.428 0.3790644 50 0
## 5 Sepal.Width versicolor 2.0 2.525 2.8 3.000 3.4 2.770 0.3137983 50 0
## 6 Sepal.Width virginica 2.2 2.800 3.0 3.175 3.8 2.974 0.3224966 50 0
Would
df_stats(~ Sepal.Length + Sepal.Width | Species, data = iris)
be equivalent to:
df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)
And would
df_stats(~ Sepal.Length + Sepal.Width, data = iris)
be equivalent to:
df_stats(Sepal.Length + Sepal.Width ~ NULL, data = iris)
(thinking of equivalency with favstats() and mean() concepts in mosaic)
Yes. Basically ~ rhs | cond gets converted into rhs ~ 1 | cond.
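To illustrate the conversion described above, here is a sketch of how a one-sided formula could be rewritten using base R formula manipulation. to_two_sided() is a hypothetical name and this is only an approximation of the idea, not the code in mosaicCore.

```r
# Hypothetical helper: turn  ~ rhs | cond  into  rhs ~ 1 | cond
# (and  ~ rhs  into  rhs ~ 1). Two-sided formulas are returned unchanged.
to_two_sided <- function(f) {
  stopifnot(inherits(f, "formula"))
  if (length(f) == 3) return(f)

  rhs <- f[[2]]  # for a one-sided formula, element 2 is the right-hand side
  has_cond <- is.call(rhs) && identical(rhs[[1]], as.name("|"))

  out <- y ~ x                                   # template formula
  out[[2]] <- if (has_cond) rhs[[2]] else rhs    # new left-hand side
  out[[3]] <- if (has_cond) call("|", 1, rhs[[3]]) else 1
  environment(out) <- environment(f)             # keep the caller's environment
  out
}

to_two_sided(~ Sepal.Length + Sepal.Width | Species)
to_two_sided(~ Sepal.Length + Sepal.Width)
```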
df_stats(Sepal.Length + Sepal.Width ~ Species, data = iris)
## response Species min Q1 median Q3 max mean sd n missing
## 1 Sepal.Length setosa 4.3 4.800 5.0 5.200 5.8 5.006 0.3524897 50 0
## 2 Sepal.Length versicolor 4.9 5.600 5.9 6.300 7.0 5.936 0.5161711 50 0
## 3 Sepal.Length virginica 4.9 6.225 6.5 6.900 7.9 6.588 0.6358796 50 0
## 4 Sepal.Width setosa 2.3 3.200 3.4 3.675 4.4 3.428 0.3790644 50 0
## 5 Sepal.Width versicolor 2.0 2.525 2.8 3.000 3.4 2.770 0.3137983 50 0
## 6 Sepal.Width virginica 2.2 2.800 3.0 3.175 3.8 2.974 0.3224966 50 0
I'll need to do a bit more testing to make sure I didn't break anything, but this seems to be working as I intended.
@MichaelJMahometa, if you want to try it out:
devtools::install_github("ProjectMOSAIC/mosaicCore", ref = "beta")
I'd recommend against starting a name with underscore since, as you know, it requires back-ticks in many settings. Also, I'm against having the names of the output columns (as opposed to their values) differ depending on the names of variables in the input data frame. I don't think there's any real need, since "response" will be duplicated in the output only if the user creates such a name in the ... of the call to df_stats().
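A quick base-R illustration of the back-tick point, using a made-up two-row data frame (the values are placeholders, not output from df_stats()):

```r
# A name that starts with an underscore is not syntactic in R,
# so it has to be quoted with back-ticks wherever a bare name is expected.
is_syntactic <- function(x) make.names(x) == x
is_syntactic(c("response", "_response_"))

out <- data.frame(Species = c("setosa", "versicolor"))
out[["_response_"]] <- "Sepal.Width"   # fine with [[ ]] and a string
out[["response"]]   <- "Sepal.Width"

out$response        # works as a bare name
out$`_response_`    # needs back-ticks; out$_response_ would not parse
```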
Why "response" and not "variable" or "name" or "variable_name"?
Do you want to allow a formula like . ~ Species to handle all of the variables?
I'm not sure what the best name is. variable is perhaps too generic (Species is also a variable in the example above.) I'd like something that makes it clear that this is the thing the mean/sd/etc are computed OF. Do we have a word for that? I chose response because it sits in the "response slot" of the formula if you are thinking about models. But this is easy to change if we come up with something we like better.
I just modified the "backup name" to be response_var_. That avoids needing to escape and is unlikely to collide with things. (But as you say, response is also not likely to collide with names in the output data frame, so this is just some extra caution and not likely to be a behavior many users see.)
Sounds like your vote is for long_names = FALSE. Especially if there is a column containing the response variable name, I think I'm happy with that (even though it will be a change from previous versions).
I thought about handling . on the left side but I haven't decided if we should.
Currently y ~ . works because model.frame() takes care of the expansion for us, and . ~ x does not -- just as in model.frame().
One wrinkle if we allow . ~ x is that if . expands to include both quantitative and categorical variables, the summaries will likely not be meaningful for some of the variables.
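One possible way around that wrinkle, sketched in base R: expand . on the left only to numeric columns that do not appear on the right. This is an assumption about how it could be handled, not a description of what mosaicCore does, and expand_lhs_dot() is a hypothetical helper.

```r
# Hypothetical helper: expand a left-hand-side dot to the numeric columns of
# `data` that are not already used on the right-hand side of the formula.
expand_lhs_dot <- function(formula, data) {
  stopifnot(inherits(formula, "formula"), length(formula) == 3,
            identical(formula[[2]], as.name(".")))
  rhs_vars   <- all.vars(formula[[3]])
  candidates <- setdiff(names(data), rhs_vars)
  # Assumption: only quantitative variables get numeric summaries
  is_num <- vapply(data[candidates], is.numeric, logical(1))
  candidates[is_num]
}

expand_lhs_dot(. ~ Species, data = iris)
```

Silently dropping non-numeric columns is a design choice; warning about the skipped variables would be a reasonable alternative.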
Since it occurred to both of us, I decided to try implementing support for . ~ rhs. This can be abused with less than desirable results, but I guess there are legitimate use cases.
Example:
df_stats(. ~ Species, data = iris, mean, sd)
## response Species mean sd
## 1 Sepal.Length setosa 5.006 0.3524897
## 2 Sepal.Length versicolor 5.936 0.5161711
## 3 Sepal.Length virginica 6.588 0.6358796
## 4 Sepal.Width setosa 3.428 0.3790644
## 5 Sepal.Width versicolor 2.770 0.3137983
## 6 Sepal.Width virginica 2.974 0.3224966
## 7 Petal.Length setosa 1.462 0.1736640
## 8 Petal.Length versicolor 4.260 0.4699110
## 9 Petal.Length virginica 5.552 0.5518947
## 10 Petal.Width setosa 0.246 0.1053856
## 11 Petal.Width versicolor 1.326 0.1977527
## 12 Petal.Width virginica 2.026 0.2746501
I really like the . addition: nicely done!
On Sep 28, 2019, at 9:45 PM, Randall Pruim notifications@github.com wrote:
Since it occurred to both of us, I decided to try implementing support for . ~ rhs. This can be abused with less than desirable results, but I guess there are legitimate use cases.
I like this proposal.
On Sep 28, 2019, at 8:56 PM, Randall Pruim notifications@github.com wrote:
I just modified the "backup name" to be response_var_.
Looks like this got left on a development branch and didn't get merged into master. I guess I should fix that ;-)
Looks like I need to fix some tests that are written assuming the old behavior.
Tests adjusted (in mosaicCore) to match new behavior.
First, I love mosaic -- I've been transitioning to it for HS students using R.
I also use it with an undergraduate regression course. In the past I've used something like describe() from psych to get a quick look at the descriptives for multiple variables:
But I'd really like to keep to mosaic as much as possible (and the tidyverse run out with piping if possible). Is it possible to get favstats() to produce a multiple variable table (summary for multiple variables at once)? Something like:
Any direction or advice is appreciated, Michael