Open bkamins opened 1 year ago
CC @nalimilan @pdeffebach @jar @jkrumbiegel
In your stack example,
stack(df, r"var", :ID, variable_name= x -> last(x, 4), value_name=x -> first(x, 4))
# note: first(x, 4)
how did the Year column get its name?
Ah, right. These are the problems that come up when one designs before implementing. It should then be something like:
stack(df, r"var", :ID, variable_name= :Year, name_value=x -> last(x, 4), value_name=x -> first(x, 3))
I will update the post. Thank you for spotting.
In the example,
stack(df, r"var", :ID, variable_name= :Year, name_value=x -> last(x, 4), value_name=x -> first(x, 3))
why do you use var_name = x -> first(x,3) and not var_name = x -> first(x,4) to produce column names varA and varB.
Again - typo. Fixed. It should be first(x, 4). I was writing the expression from my head (not tested). I want to first get a general agreement that what I propose is OK and sufficient because the implementation will heavily depend on the design.
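To make the first(x, 3) versus first(x, 4) point concrete, here is a minimal sketch in plain Julia. The measure column names (varA2019 and so on) are an assumption made only for illustration; the thread's actual input table is not shown. It only applies the two candidate functions to the names and does not call stack.

```julia
# Hypothetical measure column names, assumed only for illustration.
measure_cols = ["varA2019", "varA2020", "varB2019", "varB2020"]

# first(x, 3) keeps only the shared prefix, so varA and varB collapse into one name.
names3 = unique(first.(measure_cols, 3))

# first(x, 4) keeps the distinguishing letter, giving one name per variable.
names4 = unique(first.(measure_cols, 4))

println(names3)
println(names4)
```

With three characters every measure column maps to the same value column name, which is why four characters are needed to get separate varA and varB columns.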
The most important decisions affecting the design are:
- stack: are we OK to have only one :variable_name column (does anyone need multiple variable name columns in practice?)
- unstack: are we OK that we always store the result of multiple values columns in one cell (i.e. not creating multiple columns)? The benefit of this is that we can combine them with a function (as in the example); the downside is that if someone wants multiple columns in the end then an additional select operation is needed later that will unnest the produced column (which seems easy, but maybe we feel that it is crucial to provide such functionality in unstack)
In my opinion:
Are we OK to have only one :variable_name column (does anyone need multiple variable name columns in practice?)
Yes-- I would prefer to keep it simple. It's easy enough to split columns later on.
Are we OK that we always store the result of multiple values columns in one cell (i.e. not creating multiple columns). The benefit of this is that we can combine them with a function (as in the example);
I'd prefer this as well. I think the ability to combine the values with a function is useful.
Sounds good. I don't have an opinion about supporting multiple columns. Do we have examples where it's useful in dplyr?
name_value - I propose to allow passing a function that takes measure_vars column names as strings and produces values in the variable_name column where the name of the measure_vars will be stored (I propose to have a single column still although e.g. dplyr allows multiple - column splitting can be performed as a later step - but maybe you will find it useful to allow for splitting in stack?; also the question is what name would be best here)
Regarding the argument name, having both value_name and name_value seems confusing to me. Maybe something like variable_name_transform or passing a tuple to the existing argument, like variable_name=(col, fun)?
I encourage interested people to look at tidyr's pivot_longer for the design and naming for inspiration.
@jariji - I know pivot_longer.
Given your comment I understand you feel that pivot_longer has a better design than the proposal above? (except names - where I agree that as usual we need a careful decision)
If this is the case can you comment on the advantages of the pivot_longer design from your perspective? Thank you!
I'm still reviewing the above and I'm not sure what's better at this point, just wanted to make sure pivot_* was in the discussion.
Looking at the example above, you have name_value=x -> last(x, 4) which I expect to produce String values but then the generated Year column has eltype Int64. Is that a typo or intentional?
Is that a typo or intentional?
It was a typo (I was just sketching the intention). I updated the example with a parse call.
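A small sketch of the difference the parse call makes, again in plain Julia and with hypothetical column names (varA2019 and so on) that are assumed for illustration only. It evaluates the two name_value candidates on the names directly rather than through stack.

```julia
# Hypothetical measure column names, assumed only for illustration.
measure_cols = ["varA2019", "varA2020", "varB2019", "varB2020"]

# Without parse, the last four characters stay strings.
name_value_str = x -> last(x, 4)

# With parse, the same characters become integers.
name_value_int = x -> parse(Int, last(x, 4))

years_str = name_value_str.(measure_cols)
years_int = name_value_int.(measure_cols)

println(eltype(years_str))
println(eltype(years_int))
```

The first variant yields a vector of strings and the second a vector of integers, which matches the integer Year column in the example.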
just wanted to make sure pivot_* was in the discussion.
💯 agreed. dplyr underwent a huge redesign. I was thinking for 3 months about what to propose (and I am still not sure what is best - especially naming of arguments). This is reflected in the comment by @nalimilan - we want something flexible and composable, but at the same time to avoid complexities that users will never use.
For example pivot_longer and pivot_wider were designed to be reversible, i.e. so that you can always call pivot_longer and pivot_wider to go back to the starting point. And I understand this desire as it is indeed clean.
However, I thought that this led to a very complex design (if you look at the documentation there are many cases and complex rules). I tried to propose something that is simpler but still covers all standard needs.
For example:
- stack (and pivot_longer supports it) as it is I think rarely needed, and can easily be done later by the user.
- unstack to always create a single column even for many values also (as opposed to creating multiple columns as pivot_wider does) the reason is:
However, I am open to suggestions as indeed these are hard decisions.
I ran into the issue of not being able to have multiple value columns in unstack today. I think this would be a great feature to have.
This issue is meant to replace:
https://github.com/JuliaData/DataFrames.jl/issues/2215
https://github.com/JuliaData/DataFrames.jl/issues/2148
https://github.com/JuliaData/DataFrames.jl/issues/3066
https://github.com/JuliaData/DataFrames.jl/issues/2422
https://github.com/JuliaData/DataFrames.jl/issues/2414
https://github.com/JuliaData/DataFrames.jl/issues/1839
The proposed improved API for stack is:
Questions to discuss:
- measure_vars and id_vars arguments (I was thinking of something better but could not come up with anything better);
- variable_name - no change here
- name_value - I propose to allow passing a function that takes measure_vars column names as strings and produces values in the variable_name column where the name of the measure_vars will be stored (I propose to have a single column still although e.g. dplyr allows multiple - column splitting can be performed as a later step - but maybe you will find it useful to allow for splitting in stack?; also the question is what name would be best here)
- value_name - I propose to allow passing a function that takes measure_vars column names as strings and produces the name of the column where the values will be stored
- fill - if a variable_name/value_name combination is missing, what value to use to fill the data
- measure_vars columns are processed left to right
- view - true will be disallowed if variable_name or value_name is a function
- if a variable_name/value_name combination produces a duplicate, I assume we throw an error (but maybe we want some other behavior also?)
Example. Input df:
Output of stack(df, r"var", :ID, variable_name= :Year, name_value=x -> parse(Int, last(x, 4)), value_name=x -> first(x, 4)):
(so the general idea is to allow for dynamic generation of variable_name and value_name based on measure_vars column names)
The proposed improved API for unstack is:
Questions to discuss:
- rowkeys to row_keys
- colkey to col_keys; start allowing passing multiple columns as col_keys
- value to values; start allowing passing multiple columns as values
- renamecols: takes as many positional arguments as there are col_keys columns; by default joins them with _
- combine: takes as many positional arguments as there are values columns; by default, if a single column is passed, the only function is applied, and if multiple columns are passed, a tuple of captured values is produced (but anything can be computed and returned here). Internally it will be a values => combine transformation in operation specification syntax. Note that I propose, as opposed to e.g. dplyr, that we do not create multiple columns here; instead the value in a cell is a function of multiple columns (by default a tuple of matching values)
Example. Input df:
Output of unstack(df, :id, r"n", r"v") (with default renamecols and combine):
Output of unstack(df, :id, r"n", r"v", renamecols=string, combine=(x,y) -> string(x[1], y[1])):
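As a rough illustration of the renamecols and combine behaviour described in the proposal (this is not the output of the calls above), the defaults can be mimicked with plain Julia functions. The names default_renamecols and default_combine are hypothetical stand-ins and are not part of DataFrames.jl; the sample values are made up.

```julia
# Hypothetical stand-ins for the proposed defaults; neither is DataFrames.jl API.

# One positional argument per col_keys column, joined with "_".
default_renamecols(parts...) = join(parts, "_")

# A single values column: take the only matching value.
default_combine(col) = only(col)

# Several values columns: a tuple of the matching values.
default_combine(cols...) = map(only, cols)

# The custom functions from the second example call.
custom_renamecols = string
custom_combine = (x, y) -> string(x[1], y[1])

println(default_renamecols("a", 1))
println(default_combine([10]))
println(default_combine([10], ["p"]))
println(custom_renamecols("a", 1))
println(custom_combine([10], ["p"]))
```

Each argument to a combine function here is the vector of values captured for one cell, which is why only and x[1] are used to pull out the single matching value.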