Redesign of `stack` and `unstack`

bkamins commented 1 year ago

This issue is meant to replace: https://github.com/JuliaData/DataFrames.jl/issues/2215 https://github.com/JuliaData/DataFrames.jl/issues/2148 https://github.com/JuliaData/DataFrames.jl/issues/3066 https://github.com/JuliaData/DataFrames.jl/issues/2422 https://github.com/JuliaData/DataFrames.jl/issues/2414 https://github.com/JuliaData/DataFrames.jl/issues/1839

The proposed improved API for stack is:

stack(df::AbstractDataFrame,
    measure_vars,
    id_vars;
    variable_name=:variable,
    name_value=identity,
    value_name=:value, # or function
    variable_eltype::Type=String,
    fill=missing,
    view::Bool=false)

Questions to discuss:

should we rename measure_vars and id_vars arguments (I was thinking of something better but could not come with anything better);
variable_name - no change here
name_value - I propose to allow passing a function that takes measure_vars column names as strings and produces values in the variable_name column where the name of the measure_vars will be stored (I propose to have a single column still although e.g. dplyr allows multiple - column splitting can be performed as a later step - but maybe you will find it useful to allow for splitting in stack?; also the question is what name would be best here)
value_name - I propose to allow passing a function that takes measure_vars column names as strings and produces name of the column where the name of the values will be stored
fill if variable_name/value_name combination is missing what value use to fill data
do we need to provide an option to specify sorting order of produced data frame? (I now assume measure_vars columns are processed left to right)
view - true will be disallowed if variable_name or value_name is a function
if variable_name/value_name combination produces duplicate I assume we throw an error (but maybe we want some other behavior also?)

Example. Input df:

2×5 DataFrame
│ Row │ ID    │ varA2018 │ varA2019 │ varB2018 │ varB2019 │
│     │ Int64 │ Int64    │ Int64    │ Int64    │ Int64    │
├─────┼───────┼──────────┼──────────┼──────────┼──────────┤
│ 1   │ 1     │ 1        │ 1        │ 1        │ 1        │
│ 2   │ 2     │ 2        │ 2        │ 2        │ 2        │

Output of stack(df, r"var", :ID, variable_name= :Year, name_value=x -> parse(Int, last(x, 4)), value_name=x -> first(x, 4)):

4×4 DataFrame
│ Row │ ID    │ Year  │ varA  │ varB  │
│     │ Int64 │ Int64 │ Int64 │ Int64 │
├─────┼───────┼───────┼───────┼───────┤
│ 1   │ 1     │ 2018  │ 1     │ 1     │
│ 2   │ 1     │ 2019  │ 1     │ 1     │
│ 3   │ 2     │ 2018  │ 2     │ 2     │
│ 4   │ 2     │ 2019  │ 2     │ 2     │

(so the general idea is to allow for dynamic generation of variable_name and value_name based on measure_vars column names)

The proposed improved API for unstack is:

unstack(df::AbstractDataFrame,
      row_keys,
      col_keys,
      values;
      renamecols::Function=(x...) -> join(x, "_"),
      allowmissing::Bool=false,
      combine=values isa ColumnIndex ? only : (x...) -> only.(x),
      fill=missing,
      threads::Bool=true)

Questions to discuss:

rename rowkeys to row_keys
rename colkey to col_keys; start allowing passing multiple columns as col_keys
rename value to values; start allowing passing multiple columns as values
renamecols: takes as many positional arguments as there are col_keys columns; by default joins them with _
combine: takes as many positional arguments as there are values columns; by default if a single column is passed only, and if multiple a tuple of captured values is produced (but anything can be computed and returned here). Internally it will be values => combine transformation in operation specification syntax. Note that I propose, as opposed to, e.g. dplyr, that we do not create multiple columns here, but instead a value in a cell if a function of multiple columns (by default a tuple of matching values)

Example. Input df:

8×5 DataFrame
 Row │ id     n1      n2      v1     v2
     │ Int64  String  String  Int64  Char
─────┼────────────────────────────────────
   1 │     1  a       x           1  a
   2 │     1  a       y           2  b
   3 │     1  b       x           3  c
   4 │     1  b       y           4  d
   5 │     2  a       x           5  e
   6 │     2  a       y           6  f
   7 │     2  b       x           7  g
   8 │     2  b       y           8  h

Output of unstack(df, :id, r"n", r"v") (with default renamescols and combine):

2×5 DataFrame
 Row │ id     a_x       a_y       b_x       b_y
     │ Int64  Tuple…    Tuple…    Tuple…    Tuple…
─────┼───────────────────────────────────────────────
   1 │     1  (1, 'a')  (2, 'b')  (3, 'c')  (4, 'd')
   2 │     2  (5, 'e')  (6, 'f')  (7, 'g')  (8, 'h')

Output of unstack(df, :id, r"n", r"v", renamecols=string, combine=(x,y) -> string(x[1], y[1])):

2×5 DataFrame
 Row │ id     ax      ay      bx      by
     │ Int64  String  String  String  String
─────┼───────────────────────────────────────
   1 │     1  1a      2b      3c      4d
   2 │     2  5e      6f      7g      8h

bkamins commented 1 year ago

CC @nalimilan @pdeffebach @jar @jkrumbiegel

jonas-schulze commented 1 year ago

In your stack example,

stack(df, r"var", :ID, variable_name= x -> last(x, 4), value_name=x -> first(x, 4))
# note: first(x, 4)

how did the Year column get its name?

bkamins commented 1 year ago

Ah - right. These are problems, when one designs before implementing. It then should be something like:

stack(df, r"var", :ID, variable_name= :Year, name_value=x -> last(x, 4), value_name=x -> first(x, 3))

I will update the post. Thank you for spotting.

MatthewRGonzalez commented 1 year ago

In the example,

stack(df, r"var", :ID, variable_name= :Year, name_value=x -> last(x, 4), value_name=x -> first(x, 3))

why do you use var_name = x -> first(x,3)) and not var_name = x -> first(x,4)) to produce column names varA and varB.

bkamins commented 1 year ago

Again - typo. Fixed. It should be first(x, 4). I was writing the expression from my head (not tested). I want to first get a general agreement that what I propose is OK and sufficient because the implementation will heavily depend on the design.

The most important decisions affecting the design are:

for stack: are we OK to have only one :variable_name column (does anyone need multiple variable name columns in practice?)
for unstack: are we OK that we always store the result of multiple values columns in one cell (i.e. not creating multiple columns). The benefit of this is that we can combine them with a function (as in the example); the downside is that if someone wants multiple columns in the end then and additional select operation is needed later that will unnest the produced column (which seems easy, but maybe we feel that it is crucial to provide such functionality in unstack)

MatthewRGonzalez commented 1 year ago

In my opinion:

Are we OK to have only one :variable_name column (does anyone need multiple variable name columns in practice?)

Yes-- I would prefer to keep it simple. It's easy enough to split columns later on.

Are we OK that we always store the result of multiple values columns in one cell (i.e. not creating multiple columns). The benefit of this is that we can combine them with a function (as in the example);

I'd prefer this as well. I think the ability to combine the values with a function is useful.

nalimilan commented 1 year ago

Sounds good. I don't have an opinion about supporting multiple columns. Do we have examples where it's useful in dplyr?

name_value - I propose to allow passing a function that takes measure_vars column names as strings and produces values in the variable_name column where the name of the measure_vars will be stored (I propose to have a single column still although e.g. dplyr allows multiple - column splitting can be performed as a later step - but maybe you will find it useful to allow for splitting in stack?; also the question is what name would be best here)

Regarding the argument name, having both value_name and name_value seems confusing to me. Maybe something like variable_name_transform or passing a tuple to the existing argument, like variable_name=(col, fun)?

jariji commented 1 year ago

I encourage interested people to look at tidyr's pivot_longer for the design and naming for inspiration.

bkamins commented 1 year ago

@jariji - I know pivot_longer. Given your comment I understand you feel that pivot_longer has a better design than the proposal above? (except names - where I agree that as usual we need a careful decision)

If this is the case can you comment on the advantages of pivot_longer design from your perspective? Thank you!

jariji commented 1 year ago

I'm still reviewing the above and I'm not sure what's better at this point, just wanted to make sure pivot_* was in the discussion.

jariji commented 1 year ago

Looking at the example above, you have name_value=x -> last(x, 4) which I expect to produce String values but then the generated Year column has eltype Int64. Is that a typo or intentional?

bkamins commented 1 year ago

Is that a typo or intentional?

It was a typo (I was just sketching the intention). I updated the example with parse call.

just wanted to make sure pivot_* was in the discussion.

💯 agreed. The dplyr underwent a huge redesign. I was thinking for 3 months what to propose (and I am still not sure what is best - especially naming of arguments). This is reflected in the comment by @nalimilan - we want something flexible and composable, but at the same time to avoid complexities that users will never use.

For example privot_longer and pivot_wider were designed to be reversible i.e. so that you can always call pivot_longer and pivot_wider to go to the starting point. And I understand this desire as it is indeed clean.

However, I thought that this lead to a very complex design (if you look at the documentation there are many cases and complex rules). I tried to propose something that is simpler but still covers all standard needs.

For example:

I did not propose column splitting in stack (and pivot_longer supports it) as it is I think rarely needed, and can be easily be done later by the user.
I proposed in unstack to always create a single column even for many values also (as opposed to creating multiple columns as pivot_wider does) the reason is:
- column splitting can be done later if needed
- it allows us in combine to perform computations on multiple columns (which can be handy at times)
- it introduced in dplyr the complexity of argument saying what should change faster (values of column names) + it is likely to produce quite wide tables, which in my experience are hard to grasp.

However, I am open for suggestions as indeed these are hard decisions.

pdeffebach commented 3 months ago

I ran into the issue of not being able to have multiple value columns in unstack today. I think this would be a great feature to have.

JuliaData / DataFrames.jl

Redesign of `stack` and `unstack` #3237