JuliaData / DataFrames.jl

In-memory tabular data in Julia
https://dataframes.juliadata.org/stable/

transform! on `GroupedDataFrame` reorders columns of the parent - do we want to keep this behavior? #2322

Closed bkamins closed 4 years ago

bkamins commented 4 years ago

In 0.21 we have:

julia> df = DataFrame(y = 1:4, x = ["b", "a", "b", "a"])
4×2 DataFrame
│ Row │ y     │ x      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ b      │
│ 2   │ 2     │ a      │
│ 3   │ 3     │ b      │
│ 4   │ 4     │ a      │

julia> transform!(groupby(df, :x), :y => identity => :y2)
4×3 DataFrame
│ Row │ x      │ y     │ y2    │
│     │ String │ Int64 │ Int64 │
├─────┼────────┼───────┼───────┤
│ 1   │ b      │ 1     │ 1     │
│ 2   │ a      │ 2     │ 2     │
│ 3   │ b      │ 3     │ 3     │
│ 4   │ a      │ 4     │ 4     │

This means that the grouping column is moved to the front. The question is whether we want to keep this behavior, or require that the original columns of `df` keep their order in the parent of the `GroupedDataFrame`.
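The ordering property under discussion can be stated as a prefix check on column names: an in-place transformation may append new columns, but every original column should stay at its position. The sketch below is plain Julia with no DataFrames.jl dependency. The helper name `preserves_column_order` is hypothetical and not part of any package, and the name vectors are copied from the example above.

```julia
# Hypothetical helper (not part of DataFrames.jl): returns true when every
# original column keeps its position, i.e. `before` is a prefix of `after`.
function preserves_column_order(before::AbstractVector, after::AbstractVector)
    length(after) >= length(before) || return false
    return after[1:length(before)] == before
end

# Column names taken from the example in this thread.
before    = ["y", "x"]        # columns of df before transform!
reordered = ["x", "y", "y2"]  # order shown by 0.21, grouping column first
appended  = ["y", "x", "y2"]  # order if the original columns stay in place

println(preserves_column_order(before, reordered))  # false
println(preserves_column_order(before, appended))   # true
```

With a real data frame, the two arguments would presumably be the results of `names(df)` captured before and after the `transform!` call.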

CC @pdeffebach @matthieugomez

pdeffebach commented 4 years ago

They should definitely not be moved.

We all know that it is best practice not to refer to columns by their column number, but we also know that introductory users routinely use df[:, 5] when referring to a specific column.

Reordering the columns would probably cause bugs that are hard to track down, especially because they would be encountered by less experienced users.

Plus, Stata does not reorder columns, and neither does R:

r$> df = tibble(x = runif(5), g = c(1, 1, 1, 2, 2))

r$> t = df %>%
    group_by(g) %>%
    mutate(y = x - mean(x)) %>%
    ungroup()

r$> t
# A tibble: 5 x 3
       x     g       y
   <dbl> <dbl>   <dbl>
1 0.455      1 -0.0984
2 0.505      1 -0.0485
3 0.701      1  0.147 
4 0.0432     2 -0.457 
5 0.958      2  0.457

bkamins commented 4 years ago

Thank you for the quick response. I classify this as a bug, since it is also related to, e.g.:

julia> df = DataFrame(y = 1:4, x = ["b", "a", "b", "a"])
4×2 DataFrame
│ Row │ y     │ x      │
│     │ Int64 │ String │
├─────┼───────┼────────┤
│ 1   │ 1     │ b      │
│ 2   │ 2     │ a      │
│ 3   │ 3     │ b      │
│ 4   │ 4     │ a      │

julia> select(groupby(df, :x), ungroup=false)
GroupedDataFrame with 2 groups based on key: x
First Group (2 rows): x = "b"
│ Row │ x      │
│     │ String │
├─────┼────────┤
│ 1   │ b      │
│ 2   │ b      │
⋮
Last Group (2 rows): x = "a"
│ Row │ x      │
│     │ String │
├─────┼────────┤
│ 1   │ a      │
│ 2   │ a      │

julia> select!(groupby(df, :x), ungroup=false)
GroupedDataFrame with 2 groups based on key: Error showing value of type GroupedDataFrame{DataFrame}:
ERROR: BoundsError: attempt to access 1-element Array{Symbol,1} at index [[2]]

I will submit a patch soon.

bkamins commented 4 years ago

Unfortunately, this is linked with https://github.com/JuliaData/DataFrames.jl/issues/2297, so it will take a bit longer to fix and will require a minor release.