TidierOrg / TidierData.jl

Tidier data transformations in Julia, modeled after the dplyr/tidyr R packages.
MIT License
86 stars 7 forks source link

Is `@group_by()` sorting rows by group? #99

Closed kdpsingh closed 4 months ago

kdpsingh commented 6 months ago

I saw that the TidierData output was sorted by group here: https://x.com/kdpsinghlab/status/1781691797086896331?s=46&t=PEX2NKcgCDoTOe3kHQFmDA

This used to be the norm but I thought we had removed the sorting to make it faster. Need to have another look.

drizk1 commented 6 months ago

I took a quick look at this, I think that it is not sorting, but rather this is due to the behavior of how it is ungrouping.

# relevant @groupby line
    # removed ;sort = true for speed reasons
    # this can cause a large number of allocations when the grouped variable is a string
    local df_output = groupby(df_copy, Cols($(grouping_exprs...)); sort = false)
df = DataFrame(id=["D", "B", "D", "A"], v=[5, 7, 4, 2])
transform(groupby(df, :id), :v => (x -> x .- mean(x)) => :v100, ungroup = false)

the output below when ungrouped is identical to TidierData.jl

GroupedDataFrame with 3 groups based on key: id
First Group (2 rows): id = "D"
 Row │ id      v      v100    
     │ String  Int64  Float64 
─────┼────────────────────────
   1 │ D           5      0.5
   2 │ D           4     -0.5
⋮
Last Group (1 row): id = "A"
 Row │ id      v      v100    
     │ String  Int64  Float64 
─────┼────────────────────────
   1 │ A           2      0.0
@chain df begin
    @group_by(id)
    @mutate(v100 = v - mean(v))
     #@ungroup
end
GroupedDataFrame with 3 groups based on key: id
First Group (2 rows): id = "D"
 Row │ id      v      v100    
     │ String  Int64  Float64 
─────┼────────────────────────
   1 │ D           5      0.5
   2 │ D           4     -0.5
⋮
Last Group (1 row): id = "A"
 Row │ id      v      v100    
     │ String  Int64  Float64 
─────┼────────────────────────
   1 │ A           2      0.0

However, when TidierData ungroups, it does so by using DataFrame(gdf), so the grouped dataframes are "stacked" back on each other rather than each row returning to their initial index, essentially ordering them, giving it the appearance of sorting. (i think)

@chain df begin
    @group_by(id)
    @mutate(v100 = v - mean(v))
    @ungroup
end
4×3 DataFrame
 Row │ id      v      v100    
     │ String  Int64  Float64 
─────┼────────────────────────
   1 │ D           5      0.5
   2 │ D           4     -0.5
   3 │ B           7      0.0
   4 │ A           2      0.0
transform(groupby(df, :id), :v => (x -> x .- mean(x)) => :v100, ungroup = true)
4×3 DataFrame
 Row │ id      v      v100    
     │ String  Int64  Float64 
─────┼────────────────────────
   1 │ D           5      0.5
   2 │ B           7      0.0
   3 │ D           4     -0.5
   4 │ A           2      0.0
Status `~/.julia/environments/v1.9/Project.toml`
  [fe2206b3] TidierData v0.15.2
kdpsingh commented 6 months ago

Perfect, thank you for looking into this.

I think this is easily fixable, and fixing this will actually fix a performance issue as well.

drizk1 commented 6 months ago

no problem. I think the solution might be as simple as changing the DataFrame(gdf) in @ungroup to transform(gdf, ungroup = true)

@chain df begin
    @group_by(id)
    @mutate(v100 = v - mean(v))
    transform(ungroup = true)
end
4×3 DataFrame
 Row │ id      v      v100    
     │ String  Int64  Float64 
─────┼────────────────────────
   1 │ D           5      0.5
   2 │ B           7      0.0
   3 │ D           4     -0.5
   4 │ A           2      0.0

Edit: I did some benchmarking here and this does not resolve the performance issue

kdpsingh commented 4 months ago

Thanks @drizk1 for catching this issue. This is now fixed in #107.

This can be further sped up with transform(ungroup = true, copycols = false), but I've opted to not use that approach because it would mean that the columns in the grouped and ungrouped data frames would be linked in memory, and altering one could alter the other.