JuliaData / SplitApplyCombine.jl

Split-apply-combine strategies for Julia

Allow `group` to take in an `AbstractVector` of groups? #30

Open pdeffebach opened 3 years ago

pdeffebach commented 3 years ago

Something like

g = [1, 1, 2, 2]
x = [5, 6, 7, 8]
group(g, x)

andyferris commented 3 years ago

Yes I think this is a good idea, though we need to be careful that dispatch works out.

I also thought we might have had something like this? (Perhaps it’s the internal function).

pdeffebach commented 3 years ago

I don't feel that strongly about it. It was just a surprising omission, because without this there is no exact equivalent of R's tapply.

andyferris commented 3 years ago

Hi @pdeffebach,

I finally got some time at the computer and see we already have this behavior:

julia> g = [1, 1, 2, 2]
4-element Array{Int64,1}:
 1
 1
 2
 2

julia> x = [5, 6, 7, 8]
4-element Array{Int64,1}:
 5
 6
 7
 8

julia> group(g, x)
2-element Dictionaries.Dictionary{Int64,Array{Int64,1}}
 1 │ [5, 6]
 2 │ [7, 8]

Is this what you were expecting?

andyferris commented 3 years ago

Regarding R's tapply: if you want to apply fun to each group, you can do fun.(group(g, x)). Sometimes fun.(groupview(g, x)) might be faster or less memory hungry, and there is always groupreduce, like groupreduce(+, g, x).

pdeffebach commented 3 years ago

Thanks for this.

One final question: is there a version of this for transform? I.e. "spread"-ing the result across a vector the same length as the inputs?

I've been doing data cleaning at the REPL, and not having to write out a full groupby... transform call in DataFrames would be nice.

andyferris commented 3 years ago

I'm not sure what you are seeking. Is it this?

julia> g = [1, 1, 2, 2]
4-element Array{Int64,1}:
 1
 1
 2
 2

julia> x = [5, 6, 7, 8]
4-element Array{Int64,1}:
 5
 6
 7
 8

julia> groups = group(g, x)
2-element Dictionaries.Dictionary{Int64,Array{Int64,1}}
 1 │ [5, 6]
 2 │ [7, 8]

julia> map(x -> groups[x], g)
4-element Array{Array{Int64,1},1}:
 [5, 6]
 [5, 6]
 [7, 8]
 [7, 8]
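
If the goal is one summary value per input row (a transform-style "spread"), the same lookup can be applied after mapping a function over the groups. The following is only a sketch: `spreadby` is a hypothetical helper name, not part of SplitApplyCombine, and it assumes `f` returns a single value per group.

```julia
using SplitApplyCombine, Statistics

# Hypothetical helper (not in SplitApplyCombine): apply `f` to each group,
# then look the result up once per element of `g`, so the output has the
# same length as the inputs.
function spreadby(f, g::AbstractVector, x::AbstractVector)
    results = map(f, group(g, x))   # dictionary: group key => f(values in that group)
    return [results[k] for k in g]  # one entry per input row
end

spreadby(mean, [1, 1, 2, 2], [5, 6, 7, 8])  # == [5.5, 5.5, 7.5, 7.5]
```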

pdeffebach commented 3 years ago

Sorry for forgetting about this thread. I think the infrastructure has almost everything I want, but I would like this to be in one function (the package is called SplitApplyCombine, after all).

julia> using Statistics, SplitApplyCombine;

julia> function applyby(f, g::AbstractVector, x::AbstractVector)
           groups = group(g, x)
           map(f, groups)
       end
applyby (generic function with 1 method)

julia> applyby(mean, [1, 1, 2, 2], [5, 6, 7, 8])
2-element Dictionaries.Dictionary{Int64, Float64}
 1 │ 5.5
 2 │ 7.5

This would be nice to have. For reference, my motivation is supporting grouped operations inside DataFramesMeta's @with, where all columns are just vectors, so we can't take advantage of any DataFrames machinery.

An added bonus on the above would be to allow multiple arguments, i.e. applyby(f, g, args...). I'm not sure how that would work, but it could be feasible.
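
One possible way to support multiple arguments is to group the row positions instead of the values, and then slice every argument with each group's positions. This is only a sketch under that assumption: this multi-argument `applyby` is hypothetical and not part of SplitApplyCombine, and it assumes all vectors have the same length and use the same indices as `g`.

```julia
using SplitApplyCombine

# Hypothetical multi-argument variant (not in SplitApplyCombine).
function applyby(f, g::AbstractVector, args::AbstractVector...)
    # Group the row positions by key, e.g. 1 => [1, 2], 2 => [3, 4]
    positions = group(g, collect(eachindex(g)))
    # For each group, slice every argument at those positions and call `f`
    map(inds -> f((a[inds] for a in args)...), positions)
end

# Weighted sum per group: group 1 gives 5*1 + 6*1 = 11, group 2 gives 7*2 + 8*2 = 30
applyby((v, w) -> sum(v .* w), [1, 1, 2, 2], [5, 6, 7, 8], [1, 1, 2, 2])
```

Slicing with `a[inds]` copies each group's values, so this favors simplicity over memory use.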

aplavin commented 3 years ago

As a general principle, it seems better to have a few general functions that compose easily (group + map in your example) than a larger number of specialized functions (applyby). I think this case would have (almost) zero overhead if you use groupview instead of group. Maybe I'm missing something, but

map(mean, group([1, 1, 2, 2], [5, 6, 7, 8]))

already looks very short, intuitive and clear, once one knows what map and group do.