Open pdeffebach opened 3 years ago
Yes I think this is a good idea, though we need to be careful that dispatch works out.
I also thought we might have had something like this? (Perhaps it’s the internal function).
I don't feel that strongly about it. It was just a surprising omission, because without this there is no exact equivalent to a `tapply` call from R.
Hi @pdeffebach,
I finally got some time at the computer and see we already have this behavior:
julia> g = [1, 1, 2, 2]
4-element Array{Int64,1}:
1
1
2
2
julia> x = [5, 6, 7, 8]
4-element Array{Int64,1}:
5
6
7
8
julia> group(g, x)
2-element Dictionaries.Dictionary{Int64,Array{Int64,1}}
1 │ [5, 6]
2 │ [7, 8]
Is this what you were expecting?
Regarding R's `tapply`: if you want to apply `fun` to each group you can do `fun.(group(g, x))` (or sometimes `fun.(groupview(g, x))` might be faster/less memory hungry, and there is always `groupreduce`, like `groupreduce(+, g, x)`).
Thanks for this.
One final question: is there a version of this for `transform`? I.e., "spread"-ing the result across a vector the same length as the inputs?
I've been doing data cleaning at the REPL, and not having to write out a full `groupby... transform` call in DataFrames would be nice.
I'm not sure what you are seeking? Is it this?
julia> g = [1, 1, 2, 2]
4-element Array{Int64,1}:
1
1
2
2
julia> x = [5, 6, 7, 8]
4-element Array{Int64,1}:
5
6
7
8
julia> groups = group(g, x)
2-element Dictionaries.Dictionary{Int64,Array{Int64,1}}
1 │ [5, 6]
2 │ [7, 8]
julia> map(x -> groups[x], g)
4-element Array{Array{Int64,1},1}:
[5, 6]
[5, 6]
[7, 8]
[7, 8]
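If the goal is a `transform`-style result, where one aggregated value per group is spread back over a vector the same length as the inputs, the same lookup pattern can be applied after aggregating. This is only a sketch using `group`, `map`, and dictionary indexing as shown in this thread; the names `group_means` and `spread` are illustrative and not part of any package.

```julia
using Statistics, SplitApplyCombine

g = [1, 1, 2, 2]
x = [5, 6, 7, 8]

# Aggregate once per group: a dictionary from group key to group mean.
group_means = map(mean, group(g, x))

# Look up each row's group mean, giving one value per input element.
spread = map(k -> group_means[k], g)
```

Here `spread` has the same length as `g` and `x`, with each element holding the mean of its own group.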
Sorry for forgetting about this thread. I think the infrastructure has almost what I want, but I would like this to be in one function (the package is called SplitApplyCombine, after all).
julia> using Statistics, SplitApplyCombine;
julia> function applyby(f, g::AbstractVector, x::AbstractVector)
           groups = group(g, x)
           map(f, groups)
       end
applyby (generic function with 1 method)
julia> applyby(mean, [1, 1, 2, 2], [5, 6, 7, 8])
2-element Dictionaries.Dictionary{Int64, Float64}
1 │ 5.5
2 │ 7.5
This would be nice to have. For reference, my motivation is for supporting grouped operations inside DataFramesMeta's `@with`, where all columns are just the vectors, so we can't take advantage of any DataFrames machinery.
An added bonus on the above would be to allow multiple arguments, i.e. `applyby(f, g, args...)`. Not sure how that would work, but it could be feasible.
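One way the multi-argument form might work is to group row indices rather than values, then slice every argument by those indices before calling `f`. This is a rough sketch, not an existing SplitApplyCombine function: `applyby` is the hypothetical name from this thread, and the code assumes every argument vector shares the same indices as `g`.

```julia
using Statistics, SplitApplyCombine

# Hypothetical helper (not part of SplitApplyCombine).
# Assumes all vectors in `args` have the same indices as `g`.
function applyby(f, g::AbstractVector, args::AbstractVector...)
    # Group the row indices by key, e.g. 1 => [1, 2], 2 => [3, 4].
    inds = group(g, collect(eachindex(g)))
    # For each group, slice every argument and pass the slices to f.
    map(i -> f((a[i] for a in args)...), inds)
end

g = [1, 1, 2, 2]
x = [5, 6, 7, 8]
w = [1, 1, 2, 2]

single = applyby(mean, g, x)                       # one value vector
double = applyby((a, b) -> sum(a .* b), g, x, w)   # two value vectors
```

Indexing with a vector of indices copies each slice; swapping in views would be a possible refinement if allocation matters.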
As a general principle, it seems better to have fewer general functions that compose easily (`group` + `map` in your example) than a larger number of specialized functions (`applyby`). I think this case would have (almost) zero overhead if you use `groupview` instead of `group`.
Maybe I'm missing something, but
map(mean, group([1, 1, 2, 2], [5, 6, 7, 8]))
already looks very short, intuitive and clear - when one knows what `map` and `group` do.
Something like