JuliaData / SplitApplyCombine.jl

Split-apply-combine strategies for Julia

Customize the Dict type for `group` #18

Open yurivish opened 5 years ago

yurivish commented 5 years ago

Hello, I'm using this great package to do basic data operations, and just found myself wanting to get back an OrderedDict from DataStructures.jl, since the order in which values appear in the iterable is significant.

I see there's some type promotion optimization going on in the group code, and I wonder if it would be possible to support passing in or otherwise specifying the output type in a way that preserves good type information.
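
One way to picture the request is a grouping function that takes the empty output container as an argument, so the caller picks the dictionary type and the element types stay concrete. The sketch below is only an illustration: `group_into!` is a hypothetical name, not part of SplitApplyCombine.jl, and it is demonstrated with a Base `Dict`. Since it relies only on `get!` and `push!`, an `OrderedDict` from DataStructures.jl should also work as the container, though that is an assumption not exercised here.

```julia
# Hypothetical helper, not part of SplitApplyCombine.jl:
# the caller supplies the (empty) output dictionary, which fixes
# both the container type and the key/value types.
function group_into!(out::AbstractDict{K,Vector{V}}, by, iter) where {K,V}
    for x in iter
        # get! creates an empty group the first time a key is seen
        push!(get!(() -> V[], out, by(x)), x)
    end
    return out
end

words = ["apple", "avocado", "banana", "blueberry", "cherry"]
groups = group_into!(Dict{Char,Vector{String}}(), first, words)
```

Within each group the elements keep the order in which they appeared in the input; the order of the keys is whatever the supplied container provides.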

Thanks for your work on this package!

andyferris commented 5 years ago

Thank you :)

Yes, this is an important problem. I've thought of only two things so far:

Note that we have the same general problem with functions like map, which infer an output type only from the input types and nothing else. (The solution offered by Base is obviously the first one above).

ssfrr commented 4 years ago

It's not clear to me why the output of group would be a dictionary, rather than a Vector{Tuple} or Vector{NamedTuple}. I get that the dictionary will be faster for random access of groups, but it seems like in the split-apply-combine workflow you end up iterating through all the groups anyways (caveat: I'm not very familiar with SAC so could be mistaken).

For instance, here's a data pipeline I just wrote (related to what I was trying to do in #22). In the 2nd line, the first thing I do after the group operation is to convert it into a Vector{Tuple}. The idea is to get something I can easily plot, so it needs to be sorted by the group key. I'm also using the @df macro from StatsPlots so I can refer to "columns" of my data. (ir_analysis is a Vector{NamedTuple} from a previous analysis step). My convention here is to use d for a whole dataset and r for each row.

rt60s = group(r->r.freq, r->r.rt60, ir_analysis) |>
    d -> map(tuple, collect(keys(d)), d) |>
    d -> sort(d; by=first) |>
    d -> map(d) do r
        n = count(!ismissing, r[2])
        m = n == 0 ? missing : mean(skipmissing(r[2]))
        (freq=r[1], count=n, mean=m)
    end

@df rt60s plot(:freq, :mean)

Maybe it's just my unfamiliarity with Dictionaries.jl preventing me from seeing the right way to do this - for me it's easier to work with familiar data types.

andyferris commented 4 years ago

Sorry for not responding to this earlier.

> It's not clear to me why the output of group would be a dictionary, rather than a Vector{Tuple} or Vector{NamedTuple}. I get that the dictionary will be faster for random access of groups, but it seems like in the split-apply-combine workflow you end up iterating through all the groups anyways (caveat: I'm not very familiar with SAC so could be mistaken).

As an intermediate step to create the groups, you need to create a dictionary to efficiently push the elements into the correct group. (The alternative is to first sort the elements by the grouping function and then use Iterators.partition, but a thin wrapper over a sorted collection is a dictionary of groups.) I recently moved group to return a Dictionaries.AbstractDictionary precisely because it iterates the same way as a vector and is the intermediate value. Returning another data structure is more work, so my logic was that the user could be responsible for that.

As to your example, there are several enhancements I want that should make your life easier.

  1. A sort-based group algorithm that returns a dictionary sorted by key.
  2. An AbstractDictionary-based table (instead of an AbstractVector-based table) that can wrap any such AbstractDictionary (or columns of AbstractDictionarys). The dictionary keys would be a "primary key" column and the dictionary values the other column(s).
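
To make the first item concrete, here is a minimal sketch of what a sort-based grouping could look like, using only Base. It is not the package's implementation: `sorted_groups` is a hypothetical name, and it returns a plain vector of `key => group` pairs rather than a dictionary. The elements are sorted by key with a stable sort, and each run of equal keys becomes one group.

```julia
# Hypothetical helper, not part of SplitApplyCombine.jl.
# Sort by the grouping key, then collect each run of equal keys.
function sorted_groups(by, iter)
    # MergeSort is stable, so elements keep their input order within a group
    items = sort!(collect(iter); by=by, alg=MergeSort)
    out = Pair{Any,Vector{eltype(items)}}[]
    for x in items
        k = by(x)
        if !isempty(out) && isequal(last(out).first, k)
            push!(last(out).second, x)        # same key as the previous element
        else
            push!(out, k => eltype(items)[x]) # a new key starts a new group
        end
    end
    return out
end

result = sorted_groups(x -> x % 3, 1:7)
```

Because the groups come out already ordered by key, a result of this shape could be plotted or tabulated without a separate sort step.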

For now, ignoring sorting by :freq, we can do something like:

groups = groupview(r.freq, r.rt60);
freq = keys(groups)
count = (length ∘ skipmissing).(groups);
mean = (mean ∘ skipmissing).(groups);

plot(collect(freq), collect(mean))

However later I hope we can get these pre-sorted and in a table-like structure. :)

andyferris commented 4 years ago

count = (length ∘ skipmissing).(groups)

Apparently that doesn't work (yet). I got carried away: https://github.com/JuliaLang/julia/pull/35946, https://github.com/JuliaLang/julia/pull/35947
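
Since `length ∘ skipmissing` is reported above as not working, one possible workaround is the `count(!ismissing, ...)` idiom already used in the earlier pipeline. The sketch below assumes the groups are held in a plain `Dict` with made-up data, purely for illustration; the names `counts` and `means` are chosen to avoid shadowing the `count` and `mean` functions.

```julia
using Statistics  # for mean

# Made-up example data: frequency => measurements, some of them missing
groups = Dict(100 => [0.5, missing, 0.7], 200 => [missing, missing])

# Number of non-missing values in each group
counts = Dict(k => count(!ismissing, v) for (k, v) in groups)

# Mean of the non-missing values, or missing when a group has none
means = Dict(k => (counts[k] == 0 ? missing : mean(skipmissing(v)))
             for (k, v) in groups)
```

The guard on `counts[k] == 0` mirrors the earlier pipeline and avoids taking the mean of an empty collection.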