JuliaData / SplitApplyCombine.jl

Split-apply-combine strategies for Julia
Other
144 stars 15 forks source link

generalize groupby and mapmany types #54

Closed aplavin closed 1 year ago

aplavin commented 2 years ago

Use similar() to create proper array types instead of plain Vectors.

Also, a (minor?) fix to group().

aplavin commented 1 year ago

Many types of arrays exist that support similar but not push! or append!.

Are there actually types that return non-pushable arrays from similar? For example, both Base ranges and StaticArrays SVectors turn into plain Vectors:

julia> similar(1:5, 1)
1-element Vector{Int64}:
 0

julia> similar(SVector(1, 2), 1)
1-element Vector{Int64}:
 4

Is there some problem we are trying to solve with similar?

An example is shown in tests: with this PR, group and mapmany on StructArrays return StructArrays. Before, the result was a plain Vector.

Is it that some array types in the wild don't support emptymutable when they should?

emptymutable doesn't seem documented at all, and barely any package define methods for it (according to juliahub). So, emptymutable returns a plain Vector for basically all array types. For me, it looks like some internal function for Base, unlike similar.

andyferris commented 1 year ago

For example, both Base ranges and StaticArrays SVectors turn into plain Vectors

There's mitigating facts for those. Ranges don't support setindex! and cannot be the output of similar (or emptymutable). For SVector we can't tell if the size is statically-known integer or a runtime integer, and we don't want to pessimize generic code that uses similar but has never thought about StaticArrays by assuming the size is statically known (and therefore making the runtime case way slower than it needs to be). Of course a static array could never support emptymutable.

with this PR, group and mapmany on StructArrays return StructArrays

See https://github.com/JuliaArrays/StructArrays.jl/issues/238

emptymutable doesn't seem documented at all, and barely any package define methods for it (according to juliahub). So, emptymutable returns a plain Vector for basically all array types. For me, it looks like some internal function for Base, unlike similar.

I think this is a fair observation from the outside but I can't see that this was the intention. I raised https://github.com/JuliaLang/julia/issues/46230 - perhaps we can look at improving the situation.

aplavin commented 1 year ago

Still, are there any actual types whose similar output cannot be pushed to?

emptymutable would be more useful if it also allowed something else, something that cannot be just as easily done with similar. A pain point currently is that there's no way to create a compatible empty array from a type: both similar and emptymutable take values, not types. Due to lack of such a function,

Do you think something like emptymutable(T) is possible to achieve together with making it public API?

andyferris commented 1 year ago

Still, are there any actual types whose similar output cannot be pushed to?

Off the top of my head, DistributedArrays.jl DArray has this property. I've created more than one like this myself. The easiest example to imagine is a C-style array - no double indirection like C++ std::vector or Julia's Vector to allow for resizing. (I'm hoping one day we'll have simpler Buffer in core and we can build a non-resizable array with no indirection out of that.)

A pain point currently is that there's no way to create a compatible empty array from a type

Yeah I agree with all of that. I'd like to evolve these. I took a stab at this problem in Dictionaries.jl - look for similar_type and empty_type. Feedback and contributions welcome. If the concept has merit, it would be good to look at improving Base.