JuliaArrays / StructArrays.jl

Efficient implementation of struct arrays in Julia
https://juliaarrays.github.io/StructArrays.jl/
Other
322 stars 41 forks source link

Large number of allocations in 0.6.8 #228

Open aplavin opened 2 years ago

aplavin commented 2 years ago

The eager automatic conversion introduced in #227 (cc @timholy @piever) leads to many allocations when this conversion isn't actually needed. For example: StructArrays@0.6.7 - 25 allocations:

julia> using StructArrays, ArraysOfArrays

julia> A = StructArray(x=VectorOfVectors{Int}())

julia> @time for _ in 1:10^6 push!(A, (x=1:10,)) end
  0.307580 seconds (25 allocations: 11.269 MiB, 7.45% gc time)

StructArrays@0.6.8 - 1 million allocations:

julia> using StructArrays, ArraysOfArrays

julia> A = StructArray(x=VectorOfVectors{Int}())

julia> @time for _ in 1:10^6 push!(A, (x=1:10,)) end
  0.809896 seconds (1.00 M allocations: 293.601 MiB, 55.02% gc time)

Of course, these allocations go away if automatic conversion (https://github.com/JuliaArrays/StructArrays.jl/blob/1581d70090de650646277e4eee4e33794e55b9ac/src/utils.jl#L197-L199) is disabled by StructArrays.maybe_convert_elt(::Type{T}, vals::Tuple) where {T} = vals.

This change in 0.6.8 is highly breaking performance-wise. Can the new automatic conversion approach be revised and made less eager somehow?

timholy commented 2 years ago

ArraysOfArrays is fundamentally broken: https://github.com/JuliaArrays/ArraysOfArrays.jl/issues/2. That can cause all sorts of trouble.

aplavin commented 2 years ago

Interesting, you mean the x[1] <: eltype(x) not holding in ArraysOfArrays? Indeed, that's unfortunate, didn't notice before.

However, this issue is independent and will remain even when the ArraysOfArrays eltype is fixed so that x[1] <: eltype(x). ArrayOfArrays can push! any iterable, such as a range, directly: it's significantly more efficient than materializing the iterable to an Array. Meanwhile, StructArrays eagerly attempts to convert the passed value to the eltype of component arrays. I believe the latter is the actual issue, StructArrays shouldn't decide for all underlying arrays that an explicit conversion is needed.

piever commented 2 years ago

I am also realizing that things are problematic for array types where the result of indexing is some custom type that has a reference to data from the original array, For example the same happens with StringArray{WeakRefString} from WeakRefStrings:

julia> using WeakRefStrings

julia> sa = StringArray{WeakRefString}(["x", "y"])
2-element StringVector{WeakRefString}:
 "x"
 "y"

julia> push!(sa, "z")
3-element StringVector{WeakRefString}:
 "x"
 "y"
 "z"

julia> push!(sa, convert(eltype(sa), "z"))
ERROR: MethodError: no method matching WeakRefString(::String)

CategoricalArrays have the same issue: you can push! things that cannot be converted to the eltype of the array.

I imagine one would need to pick up between

  1. avoid the conversion in StructArrays, so pushing e.g. 1 to a StructArray{ComplexF64} goes back to erroring
  2. drop support for all array types where you can push! values that cannot be converted to the eltype

Neither choice seems great, but I don't really see a clean way out of this. Maybe something along the lines of #229 is a reasonable compromise, but even then things are only fixed when the entry is passed as a tuple or named tuple.

aplavin commented 1 year ago

The reason is that push! does not recurrently call push! if some columns are StructArrays, but the whole thing is unrolled, for performance reasons.

Maybe at some point this unrolling wouldn't be required? Maybe even at current julia versions? :) Not sure about exact benchmark, but this can be a way forward to fixing this issue.