apache / arrow-julia

Official Julia implementation of Apache Arrow
https://arrow.apache.org/julia/
Other
283 stars 60 forks source link

Deserialization as Vector{SubArray} breaks `push!` on DataFrame #506

Open maleadt opened 1 month ago

maleadt commented 1 month ago

I'm using Arrow v2.7.2 with DataFrames v1.6.1 on Julia 1.10, and am running into an issue that seems to stem from Arrow.jl deserializing my Vector{Vector{T}} columns as Vector{SubArray{...}}:

julia> using Arrow, DataFrames

julia> df = DataFrame(foo=Vector{Int}[]);

julia> push!(df, [[1,2,3]])
1×1 DataFrame
 Row │ foo
     │ Array…
─────┼───────────
   1 │ [1, 2, 3]

julia> Arrow.write("/tmp/test.arrow", df)
"/tmp/test.arrow"

julia> df2 = copy(DataFrame(Arrow.Table("/tmp/test.arrow")));

julia> typeof(df2.foo)
Vector{SubArray{Int64, 1, Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}} (alias for Array{SubArray{Int64, 1, Arrow.Primitive{Int64, Array{Int64, 1}}, Tuple{UnitRange{Int64}}, true}, 1})

This breaks certain push!es on the dataframe, which I haven't been able to reproduce in isolation, but which looks as follows:

MethodError: Cannot `convert` an object of type Vector{Int64} to an object of type SubArray{Int64, 1, Arrow.Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}

Stacktrace:
  [1] push!(a::Vector{SubArray{Int64, 1, Arrow.Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}}, item::Vector{Int64})
    @ Base ./array.jl:1118
  [2] _row_inserter!(df::DataFrame, loc::Int64, row::Tuple{String, Vector{Int64}, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, String, Bool, Bool, Bool, Vector{Int64}, Vector{Int64}, Vector{Int64}, String, String, Float64}, mode::Val{:push}, promote::Bool)
    @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/dataframe/insertion.jl:663
  [3] push!(df::DataFrame, row::Tuple{String, Vector{Int64}, Int64, Int64, Int64, Int64, Int64, Int64, Int64, Int64, String, Bool, Bool, Bool, Vector{Int64}, Vector{Int64}, Vector{Int64}, String, String, Float64})
    @ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/dataframe/insertion.jl:457

It's possible I'm doing something wrong; first time Arrow.jl user here.

ericphanson commented 1 month ago

The workaround is to ask DataFrames to copy the columns:

DataFrame(Arrow.Table("/tmp/test.arrow")); copycols=true)

The reason for the current behavior is:

(not saying it is ideal, just how/why we got here)

Moelf commented 1 month ago

From perspective of Arrow, a Vector{Vector{}} is stored as a content vector and an offset vector, similar to how https://github.com/JuliaArrays/ArraysOfArrays.jl works.

Now, if it actually used that, the push!() would have worked just fine, but instead Arrow.jl is doing something on its own.

Btw, if you're interested in a fully systematic way of dealing with Arrow-like schema, https://github.com/JuliaHEP/AwkwardArray.jl is something we're prototyping.

ericphanson commented 1 month ago

Now, if it actually used that, the push!() would have worked just fine, but instead Arrow.jl is doing something on its own.

I don't think that's really accurate, the issue isn't the layout-in-memory, it's that Arrow.Table's columns are deliberately immutable, since they are static view into the underlying bytes that back the table.

Moelf commented 1 month ago

When there's compression involved it won't be purely Mmaped. In general I agree, I'm saying if the resultant table uses that it would have worked. But likely out of the gate it's immutable however we implement it

ericphanson commented 1 month ago

Right, I'm not saying it's always mmap'd, that was an example, but I'm saying Arrow.Table always has immutable columns in the current design of this package

maleadt commented 1 month ago

Thanks for the quick comments!

The workaround is to ask DataFrames to copy the columns:

DataFrame(Arrow.Table("/tmp/test.arrow")); copycols=true)

Hmm, I don't see any effect of that here:

julia> typeof(df.foo)
Vector{Vector{Int64}} (alias for Array{Array{Int64, 1}, 1})

julia> Arrow.write("/tmp/test.arrow", df);
julia> df2 = DataFrame(Arrow.Table("/tmp/test.arrow"); copycols=true);

julia> typeof(df2.foo)
Vector{SubArray{Int64, 1, Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}} (alias for Array{SubArray{Int64, 1, Arrow.Primitive{Int64, Array{Int64, 1}}, Tuple{UnitRange{Int64}}, true}, 1})

The snippet you posted is a little ambiguous, but additionally calling copy or DataFrame with copycols=true (which seems like the default for copy anyway) doesn't help either:

julia> df2 = DataFrame(DataFrame(Arrow.Table("/tmp/test.arrow")); copycols=true);
julia> typeof(df2.foo)
Vector{SubArray{Int64, 1, Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}} (alias for Array{SubArray{Int64, 1, Arrow.Primitive{Int64, Array{Int64, 1}}, Tuple{UnitRange{Int64}}, true}, 1})

julia> df2 = copy(DataFrame(Arrow.Table("/tmp/test.arrow")); copycols=true);
julia> typeof(df2.foo)
Vector{SubArray{Int64, 1, Primitive{Int64, Vector{Int64}}, Tuple{UnitRange{Int64}}, true}} (alias for Array{SubArray{Int64, 1, Arrow.Primitive{Int64, Array{Int64, 1}}, Tuple{UnitRange{Int64}}, true}, 1})
ericphanson commented 1 month ago

oh I misunderstood, it's inside a nested vector. I guess copying those would do it?

df = DataFrame(Arrow.Table("/tmp/test.arrow"); copycols=true);
transform!(df, :foo => ByRow(copy) => :foo)