Open baumgold opened 9 months ago
In Arrow.Table, all columns are stored in a Vector{AbstractVector}. Because the element type is abstract, the concrete type of a column cannot be inferred, which causes type instability downstream and poor performance when iterating over a single column.
    julia> using Arrow, Tables

    julia> buf = Arrow.tobuffer((a=[1,2,3], b=[4,5,6]));

    julia> tt = Arrow.Table(buf)
    Arrow.Table with 3 rows, 2 columns, and schema:
     :a  Int64
     :b  Int64

    julia> @code_warntype Tables.getcolumn(tt, :a)
    MethodInstance for Tables.getcolumn(::Arrow.Table, ::Symbol)
      from getcolumn(t::Arrow.Table, nm::Symbol) @ Arrow ~/.julia/packages/Arrow/ID4np/src/table.jl:369
    Arguments
      #self#::Core.Const(Tables.getcolumn)
      t::Arrow.Table
      nm::Symbol
    Body::AbstractVector
    1 ─ %1 = Arrow.lookup(t)::Dict{Symbol, AbstractVector}
    │   %2 = Base.getindex(%1, nm)::AbstractVector
    └──      return %2
This uses Julia v1.10 and Arrow v2.7.1.
    julia> versioninfo()
    Julia Version 1.10.0
    Commit 3120989f39b (2023-12-25 18:01 UTC)
    Build Info:
      Official https://julialang.org/ release
    Platform Info:
      OS: Linux (x86_64-linux-gnu)
      CPU: 48 × Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
      WORD_SIZE: 64
      LIBM: libopenlibm
      LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
      Threads: 5 on 48 virtual cores
    Environment:
      JULIA_NUM_THREADS = 4
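To illustrate the iteration cost without depending on Arrow, here is a minimal sketch in plain Julia. The `cols` container is only a hypothetical stand-in for the `Vector{AbstractVector}` storage described above, and `sum_unstable`, `sum_kernel`, and `sum_barrier` are names made up for this example. It shows a possible caller-side workaround (a function barrier), not a fix inside Arrow itself.

```julia
# Hypothetical stand-in for the column storage described above; not Arrow's code.
cols = AbstractVector[[1, 2, 3], [4, 5, 6]]   # a Vector{AbstractVector}

# The loop runs on a value the compiler only knows as `AbstractVector`,
# so the element type is unknown and each `+` is dispatched at run time.
function sum_unstable(cols, i)
    col = cols[i]              # inferred as AbstractVector
    s = 0
    for x in col
        s += x
    end
    return s
end

# Function barrier: the kernel is compiled for the concrete column type,
# so the loop inside it is type stable.
function sum_kernel(col::AbstractVector{T}) where {T}
    s = zero(T)
    for x in col
        s += x
    end
    return s
end

# One dynamic dispatch at the call boundary, then specialized code.
sum_barrier(cols, i) = sum_kernel(cols[i])
```

Both functions return the same result; only `sum_barrier` moves the hot loop behind a call where the column's concrete type is known. Callers of `Tables.getcolumn(tt, :a)` could apply the same pattern by passing the returned column to a separate function.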