apache / arrow-julia

Official Julia implementation of Apache Arrow
https://arrow.apache.org/julia/

Type instability in getcolumn #499

Open baumgold opened 4 months ago

baumgold commented 4 months ago

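In Arrow.Table, all columns are stored in a Vector{AbstractVector}. Because the element type is abstract, Tables.getcolumn is inferred to return AbstractVector rather than the concrete column type. This causes type instability in downstream code and hurts performance when iterating over a single column.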

julia> using Arrow, Tables

julia> buf = Arrow.tobuffer((a=[1,2,3], b=[4,5,6]));

julia> tt = Arrow.Table(buf)
Arrow.Table with 3 rows, 2 columns, and schema:
 :a  Int64
 :b  Int64

julia> @code_warntype Tables.getcolumn(tt, :a)
MethodInstance for Tables.getcolumn(::Arrow.Table, ::Symbol)
  from getcolumn(t::Arrow.Table, nm::Symbol) @ Arrow ~/.julia/packages/Arrow/ID4np/src/table.jl:369
Arguments
  #self#::Core.Const(Tables.getcolumn)
  t::Arrow.Table
  nm::Symbol
Body::AbstractVector
1 ─ %1 = Arrow.lookup(t)::Dict{Symbol, AbstractVector}
│   %2 = Base.getindex(%1, nm)::AbstractVector
└──      return %2

The output above was produced with Julia v1.10 and Arrow v2.7.1.

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 48 × Intel(R) Xeon(R) Gold 6136 CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake-avx512)
  Threads: 5 on 48 virtual cores
Environment:
  JULIA_NUM_THREADS = 4
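
For anyone hitting this in the meantime, a function barrier is a common way to limit the cost of an abstractly typed column. The sketch below does not use Arrow.jl: `ToyTable`, `getcol`, `sumcol`, and `sum_via_barrier` are hypothetical names, and the `Dict{Symbol,AbstractVector}` field only imitates the lookup type shown in the `@code_warntype` output. It reproduces the abstract return type and shows that the inner loop is still inferred concretely once the column is passed to a separate function.

```julia
# Hypothetical stand-in for the abstractly typed column storage reported above.
# This is NOT Arrow.jl's actual struct; it only mirrors the Dict value type.
struct ToyTable
    lookup::Dict{Symbol,AbstractVector}
end

# Return type is inferred as AbstractVector, as in the @code_warntype output.
getcol(t::ToyTable, nm::Symbol) = t.lookup[nm]

# Kernel: Julia compiles a specialized method for each concrete column type.
function sumcol(col::AbstractVector)
    s = zero(eltype(col))
    for x in col
        s += x
    end
    return s
end

# Function barrier: one dynamic dispatch on the column, then a type-stable loop.
sum_via_barrier(t::ToyTable, nm::Symbol) = sumcol(getcol(t, nm))

tt = ToyTable(Dict{Symbol,AbstractVector}(:a => [1, 2, 3], :b => [4, 5, 6]))

# Inspect what inference sees for the lookup and for the kernel.
inferred = only(Base.return_types(getcol, (ToyTable, Symbol)))
kernel_inferred = only(Base.return_types(sumcol, (Vector{Int},)))
total = sum_via_barrier(tt, :a)
```

This only moves the dynamic dispatch to a single call per column. It does not make the lookup itself type-stable, which would require a change in how the table stores its columns.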