apache / arrow-julia

Official Julia implementation of Apache Arrow
https://arrow.apache.org/julia/
Other
285 stars 59 forks source link

Issue with `Tables.rowtable` when entries contain a vector of strings #167

Closed ericphanson closed 3 years ago

ericphanson commented 3 years ago

with Arrow.jl 1.3,

julia> using Arrow, Tables

julia> table = (; col1 = [["a"], ["a"]])
(col1 = [["a"], ["a"]],)

julia> Arrow.write("test.arrow", table)
"test.arrow"

julia> table2 = Arrow.Table("test.arrow")
Arrow.Table: (col1 = [["a"], ["a"]],)

julia> Tables.rowtable(table2)
ERROR: TypeError: in new, expected Vector{String}, got a value of type SubArray{String, 1, Arrow.List{String, Int32, Vector{UInt8}}, Tuple{UnitRange{Int64}}, true}
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/Tables/UxLRG/src/namedtuples.jl:46 [inlined]
 [2] _iterate(rows::Tables.NamedTupleIterator{Tables.Schema{(:col1,), Tuple{Vector{String}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}}, st::Tuple{})
   @ Tables ~/.julia/packages/Tables/UxLRG/src/namedtuples.jl:37
 [3] iterate
   @ ~/.julia/packages/Tables/UxLRG/src/namedtuples.jl:58 [inlined]
 [4] iterate
   @ ~/.julia/packages/Tables/UxLRG/src/namedtuples.jl:57 [inlined]
 [5] copyto!(dest::Vector{NamedTuple{(:col1,), Tuple{Vector{String}}}}, src::Tables.NamedTupleIterator{Tables.Schema{(:col1,), Tuple{Vector{String}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}})
   @ Base ./abstractarray.jl:843
 [6] _collect
   @ ./array.jl:608 [inlined]
 [7] collect(itr::Tables.NamedTupleIterator{Tables.Schema{(:col1,), Tuple{Vector{String}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}})
   @ Base ./array.jl:602
 [8] rowtable(itr::Arrow.Table)
   @ Tables ~/.julia/packages/Tables/UxLRG/src/namedtuples.jl:99
 [9] top-level scope
   @ REPL[27]:1
quinnj commented 3 years ago

Ah yes, I thought this would probably end up biting us at some point. We're kind of lying with our schema for the table vs. what we actually return. To avoid allocating a full array on each getindex, it currently just constructs a view into the underlying arrow data. We'll have to think of what the best approach is going forward.

ericphanson commented 3 years ago

If I'm understanding your comment right, that's not something new, right? But my example (and the non-minimal one too) works fine on Arrow 1.2, so it seems like a new issue here.

quinnj commented 3 years ago

Yeah, I think we maybe used to allocate the full array, and it was changed to return a view; the code is here if you want to take a look/play around. I would approach this by first checking if we should just change the Tables.schema that is returned for Arrow.Table; if the schema matched what getindex produced, Tables.rowtable wouldn't get angry at us. Secondly/alternatively, I'd look at changing getindex: either just allocating the full Vector when indexing, or maybe trying to do something more efficient like returning an Arrow.Primitive array for bitstypes, and Arrow.List for strings.

ericphanson commented 3 years ago

Ok, thanks for the pointer!

Just as a note to myself / other Beacon users, this also hits Onda.Samples (ref https://github.com/beacon-biosignals/Onda.jl/pull/68):

julia> segments_table = DataFrame(Arrow.Table("blah.arrow"))
ERROR: MethodError: Cannot `convert` an object of type 
  Onda.Samples{Base.ReshapedArray{Float64, 2, SubArray{Float64, 1, Arrow.Primitive{Float64, Vector{Float64}}, Tuple{UnitRange{Int64}}, true}, Tuple{}},Onda.SamplesInfo{String, Vector{String}, String, Float64, Float64, Int16, Float64}} to an object of type 
  Onda.Samples{Matrix{Float64},Onda.SamplesInfo{String, Vector{String}, String, Float64, Float64, var"#s157", Float64} where var"#s157"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}}
Closest candidates are:
  convert(::Type{T}, ::T) where T at essentials.jl:205
quinnj commented 3 years ago

So the more I've thought about it, the more I think we should change the getindex method on Arrow.List to just return a full Vector instead of a view. The user originally saved a vector and they probably want a full vector out, so we should respect that. We can potentially keep the view there and just make sure that it gets converted to a full Vector on the StructTypes.construct call (I'm pretty sure we have a method right now that passes a view through as a full vector). I can get to this pretty soon, unless someone else wants to take a stab at it.

quinnj commented 3 years ago

Ok, haven't added tests yet, but I think this is all that's needed. @ericphanson, would you mind checking your use-case? If you're so inclined, provide a test I can add to my PR?

https://github.com/JuliaData/Arrow.jl/pull/182

ericphanson commented 3 years ago

Thanks! I can confirm with Arrow#main my original use-case works as does the Onda.Samples one. I think the MWE from my original post could serve as a test-case too, although it looks like you added one to the PR already. Let me know if you'd like any others and I can try to make one.

ericphanson commented 3 years ago

I was trying to come up with a MWE for #199 and came across

julia> col = [(; b=1), missing]
2-element Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}:
 (b = 1,)
 missing

julia> table = [(;a = col)]
1-element Vector{NamedTuple{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}}:
 (a = [(b = 1,), missing],)

julia> Tables.rowtable(Arrow.Table(Arrow.tobuffer(table)))
ERROR: TypeError: in new, expected Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}, got a value of type SubArray{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}, 1, Arrow.Struct{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}, Tuple{Arrow.Primitive{Int64, Vector{Int64}}}}, Tuple{UnitRange{Int64}}, true}
Stacktrace:
 [1] macro expansion
   @ ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:46 [inlined]
 [2] _iterate(rows::Tables.NamedTupleIterator{Tables.Schema{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}}, st::Tuple{})
   @ Tables ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:37
 [3] iterate
   @ ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:58 [inlined]
 [4] iterate
   @ ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:57 [inlined]
 [5] copyto!(dest::Vector{NamedTuple{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}}, src::Tables.NamedTupleIterator{Tables.Schema{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}})
   @ Base ./abstractarray.jl:843
 [6] _collect
   @ ./array.jl:608 [inlined]
 [7] collect(itr::Tables.NamedTupleIterator{Tables.Schema{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}})
   @ Base ./array.jl:602
 [8] rowtable(itr::Arrow.Table)
   @ Tables ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:99
 [9] top-level scope
   @ REPL[41]:1

Here, I didn't mean to do table = [(; a= col)] and had meant to do just table = (; a = col) (which does work). However, I think the version that errors here should still work?

I thought it kind of looked like this issue again.