Closed ericphanson closed 3 years ago
Ah yes, I thought this would probably end up biting us at some point. We're kind of lying with our schema for the table vs. what we actually return. To avoid allocating a full array on each getindex
, it currently just constructs a view into the underlying arrow data. We'll have to think of what the best approach is going forward.
If I'm understanding your comment right, that's not something new, right? But my example (and the non-minimal one too) works fine on Arrow 1.2, so it seems like a new issue here.
Yeah, I think we maybe used to allocate the full array, and it was changed to return a view; the code is here if you want to take a look/play around. I would approach this by first checking if we should just change the Tables.schema
that is returned for Arrow.Table
; if the schema matched what getindex
produced, Tables.rowtable
wouldn't get angry at us. Secondly/alternatively, I'd look at changing getindex
: either just allocating the full Vector
when indexing, or maybe trying to do something more efficient like returning an Arrow.Primitive
array for bitstypes, and Arrow.List
for strings.
Ok, thanks for the pointer!
Just as a note to myself / other Beacon users, this also hits Onda.Samples
(ref https://github.com/beacon-biosignals/Onda.jl/pull/68):
julia> segments_table = DataFrame(Arrow.Table("blah.arrow"))
ERROR: MethodError: Cannot `convert` an object of type
Onda.Samples{Base.ReshapedArray{Float64, 2, SubArray{Float64, 1, Arrow.Primitive{Float64, Vector{Float64}}, Tuple{UnitRange{Int64}}, true}, Tuple{}},Onda.SamplesInfo{String, Vector{String}, String, Float64, Float64, Int16, Float64}} to an object of type
Onda.Samples{Matrix{Float64},Onda.SamplesInfo{String, Vector{String}, String, Float64, Float64, var"#s157", Float64} where var"#s157"<:Union{Float32, Float64, Int16, Int32, Int64, Int8, UInt16, UInt32, UInt64, UInt8}}
Closest candidates are:
convert(::Type{T}, ::T) where T at essentials.jl:205
So the more I've thought about it, the more I think we should change the getindex
method on Arrow.List
to just return a full Vector
instead of a view. The user originally saved a vector and they probably want a full vector out, so we should respect that. We can potentially keep the view
there and just make sure that it gets converted to a full Vector on the StructTypes.construct
call (I'm pretty sure we have a method right now that passes a view through as a full vector). I can get to this pretty soon, unless someone else wants to take a stab at it.
Ok, haven't added tests yet, but I think this is all that's needed. @ericphanson, would you mind checking your use-case? If you're so inclined, provide a test I can add to my PR?
Thanks! I can confirm with Arrow#main
my original use-case works as does the Onda.Samples
one. I think the MWE from my original post could serve as a test-case too, although it looks like you added one to the PR already. Let me know if you'd like any others and I can try to make one.
I was trying to come up with a MWE for #199 and came across
julia> col = [(; b=1), missing]
2-element Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}:
(b = 1,)
missing
julia> table = [(;a = col)]
1-element Vector{NamedTuple{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}}:
(a = [(b = 1,), missing],)
julia> Tables.rowtable(Arrow.Table(Arrow.tobuffer(table)))
ERROR: TypeError: in new, expected Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}, got a value of type SubArray{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}, 1, Arrow.Struct{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}, Tuple{Arrow.Primitive{Int64, Vector{Int64}}}}, Tuple{UnitRange{Int64}}, true}
Stacktrace:
[1] macro expansion
@ ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:46 [inlined]
[2] _iterate(rows::Tables.NamedTupleIterator{Tables.Schema{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}}, st::Tuple{})
@ Tables ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:37
[3] iterate
@ ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:58 [inlined]
[4] iterate
@ ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:57 [inlined]
[5] copyto!(dest::Vector{NamedTuple{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}}, src::Tables.NamedTupleIterator{Tables.Schema{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}})
@ Base ./abstractarray.jl:843
[6] _collect
@ ./array.jl:608 [inlined]
[7] collect(itr::Tables.NamedTupleIterator{Tables.Schema{(:a,), Tuple{Vector{Union{Missing, NamedTuple{(:b,), Tuple{Int64}}}}}}, Tables.RowIterator{Tables.CopiedColumns{Arrow.Table}}})
@ Base ./array.jl:602
[8] rowtable(itr::Arrow.Table)
@ Tables ~/.julia/packages/Tables/uYJXY/src/namedtuples.jl:99
[9] top-level scope
@ REPL[41]:1
Here, I didn't mean to do table = [(; a= col)]
and had meant to do just table = (; a = col)
(which does work). However, I think the version that errors here should still work?
I thought it kind of looked like this issue again.
with Arrow.jl 1.3,