JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable

Error writing DataFrame with empty string column #101

Closed claireh93 closed 6 years ago

claireh93 commented 6 years ago

I get an error when writing data frames that contain an empty string. This could be related to #90, but it seems like a much more common case. The error is similar on Julia v0.6 (shown) and 1.0:

julia> using Feather

julia> using DataFrames

julia> test = DataFrame(a = ["hello"])
1×1 DataFrames.DataFrame
│ Row │ a     │
├─────┼───────┤
│ 1   │ hello │

julia> Feather.write("Test.feather", test)
Feather.Sink("Test.feather", Data.Schema:
rows: 1  cols: 1
Columns:
 "a"  String, Feather.Metadata.CTable("", 1, Feather.Metadata.Column[Feather.Metadata.Column("a", Feather.Metadata.PrimitiveArray(UTF8, PLAIN, 8, 1, 0, 16), 0, nothing, "")], 2, ""), IOStream(<file Test.feather>), "", "", Arrow.ArrowVector[String["hello"]])

julia> test = DataFrame(a = [""])
1×1 DataFrames.DataFrame
│ Row │ a │
├─────┼───┤
│ 1   │   │

julia> Feather.write("Test.feather", test)
ERROR: BoundsError: attempt to access 0-element Array{UInt8,1} at index [1]
Stacktrace:
 [1] throw_boundserror(::Array{UInt8,1}, ::Tuple{Int64}) at ./abstractarray.jl:434
 [2] checkbounds at ./abstractarray.jl:362 [inlined]
 [3] check_buffer_bounds(::Type{UInt8}, ::Array{UInt8,1}, ::Int64, ::Int64) at /Users/claireh/.julia/v0.6/Arrow/src/utils.jl:167
 [4] Type at /Users/claireh/.julia/v0.6/Arrow/src/primitives.jl:36 [inlined]
 [5] Arrow.Primitive(::Array{UInt8,1}) at /Users/claireh/.julia/v0.6/Arrow/src/primitives.jl:51
 [6] Type at /Users/claireh/.julia/v0.6/Arrow/src/lists.jl:113 [inlined]
 [7] Type at /Users/claireh/.julia/v0.6/Arrow/src/lists.jl:116 [inlined]
 [8] arrowformat at /Users/claireh/.julia/v0.6/Arrow/src/arrowvectors.jl:246 [inlined]
 [9] streamto!(::Feather.Sink, ::Type{DataStreams.Data.Column}, ::Array{String,1}, ::Int64, ::Int64) at /Users/claireh/.julia/v0.6/Feather/src/sink.jl:55
 [10] macro expansion at /Users/claireh/.julia/v0.6/DataStreams/src/query.jl:498 [inlined]
 [11] stream!(::DataFrames.DataFrame, ::DataStreams.Data.Query{0x01,Tuple{DataStreams.Data.QueryColumn{0x01,String,1,1,:a,nothing,()}},(),nothing,nothing}, ::Type{DataStreams.Data.Column}, ::Feather.Sink, ::DataStreams.Data.Schema{true,Tuple{String}}, ::Int64) at /Users/claireh/.julia/v0.6/DataStreams/src/query.jl:628
 [12] #stream!#123(::Bool, ::Function, ::DataFrames.DataFrame, ::DataStreams.Data.Query{0x01,Tuple{DataStreams.Data.QueryColumn{0x01,String,1,1,:a,nothing,()}},(),nothing,nothing}, ::Feather.Sink) at /Users/claireh/.julia/v0.6/DataStreams/src/query.jl:620
 [13] (::DataStreams.Data.#kw##stream!)(::Array{Any,1}, ::DataStreams.Data.#stream!, ::DataFrames.DataFrame, ::DataStreams.Data.Query{0x01,Tuple{DataStreams.Data.QueryColumn{0x01,String,1,1,:a,nothing,()}},(),nothing,nothing}, ::Feather.Sink) at ./<missing>:0
 [14] #stream!#121(::Bool, ::Dict{Int64,Function}, ::Function, ::Array{Any,1}, ::Void, ::Void, ::Array{Any,1}, ::Function, ::DataFrames.DataFrame, ::Feather.Sink) at /Users/claireh/.julia/v0.6/DataStreams/src/query.jl:579
 [15] write(::String, ::DataFrames.DataFrame) at /Users/claireh/.julia/v0.6/Feather/src/sink.jl:49

ExpandingMan commented 6 years ago

Yikes! This is not related to #90, but I think I may have caused it when I was fixing another problem. I can't believe we didn't have tests for this. On it!

ExpandingMan commented 6 years ago

Ok, so this wasn't caught because it only happens when you create arrays containing only empty strings. I have fixed this here. Once that's merged into Arrow I'll tag a release, and that will fix this error. You won't need to update Feather itself, but be sure to run "up Feather" so that Arrow gets updated. If you want the fix sooner, you can clone Arrow master or the PR branch.
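To illustrate why only all-empty string arrays trigger this, here is a minimal sketch in plain Julia (1.0 or later). It is not the Arrow.jl implementation: `flatten_strings` and `check_range` are hypothetical helpers written for this example. It assumes the usual Arrow-style layout for string columns, where all bytes are concatenated into one values buffer alongside an offsets vector. When every string is empty, the values buffer has zero length, so a bounds check that unconditionally probes index 1 fails, which matches the BoundsError in the stack trace above.

```julia
# Hypothetical helper (not part of Arrow.jl): concatenate the bytes of all
# strings into one buffer and record where each string ends.
function flatten_strings(strs::Vector{String})
    bytes = UInt8[]
    offsets = Int32[0]
    for s in strs
        append!(bytes, codeunits(s))          # raw UTF-8 bytes of s
        push!(offsets, Int32(length(bytes)))  # end position of s in the buffer
    end
    return bytes, offsets
end

# Hypothetical bounds check that tolerates a zero-length range.
function check_range(buf::Vector{UInt8}, start::Int, len::Int)
    len == 0 && return true            # nothing to read, so nothing to check
    checkbounds(buf, start)            # throws BoundsError if out of range
    checkbounds(buf, start + len - 1)
    return true
end

bytes, offsets = flatten_strings(["hello"])
# bytes holds 5 bytes and offsets == Int32[0, 5]

bytes, offsets = flatten_strings([""])
# bytes is empty and offsets == Int32[0, 0]

# An unconditional probe of index 1 fails on the empty buffer...
checkbounds(Bool, bytes, 1)   # false
# ...while the length-aware check passes.
check_range(bytes, 1, 0)      # true
```

A column such as ["", "hello"] still produces a non-empty values buffer, which would explain why mixed columns did not hit the error.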

ExpandingMan commented 6 years ago

This has been fixed in an Arrow PR, but I haven't merged it yet because for some reason I can't get Travis to run.

ExpandingMan commented 6 years ago

Fixed in Arrow 0.2.3. Once the new tag is merged, run "up Feather" and it should update the Arrow dependency with the fix.