JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable

What are the constraints on the types of data in a `DataFrame` for `Feather.write` to apply #141

Open thudjx opened 4 years ago

thudjx commented 4 years ago

I have a DataFrame as follows:

julia> test_data
15×3 DataFrame
│ Row │ PetalLength │ PetalWidth │ Species    │
│     │ Float64     │ Float64    │ Cat…       │
├─────┼─────────────┼────────────┼────────────┤
│ 1   │ 1.6         │ 0.2        │ setosa     │
│ 2   │ 1.7         │ 0.3        │ setosa     │
│ 3   │ 1.6         │ 0.2        │ setosa     │
│ 4   │ 1.5         │ 0.1        │ setosa     │
│ 5   │ 1.4         │ 0.2        │ setosa     │
│ 6   │ 1.3         │ 0.2        │ setosa     │
│ 7   │ 1.5         │ 0.2        │ setosa     │
│ 8   │ 4.5         │ 1.5        │ versicolor │
│ 9   │ 4.9         │ 1.5        │ versicolor │
│ 10  │ 4.4         │ 1.2        │ versicolor │
│ 11  │ 5.9         │ 2.1        │ virginica  │
│ 12  │ 5.1         │ 2.0        │ virginica  │
│ 13  │ 6.0         │ 1.8        │ virginica  │
│ 14  │ 5.6         │ 2.4        │ virginica  │
│ 15  │ 5.2         │ 2.3        │ virginica  │

where the element type of :Species is CategoricalValue{String,UInt8}. When I try to store it in Feather format, an error occurs:

julia> Feather.write("test_data.feather",test_data)
ERROR: type CategoricalPool has no field index
Stacktrace:
 [1] getproperty(::CategoricalPool{String,UInt8,CategoricalValue{String,UInt8}}, ::Symbol) at .\Base.jl:33
 [2] getlevels(::CategoricalArray{String,1,UInt8,String,CategoricalValue{String,UInt8},Union{}}) at C:\Users\dongjx\.julia\packages\Arrow\q3tEJ\src\dictencoding.jl:167
 [3] Arrow.DictEncoding(::CategoricalArray{String,1,UInt8,String,CategoricalValue{String,UInt8},Union{}}) at C:\Users\dongjx\.julia\packages\Arrow\q3tEJ\src\dictencoding.jl:68
 [4] arrowformat(::CategoricalArray{String,1,UInt8,String,CategoricalValue{String,UInt8},Union{}}) at C:\Users\dongjx\.julia\packages\Arrow\q3tEJ\src\arrowvectors.jl:242
 [5] getarrow(::CategoricalArray{String,1,UInt8,String,CategoricalValue{String,UInt8},Union{}}) at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:37
 [6] write(::IOStream, ::DataFrame; description::String, metadata::String) at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:18
 [7] #20 at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:32 [inlined]
 [8] open(::Feather.var"#20#21"{String,String,DataFrame}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at .\io.jl:298
 [9] open at .\io.jl:296 [inlined]
 [10] #write#19 at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:31 [inlined]
 [11] write(::String, ::DataFrame) at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:31
 [12] top-level scope at REPL[36]:1

Well, let me convert the type of :Species:

julia> test_data[!,:Species]=convert(Vector{Union{String,UInt8}},test_data[!,:Species])

and try to store it again, resulting in another error:

julia> Feather.write("test_data.feather",test_data)
ERROR: ArgumentError: cannot reinterpret `Union{UInt8, String}` `UInt8`, type `Union{UInt8, String}` is not a bits type
Stacktrace:
 [1] (::Base.var"#throwbits#203")(::Type{Union{UInt8, String}}, ::Type{UInt8}, ::Type{Union{UInt8, String}}) at .\reinterpretarray.jl:16
 [2] reinterpret(::Type{UInt8}, ::Array{Union{UInt8, String},1}) at .\reinterpretarray.jl:34
 [3] Arrow.Primitive(::Array{Union{UInt8, String},1}) at C:\Users\dongjx\.julia\packages\Arrow\q3tEJ\src\primitives.jl:48
 [4] arrowformat(::Array{Union{UInt8, String},1}) at C:\Users\dongjx\.julia\packages\Arrow\q3tEJ\src\arrowvectors.jl:242
 [5] getarrow(::Array{Union{UInt8, String},1}) at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:37
 [6] write(::IOStream, ::DataFrame; description::String, metadata::String) at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:18
 [7] #20 at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:32 [inlined]
 [8] open(::Feather.var"#20#21"{String,String,DataFrame}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at .\io.jl:298
 [9] open at .\io.jl:296 [inlined]
 [10] #write#19 at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:31 [inlined]
 [11] write(::String, ::DataFrame) at C:\Users\dongjx\.julia\packages\Feather\pbm3o\src\sink.jl:31
 [12] top-level scope at REPL[40]:1

So I tried converting the column to plain String:

julia> test_data[!,:Species]=convert(Vector{String},test_data[!,:Species])

and try again:

julia> Feather.write("test_data.feather",test_data)
"test_data.feather"

And it works!
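
The same workaround can be applied to every categorical string column at once. The following is only a minimal sketch: `stringify_categoricals!` is a hypothetical helper name (it is not part of Feather.jl or DataFrames.jl), and it assumes the columns hold String values with no missing entries, as :Species does here.

```julia
using DataFrames, CategoricalArrays

# Hypothetical helper, not part of Feather.jl or DataFrames.jl.
# Replaces every categorical String column with a plain Vector{String},
# using the same convert call that worked for :Species above.
function stringify_categoricals!(df::DataFrame)
    for name in names(df)
        col = df[!, name]
        if col isa CategoricalVector{String}
            df[!, name] = convert(Vector{String}, col)
        end
    end
    return df
end

# Small stand-in for test_data
df = DataFrame(PetalLength = [1.6, 4.5, 5.9],
               Species = categorical(["setosa", "versicolor", "virginica"]))
stringify_categoricals!(df)
eltype(df.Species)  # now String
```

After this call the DataFrame contains only plain Float64 and String columns, which is the case that was written successfully above. The categorical level information is lost in the process.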

But I still have a question. Here my test_data comes from RDatasets.jl and is simple enough that the :Species column can be converted to an Array of String. But what if my data type is more complex and I can't do this conversion? Furthermore, I have now seen two scenarios in which a DataFrame cannot be written to a .feather file. So what are the general constraints on the column types of a DataFrame for Feather.write to work?

Thanks in advance.

dmbates commented 4 years ago

It appears that this issue is because the Arrow package needs to be updated for changes in CategoricalArrays. I think the Arrow structure for a CategoricalArray or a PooledArray should be DictEncoding. The code in Arrow is trying to use getlevels to, well, get the levels of the CategoricalArray, whereas now, according to DataAPI, I think it should use levels. @ExpandingMan Should this issue be transferred to the Arrow package?
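
To illustrate the idea behind dictionary encoding with the public `levels` accessor, here is a conceptual sketch using only CategoricalArrays. It is not Arrow's actual implementation; the variable names are made up for illustration.

```julia
using CategoricalArrays

species = categorical(["setosa", "versicolor", "virginica", "setosa"])

# Public accessor for the distinct values (the "dictionary"),
# instead of reading internal fields of the CategoricalPool.
dict = levels(species)

# Integer position of each element within `dict` (the "indices").
codes = levelcode.(species)

# Rebuilding the column from dictionary + indices gives back the values.
decoded = dict[codes]
```

A dictionary-encoded column stores `dict` once and one integer per row, which is why it is a natural fit for a CategoricalArray or a PooledArray.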

dmbates commented 4 years ago

It appears as if https://github.com/ExpandingMan/Arrow.jl/pull/52 already addresses this issue.