JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable
Other
109 stars 27 forks source link

Error when writing dataframe with a column of all missing values #86

Open jacobadenbaum opened 6 years ago

jacobadenbaum commented 6 years ago

When writing a dataframe that has a column of entirely missing values, Feather throws a MethodError.

julia> using Feather

julia> x = DataFrame(A=randn(10), B=missing)
10×2 DataFrames.DataFrame
│ Row │ A         │ B       │
├─────┼───────────┼─────────┤
│ 1   │ 0.248491  │ missing │
│ 2   │ -1.03116  │ missing │
│ 3   │ -0.413887 │ missing │
│ 4   │ -0.712535 │ missing │
│ 5   │ 0.879271  │ missing │
│ 6   │ -0.20734  │ missing │
│ 7   │ 0.0945996 │ missing │
│ 8   │ 0.47328   │ missing │
│ 9   │ -0.73724  │ missing │
│ 10  │ 2.49184   │ missing │

julia> Feather.write("test.feather", x)
ERROR: MethodError: Feather.feathertype(::Type{Union{}}) is ambiguous. Candidates:
  feathertype(::Type{#s84} where #s84<:AbstractString) in Feather at /Users/jacob/.julia/v0.6/Feather/src/Feather.jl:303
  feathertype(::Type{#s84} where #s84<:Feather.Arrow.Time) in Feather at /Users/jacob/.julia/v0.6/Feather/src/Feather.jl:300
Possible fix, define
  feathertype(::Type{Union{}})
Stacktrace:
 [1] close!(::Feather.Sink{DataFrames.DataFrameStream{Tuple{Array{Float64,1},Array{Missings.Missing,1}}}}) at /Users/jacob/.julia/v0.6/Feather/src/Feather.jl:471
 [2] #write#46(::Bool, ::Dict{Int64,Function}, ::Array{Any,1}, ::Function, ::String, ::DataFrames.DataFrame) at /Users/jacob/.julia/v0.6/Feather/src/Feather.jl:523
 [3] write(::String, ::DataFrames.DataFrame) at /Users/jacob/.julia/v0.6/Feather/src/Feather.jl:522
ExpandingMan commented 6 years ago

This isn't a bug. The Feather format supports missing values but columns still must have a type specified for non-missing values, even if all happen to be missing.

The simplest work around would be to do

df[:B] = convert(Vector{Union{Int,Missing}}, df[:B])

This will make it so that the "non-missing values" of that column will be Int (alternatively you can use any other data type that the Feather standard supports such as Float64 or String).

Perhaps we should provide a more convenient work around or, at the very least, a more user-friendly error message.

jacobadenbaum commented 6 years ago

That's essentially the exact solution I used to get around this, but it was tricky to find the problem since I was streaming to feather from a csv file that I hadn't examined yet, so I wouldn't have known that one of the columns was all missing ahead of time. I think it would be really nice to have Feather choose a sensible default so that it doesn't error out in unexpected ways. Would it be a terrible thing to define this as the default?

feathertype(::Type{Union{}}) = Union{Int, Missing}

The only downside I can think of would be if a user read in the dataframe from a file, was expecting to be able to set df[:B] = "test" and then discovered that they couldn't. But this is a use case where Feather would have errored earlier on anyway with the current implementation, so I don't think it would be too much of a problem.

ExpandingMan commented 6 years ago

I definitely don't want to choose some arbitrary type as default just to circumvent the error. We don't want to have Feather files sitting around where the data types come as a surprise to the users. To me the proper behavior here is definitely to throw an error.

I definitely sympathize with your problem though, I've also been in the situation many times where I am loading up or writing some horrible mess of a data set that someone threw at me and I have to figure out what in the world is causing errors. So we should definitely do a PR for an error. I think what should be done here is

feathertype(::Type{<:Any}) = throw(ArgumentError("unsupported type"))

I'm a little confused about why the type wound up being Union{} think that behavior might be gone in 0.7.

I'll probably make a PR for this sometime this week, otherwise feel free to make one yourself if you'd like.

jacobadenbaum commented 6 years ago

As best I can tell, what is going on is that feather calls to Missings.T to get the element type of the column, which (as best I can tell since it's not documented), strips out Missings.Missing from a type union. So when you call Missings.T(Missings.Missing), it returns Union{}.

Ideally, the error message would be thrown higher up in the stack so that it could tell you exactly which column it encountered the problem in, yes? Currently, the error is encountered when closing the file. It looks to me like one could change (at line 469)

for (i, name) in enumerate(header)
    ... # Code before
    # write out array values
    TT = Missings.T(eltype(arr))
    ... # Code after
end

to

for (i,name) in enumerate(header)
    ...  # Code before
    TT = Missings.T(eltype(arr))
    missingtype(TT) || begin 
        msg = "Unsupported type: Column $name cannot have type $(eltype(arr))"
        throw(ArgumentError(msg))
    end
    ... # Code after
end

Where missingtype is a new function to check whether or not a type is Missing:

missingtype(::Type) = false
missingtype(::Type{Missings.Missing}) = true

It shouldn't affect performance since it's just called once at the close of the file. Although I think maybe it would be better to throw this error at the beginning, rather than the end. I just am not quite familiar enough with the internals to know which function would be the right place to check for this.

jacobadenbaum commented 6 years ago

I'm happy to make a PR if this or something like it looks good.

ExpandingMan commented 6 years ago

Ok, I've had a chance to look at this a bit.

I think there's a bigger problem that this issue is a symptom of. Right now what happens is the following:

  1. A Data.Schema is created from the data source here.
  2. The columns of the data source are converted to arrow format here using Data.stream!.
  3. The Feather metadata flatbuffer is created from the Data.Schema.

What should happen is

  1. The columns of the data source get converted to the Arrow format.
  2. A schema is created from the set of Arrow formatted columns.
  3. The schema is converted to the Feather flatbuffer before any other writing takes place to ensure that this process can happen without errors.

This would do a lot more to ensure correctness since it would guarantee that you don't even get to the point of trying to sort out the metadata until it is known for certain that you can get proper Arrow formatted columns. This is also all another symptom of the fact that Feather doesn't seem to use any sort of standard Arrow metadata format, which would also simplify things significantly.

Making the changes I've suggested here would be non-trivial, and it would require adding brand new code for generating a Data.Schema from a Vector{ArrowVector}, which I'm still not completely convinced is the right way of doing it.

@quinnj any opinion on whether changing the steps as I've described above would be worth it?

Any way, I also just realized that you seem to be using the latest tag rather than master. The current master is a thorough overhaul of the whole package (which will be tagged some time after 0.7 stabilizes) so you might try cloning it and seeing what error you get. You should get an error during Data.stream! which might be a little easier to parse than what you showed above.