Open jacobadenbaum opened 6 years ago
This isn't a bug. The Feather format supports missing values but columns still must have a type specified for non-missing values, even if all happen to be missing.
The simplest workaround would be to do
df[:B] = convert(Vector{Union{Int,Missing}}, df[:B])
This will make it so that the "non-missing values" of that column will be Int (alternatively you can use any other data type that the Feather standard supports, such as Float64 or String).
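A minimal sketch of that conversion on a plain vector, with no DataFrame involved. It assumes Julia 0.7 or later, where missing lives in Base; the names col and fixed are only illustrative.

```julia
# An all-missing column has element type Missing, so there is no
# non-missing type for a writer to record.
col = [missing, missing, missing]
@assert eltype(col) == Missing

# Widen the element type so that the non-missing part is Int.
fixed = convert(Vector{Union{Int,Missing}}, col)
@assert eltype(fixed) == Union{Int,Missing}
@assert all(ismissing, fixed)   # the stored values are unchanged
```

Only the declared element type changes; every entry is still missing.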
Perhaps we should provide a more convenient workaround or, at the very least, a more user-friendly error message.
That's essentially the exact solution I used to get around this, but it was tricky to find the problem since I was streaming to Feather from a CSV file that I hadn't examined yet, so I wouldn't have known ahead of time that one of the columns was all missing. I think it would be really nice to have Feather choose a sensible default so that it doesn't error out in unexpected ways. Would it be a terrible thing to define this as the default?
feathertype(::Type{Union{}}) = Union{Int, Missing}
The only downside I can think of would be if a user read in the dataframe from a file, was expecting to be able to set df[:B] = "test" and then discovered that they couldn't. But this is a use case where Feather would have errored earlier on anyway with the current implementation, so I don't think it would be too much of a problem.
I definitely don't want to choose some arbitrary type as default just to circumvent the error. We don't want to have Feather files sitting around where the data types come as a surprise to the users. To me the proper behavior here is definitely to throw an error.
I definitely sympathize with your problem though, I've also been in the situation many times where I am loading up or writing some horrible mess of a data set that someone threw at me and I have to figure out what in the world is causing errors. So we should definitely do a PR for an error. I think what should be done here is
feathertype(::Type{<:Any}) = throw(ArgumentError("unsupported type"))
I'm a little confused about why the type wound up being Union{}; I think that behavior might be gone in 0.7.
I'll probably make a PR for this sometime this week, otherwise feel free to make one yourself if you'd like.
As best I can tell, what is going on is that Feather calls Missings.T to get the element type of the column, which (as best I can tell, since it's not documented) strips out Missings.Missing from a type union. So when you call Missings.T(Missings.Missing), it returns Union{}.
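The same stripping behavior can be reproduced without the package. This sketch assumes Julia 1.3 or later and uses Base.nonmissingtype as a stand-in for the Missings.T described above; treating the two as equivalent is an assumption based on the behavior reported in this thread.

```julia
# nonmissingtype strips Missing out of a type union.
@assert nonmissingtype(Union{Int,Missing}) == Int
@assert nonmissingtype(Union{String,Missing}) == String

# With nothing left after stripping, the result is the empty union,
# which is why an all-missing column ends up with type Union{}.
stripped = nonmissingtype(Missing)
@assert stripped === Union{}
```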
Ideally, the error message would be thrown higher up in the stack so that it could tell you exactly which column it encountered the problem in, yes? Currently, the error is encountered when closing the file. It looks to me like one could change (at line 469)
for (i, name) in enumerate(header)
    # ... code before
    # write out array values
    TT = Missings.T(eltype(arr))
    # ... code after
end
to
for (i, name) in enumerate(header)
    # ... code before
    # reject a column whose element type is exactly Missing
    if missingtype(eltype(arr))
        msg = "Unsupported type: Column $name cannot have type $(eltype(arr))"
        throw(ArgumentError(msg))
    end
    TT = Missings.T(eltype(arr))
    # ... code after
end
Where missingtype is a new function to check whether or not a type is Missing:
missingtype(::Type) = false
missingtype(::Type{Missings.Missing}) = true
It shouldn't affect performance since it's just called once at the close of the file. Although I think maybe it would be better to throw this error at the beginning, rather than the end. I just am not quite familiar enough with the internals to know which function would be the right place to check for this.
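One way to fail at the beginning rather than the end is to validate every column before anything is written. The sketch below is not Feather's code: check_columns is a hypothetical helper, the table is a plain NamedTuple of vectors, and it assumes Julia 1.3 or later for Base.nonmissingtype.

```julia
# Hypothetical helper: reject columns whose element type has no
# non-missing part, naming the offending column in the error.
function check_columns(table::NamedTuple)
    for (name, col) in pairs(table)
        if nonmissingtype(eltype(col)) === Union{}
            msg = "Unsupported type: Column $name cannot have type $(eltype(col))"
            throw(ArgumentError(msg))
        end
    end
    return nothing
end

table = (A = [1, 2, 3], B = [missing, missing, missing])

# Capture the error instead of letting it propagate.
err = try
    check_columns(table)
    nothing
catch e
    e
end
@assert err isa ArgumentError
@assert occursin("Column B", err.msg)
```

Because the check only inspects element types, its cost is one pass over the column list, independent of the number of rows.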
I'm happy to make a PR if this or something like it looks good.
Ok, I've had a chance to look at this a bit.
I think there's a bigger problem that this issue is a symptom of. Right now what happens is the following:
Data.Schema is created from the data source here.
[...] Data.stream!.
[...] Data.Schema.
What should happen is
[...]
This would do a lot more to ensure correctness since it would guarantee that you don't even get to the point of trying to sort out the metadata until it is known for certain that you can get proper Arrow formatted columns. This is also all another symptom of the fact that Feather doesn't seem to use any sort of standard Arrow metadata format, which would also simplify things significantly.
Making the changes I've suggested here would be non-trivial, and it would require adding brand new code for generating a Data.Schema from a Vector{ArrowVector}, which I'm still not completely convinced is the right way of doing it.
@quinnj any opinion on whether changing the steps as I've described above would be worth it?
Anyway, I also just realized that you seem to be using the latest tag rather than master. The current master is a thorough overhaul of the whole package (which will be tagged some time after 0.7 stabilizes), so you might try cloning it and seeing what error you get. You should get an error during Data.stream!, which might be a little easier to parse than what you showed above.
When writing a dataframe that has a column of entirely missing values, Feather throws a MethodError.