JuliaData / Feather.jl

Read and write feather files in pure Julia
https://juliadata.github.io/Feather.jl/stable
Other
109 stars 27 forks source link

Error with writing Feather on Julia 1.2+ #124

Open xiaodaigh opened 5 years ago

xiaodaigh commented 5 years ago

update Included coded to download

This error is not in the Julia 1.1 branch. I just read the Fannie Mae 2016Q1.txt performance file from here(you need to register to download it).

;wget http://rapidsai-data.s3-website.us-east-2.amazonaws.com/notebook-mortgage-data/mortgage_2000.tgz
;tar xzvf mortgage_2000.tgz

using CSV
smallest_file = sort(joinpath.("perf", readdir("perf")), by=x->filesize(x))[1]
@time a = CSV.read(smallest_file, header=false, delim = '|')
@time Feather.write("c:/data/a.feather", a)

which yield this error which I can reproduce on Julia 1.2 and 1.3-rc1 but not 1.1.1.

ERROR: MethodError: arrowformat(::CSV.Column{Missing,Missing}) is ambiguous. Candidates:
  arrowformat(x::AbstractArray{Union{Missing, T},1}) where T<:Dates.Date in Arrow at C:\Users\RTX2080\.julia\packages\Arrow\b4oSO\src\arrowvectors.jl:246
  arrowformat(x::AbstractArray{Union{Missing, T},1}) where T<:Dates.DateTime in Arrow at C:\Users\RTX2080\.julia\packages\Arrow\b4oSO\src\arrowvectors.jl:246
  arrowformat(x::AbstractArray{Union{Missing, T},1}) where T<:Dates.Time in Arrow at C:\Users\RTX2080\.julia\packages\Arrow\b4oSO\src\arrowvectors.jl:246
  arrowformat(x::AbstractArray{Union{Missing, J},1}) where J<:AbstractString in Arrow at C:\Users\RTX2080\.julia\packages\Arrow\b4oSO\src\arrowvectors.jl:246
Possible fix, define
  arrowformat(::AbstractArray{Missing,1})
Stacktrace:
 [1] getarrow(::CSV.Column{Missing,Missing}) at C:\Users\RTX2080\.julia\packages\Feather\R3KXg\src\sink.jl:40
 [2] #write#18(::String, ::String, ::typeof(Feather.write), ::IOStream, ::DataFrames.DataFrame) at C:\Users\RTX2080\.julia\packages\Feather\R3KXg\src\sink.jl:18
 [3] #write at .\none:0 [inlined]
 [4] #20 at C:\Users\RTX2080\.julia\packages\Feather\R3KXg\src\sink.jl:32 [inlined]
 [5] #open#312(::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}, ::typeof(open), ::getfield(Feather, Symbol("##20#21")){String,String,DataFrames.DataFrame}, ::String, ::Vararg{String,N} where N) at .\iostream.jl:375
 [6] open at .\iostream.jl:373 [inlined]
 [7] #write#19 at C:\Users\RTX2080\.julia\packages\Feather\R3KXg\src\sink.jl:31 [inlined]
 [8] write(::String, ::DataFrames.DataFrame) at C:\Users\RTX2080\.julia\packages\Feather\R3KXg\src\sink.jl:31
 [9] top-level scope at util.jl:156
caused by [exception 1]
IOError: symlink: operation not permitted (EPERM)
Stacktrace:
 [1] uv_error at .\libuv.jl:90 [inlined]
 [2] symlink(::String, ::String) at .\file.jl:800
 [3] probe_symlink_creation(::String) at C:\Users\RTX2080\.julia\packages\BinaryProvider\TcAwt\src\PlatformEngines.jl:121
 [4] #probe_platform_engines!#30(::Bool, ::typeof(BinaryProvider.probe_platform_engines!)) at C:\Users\RTX2080\.julia\packages\BinaryProvider\TcAwt\src\PlatformEngines.jl:175
 [5] probe_platform_engines! at C:\Users\RTX2080\.julia\packages\BinaryProvider\TcAwt\src\PlatformEngines.jl:169 [inlined]
 [6] __init__() at C:\Users\RTX2080\.julia\packages\BinaryProvider\TcAwt\src\BinaryProvider.jl:28
 [7] _include_from_serialized(::String, ::Array{Any,1}) at .\loading.jl:685
 [8] _require_search_from_serialized(::Base.PkgId, ::String) at .\loading.jl:765
 [9] _tryrequire_from_serialized(::Base.PkgId, ::UInt64, ::String) at .\loading.jl:700
 [10] _require_search_from_serialized(::Base.PkgId, ::String) at .\loading.jl:754
 [11] _require(::Base.PkgId) at .\loading.jl:990
 [12] require(::Base.PkgId) at .\loading.jl:911
 [13] require(::Module, ::Symbol) at .\loading.jl:906
ExpandingMan commented 5 years ago

This error seems to be because you are attempting to write a column of all missings. The Feather format does not support this. I'll certainly agree that we could use a better error message.

I would suspect that, for whatever reason (worse or better type inference?) it is not trying to write a column this way in 1.1.1. Can you help confirm that this is the case?

xiaodaigh commented 5 years ago

I prefer the 1.1 behaviour with an error message. Because sometimes ppl do have all missing in a column. Especially if they work with large datasets and want to read in chunks. The first may be all empty

xiaodaigh commented 5 years ago
using DataFrames, Feather
a = DataFrame(a = [missing])
Feather.write("c:/data/a.feather", a)
Feather.read("c:/data/a.feather")

You are right. In Julia 1.1 it reads it back as String. I think this is sensible default behaviour. Adding a warning message would be great!

It's rather poor user experience to have it fail and have no viable suggestions or alternatives.

Happy to help with a PR if you can point me to where it happens.

nalimilan commented 5 years ago

Maybe a better default would be to write a Union{Missing,Bool} column? That uses less memory than String.

ExpandingMan commented 5 years ago

I don't like the idea of having mysterious and arbitrary behavior just to circumvent what would be a perfectly reasonable error once we insert a clear, helpful error message. The type we'd choose to coerce this into working would be completely arbitrary. The resulting schema would also come as a surprise to any program looking to load the file: if for example you save something which was expected to be Union{String,Missing}, saved it as Union{Bool,Missing} (in the case that we choose the arbitrary type Bool) you could wind up with undefined behavior. At least if we had an error message everybody would be entirely clear about what's going on. It also would take very little effort on the part of the program writing out the file to convert the column to Union{T,Missing} where T is an appropriate type chosen for the circumstance.

I prefer to adhere to the philosophy of the Julia language itself and avoid any mysterious, magical behavior.

xiaodaigh commented 5 years ago

@ExpandingMan As long as we provide utility functions and or error messages that also show explicit solution e.g. "you can export this by giving it a concrete-type like this Feather.write("path_to.feather", convert_missing_to_str(df))".

Otherwise, the usability is really poor

ExpandingMan commented 5 years ago

We can have an error that says something like

Feather format does not support columns with element type `Missing`, possible solution is convert to `AbstractVector{Union{T,Missing}}` where `T` a supported element type.

I'm not opposed to having alternative functions or keyword arguments for writing in more permissive ways in principle, but I'm not sure this one issue in and of itself justifies it, and besides, I'm not sure its place would be in this package.

I'll try to patch this within the next couple of days. Thanks for confirming the 1.1 behavior.

xiaodaigh commented 5 years ago

If the function says a possible solution is something then for convenience it would be great to provide convenience functions for that.

xiaodaigh commented 5 years ago

Feather is pretty small. So you world prefer another package like FeatherUtils.jl for utilities like this?

ExpandingMan commented 5 years ago

All we are talking about is convert(AbstractVector{Union{Int,Missing}}, v), that hardly seems to call for a function in the API.

xiaodaigh commented 5 years ago

I see. What if there are hundreds of columns? It's a barrier for beginners, for sure.

Feather format does not support columns with element type `Missing`  possible solution is convert to `AbstractVector{Union{T,Missing}}` where `T` a supported element type.

For example `convert(AbstractVector{Union{Int,Missing}}, v)`. 

For convenience, you can call `missing_cols_to_string!(df)` to have the conversion to String automatically. 

Or `missing_cols_to_type!(T)`, e.g. `missing_cols_to_type!(Int)` to have all `Vector{Missing}` columns converted to `Vector{Missing, Int}`. 

The more barriers we remove, the greater the usability. For experienced Julia programmers, convert doesn't seem that hard, but it's a barrier. It's a concept that ideally, a user doesn't NEED to know to use DataFrames well. I guess it's a philosophy. But I think I represent those users who aren't that well-versed in Julia.

If the user wants to convert different columns to different types, then they can write their own.

matthieugomez commented 5 years ago

I also think usability is important for read/write packages. It would be nice to have a keyword argument such as missingtostring = true.

ExpandingMan commented 5 years ago

It seems pretty clear cut to me that this should throw an error, and that resolving all the ambiguities that come with this situation do not fall under the purview of this package:

  1. If this happens, it suggests that the user believes that the schema can be inferred from their table when this is not the case. Perhaps they want to resolve this only to write to a feather, or perhaps they want to change the part of their program which constructed this array with eltype Missing in the first place, who knows.
  2. There is no information in the column of all Missings itself, perhaps the user just wants to drop it when writing data.
  3. As I've already mentioned, if the column is to be kept, it's unclear what the element type of the resulting column in the metadata should be. Again, choosing an arbitrary element type purely for the sake of circumventing an error could lead to undefined behavior on the part of the program that's reading the feather.
  4. Even if a user elects to convert the column to have an eltype Union{T,Missing}, it's not entirely clear what the container type should be. The most appropriate choice would seem to be FillArray from FillArrays.jl, should we include that package as a dependency just to handle this edge case? (hint: no).

There are undoubtedly tons of other edge cases like this in which something is trying to be written which the feather metadata cannot quite accommodate and in which the alternatives are ambiguous. Rather than attempting to handle them all, we should look to provide informative errors so that people who will make far better decisions than our arbitrary guesses will decide what to do about it.

matthieugomez commented 5 years ago

Sorry I was unclear. I agree it should be an error. I just meant it would be nice to have a keyword argument to convert the Missing columns to something like Union{Missing, String} when it happens.