JuliaData / Missings.jl

Missing value support for Julia
Other
72 stars 19 forks source link

parse with Missing #61

Open bkamins opened 6 years ago

bkamins commented 6 years ago

I would like to put under consideration if we want to add methods to parse that accept Union{T, Missing} argument (or create a similar method). This would be something like tryparse but returning missing on failure.

I would think of something like:

nalimilan commented 6 years ago

Actually with https://github.com/JuliaLang/julia/pull/23642 tryparse will have return type Union{Some{T}, Void} (or maybe just Union{T, Void} depending on the outcome of discussions). I'm not sure we want to allow parse(Union{T, Missing}, ...) in addition to that. There have been lots of discussions about this kind of syntax around the introduction of tryparse in Base. One issue is that parse(Union{T, Missing}, x) could be syntax to parse "missing" as missing, numbers as Int, and throw and error in other cases.

bkamins commented 6 years ago

Close this issue then or keep for a reference?

nalimilan commented 6 years ago

I'd say it depends on whether you think this syntax is needed or not.

bkamins commented 6 years ago

I believe we need a function that returns missing if parsing fails. It does not have to be prase, actually tryparse is more natural, but now it returns Nullable (which behavior I earlier thought would be retained as computer scientist missingness).

So let me comment on a use case and you can decide which way to go (I have much less understanding of the whole ecosystem context). A typical scenario with DataFrames is to read some data which is messy but is supposed to contain integers. Assume that we already have it as ["1", "2", "?", "", "3", "unknown"].

A pure approach would be to handle the missing data carefully, but a practical workflow (and this is how R works) is to parse it as Vector{Int} and make all that is not an integer a missing.

Computer scientist now can do:

julia> tryparse.(Int, ["1", "2", "?", "", "3", "unknown"])
6-element Array{Nullable{Int64},1}:
 1
 2
 #NULL
 #NULL
 3
 #NULL

But data scientist has to do (or there is some simpler way?):

julia> [isnull(x) ? missing : get(x) for x in tryparse.(Int, ["1", "2", "?", "", "3", "unknown"])]
6-element Array{Any,1}:
 1
 2
  missing
  missing
 3
  missing

which not only is messy but also has wrong type (Any) of the produced array.

I would like to have this fixed (I understand that solving Any problem is an additional topic discussed in https://github.com/JuliaLang/julia/pull/24332).

nalimilan commented 6 years ago

I see your point. I'm not sure we want to follow R in all aspects, in particular we have the ability to distinguish an invalid value from a missing value, so it wouldn't be absurd to be stricter, and require an explicit operation to convert an unparsable value to a missing value. That could be done quite easily using replace (see https://github.com/JuliaLang/julia/pull/22324). Adding a function as you suggest also makes sense, but I'd rather wait for the design of tryparse to settle first (JuliaLang/julia#23642), we can easily extend it later.

bkamins commented 6 years ago

Thanks.

alyst commented 6 years ago

I also don't think parse() should return missing for invalid inputs.

But somewhat more conservative thing would be to allow

parse(Int, missing) = missing
#= or =# 
parse(Union{Int, Missing}, missing) = missing

I prefer the first alternative as the 2nd might imply: parse(Union{Int, Missing}, "missing") == missing.

Then it would be possible to parse.(Int, ::AbstractArray{Union{String, Missing}})

quinnj commented 6 years ago

The approach taken in CSV.jl, which @shashi and I have discussed generalizing to parsing functions in general is to have parse(Union{Int, Missing}, str) => Union{Int, Missing} and parse(Int, "missing") => throw(ParsingException()).

alyst commented 6 years ago

@quinnj Do you also throw ParsingException() on parse(Int, ::Missing)?

quinnj commented 6 years ago

I don't currently have a use-case for that in CSV as there's always guaranteed to be an input, though the input may be an empty string.

nalimilan commented 6 years ago

So you are using missing to indicate that parsing failed? Basically what tryparse does with Nullable() (and will soon do with nothing)? I think it'd prefer using parse(Union{Int, Void}, ...) for that, to distinguish clearly between "invalid" and "missing".

quinnj commented 6 years ago

@nalimilan was that in response to my comment or @alyst ?

nalimilan commented 6 years ago

To yours.

quinnj commented 6 years ago

No, the user has to provide, in the case of CSV, a "null" string and if the input matches this "null" string, then missing is returned. In the case of failure, e.g. non-numeric digits when parsing a number, an error is always thrown. That leaves try-catch behavior up to the caller, who would know what to do in case of failure anyway. I supposed you could keep a tryparse function that returned nothing instead of throwing.

pdeffebach commented 6 years ago

A user ran into this with Dates recently, where Dates throws an error on parsing missings.

https://github.com/JuliaLang/julia/issues/28570

I would recommend a keyword argument, but if a constructor relies on parsing, we have to trust that they forward keyword arguments as well.

nalimilan commented 6 years ago

Since Missing now lives in Base, this feature proposal should be discussed there.

xpe commented 5 years ago

@nalimilan Right. Has a proposal / discussion started over there? Let's link to it if so.

bkamins commented 5 years ago

Actually I would close it (although I know I have raised the issue in the first place).

The current meta-understanding of types in Base is the following AFAICT:

I would argue that tryparse follows this rule and returns nothing on error - as the proper value does not exist. And I would keep this approach for consistency. Of course this is a bit debatable as the distinction here is thin.

But maybe there are cases when would be convenient to return missing - if yes we could keep discussing it.

nalimilan commented 5 years ago

Yes, the case for parsing to Union{T,Missing} seems to be that you want a sentinel value (e.g. "missing", "" or "NA") to be converted to missing. But invalid values should not be converted to missing. I imagine we could have something like parse(Union{T, Missing}, x, missing="missing").

I'm not aware of any issue in Base about this currently.

bkamins commented 5 years ago

Yes - the sentinel approach could make sense. It could be also more general, e.g. allowing a Regex or a predicate.

scls19fr commented 5 years ago

Having a parse (or tryparse) function which can output missing will greatly improve workflow with DataFrame. See https://github.com/JuliaData/DataFrames.jl/issues/1854

bkamins commented 5 years ago

@scls19fr - as you can see from the discussion above the issue is not that simple. What syntax would you think would be the most convenient while consistent (the consistency issue was a major concern in the past).

pepelovesvim commented 4 years ago

I think one should output nothing when using parse/tryparse on missing/nothing data. Since it's tryparse is already outputting nothing on invalid strings. Especially since nothing is suppose to represent "does not exist", so parsing missing/nothing should give a result that "does not exist".