JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.58k stars 5.48k forks source link

Date and DateTime support for Missing is missing #28570

Closed schnirz closed 4 years ago

schnirz commented 6 years ago

Currently, there is no support for missing values when it comes to working with Date and DateTime.

For example, the Date constructor throws an error when called on a missing input value:

Date(missing, "YY-mm-dd")

MethodError: Cannot `convert` an object of type Missings.Missing to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.

Stacktrace:
 [1] Date(::Missings.Missing, ::String) at ./dates/types.jl:312
 [2] include_string(::String, ::String) at ./loading.jl:522

Similarly, computing the difference between two dates fails of one of them is missing:

Date(2018,8,10) - missing

MethodError: no method matching -(::Date, ::Missings.Missing)
Closest candidates are:
  -(::DateTime, ::Missings.Missing) at In[112]:8
  -(::Date, ::Base.Dates.Day) at dates/arithmetic.jl:78
  -(::Date, ::Base.Dates.Week) at dates/arithmetic.jl:76
  ...

Stacktrace:
 [1] include_string(::String, ::String) at ./loading.jl:522

Presumably, all unary and binary operators need to be overloaded, either in Base or in Dates. Any ideas and suggestions for how to address this issue would be more than welcome.

pdeffebach commented 6 years ago

It's worth noting that other constructors don't have overloads for missing

Int(missing)
>ERROR: MethodError: Cannot `convert` an object of type Missings.Missing to an object of type Int64
This may have arisen from a call to the constructor Int64(...),
since type constructors fall back to convert methods.

String(missing)
> ERROR: MethodError: Cannot `convert` an object of type Missings.Missing to an object of type String
This may have arisen from a call to the constructor String(...),
since type constructors fall back to convert methods.

So it is likely that the fix is to add an overload just to the - function and whatever other unary operators are defined for Dates, but leave the parsing up to the user.

schnirz commented 6 years ago

I think that there is a case to be made for overloading Date and DateTime constructors.

As an example, consider a user is trying to parse a an array or a column of a data frame containing strings as well as missing values to dates. Asking the user to explicitly handle missing values for this particular case while all other operations handle missing values quietly without the user having to explicitly deal with them would constitute an inconsistent user experience.

pdeffebach commented 6 years ago

Could you clarify which operations handle missing values quietly? For constructors. If, for example, you have an array ["4", "5", missing, "7"], you wouldn't have anything work automatically with parsing missings.

Deliberate handling of missings is something Julia makes you more sensitive to than in other languages. For instance this is (probably) recommended behavior.

df2 = @linq df |>
       transform(date = [ismissing(x) ? missing : Date(x) for x in :a])

This might be a pain, but there is a lot to be said for this really being a feature because it encourages people to be aware of how missing values propogate.

R constructors do propagate missing values however, so this is new for people coming over from there.

pdeffebach commented 6 years ago

But we should still work on a PR for the Unary operators!

pdeffebach commented 6 years ago

Interestingly, it looks like someone snuck in a bunch of broadcasting behavior that should probably throw errors in 1.0

https://github.com/JuliaLang/julia/blob/f8d82ccba81223c9d0d43e4077acb2a208bedc7a/stdlib/Dates/src/arithmetic.jl#L94

(-)(x::AbstractArray{T}, y::T) where {T<:TimeType} = x .- y
(-)(y::T, x::AbstractArray{T}) where {T<:TimeType} = y .- x

We might want to learn more about that before a PR to avoid having to write out so many overloadings of unary operators.

Posted on discourse here

nalimilan commented 6 years ago

I don't think we should have a special case for dates: either all constructors should accept missing, or none of them. Since it would be weird to have a constructor return an object not of the requested type (as @pdeffebach) noted, I don't think it would be appropriate to accept missing.

However, we should make it easier to enable functions/constructors to accept missing and return missing, an operation which is often called "lifting". See https://github.com/JuliaLang/julia/pull/26661.

pdeffebach commented 6 years ago

Could you clarify more about the functions issue? Should I file a distinct issue for + and - with dates, even if we don't touch the constructor? Or should we start idiomatically using lift?

It seems like a lift statement everywhere is at odds with the idea of missings propagating easily through basic operators.

nalimilan commented 6 years ago

Yes, + and - should (most probably) propagate missing for all types. Reopening so that we track this part of the original description.

schnirz commented 6 years ago

I'm still not 100% convinced that accepting missing values for the Date and DateTime constructors is problematic, since these constructors are used differently from most other constructors (for most types using the parse function seems to be the way to go from my experience).

I am, however, ok with using a lift statement/functor. I have, in fact, done so myself recently for dealing with missing input to DateTime and it worked just fine. It would seem necessary to provide a generic lift statement via Base.missing and to provide some documentation around this approach to explain to new users that this is the way you are supposed to handle missing values when a function/constructor doesn't explicitly handle them itself.

I also agree with pdeffebach that this approach, while feasible, is at odds with the promises made when missing was announced as the new de facto missing value type in Julia. As a matter of fact, the need to lift values was levelled as a complaint against Nullable or Option types and was presented as a motivation for the new missing type! (link)

KristofferC commented 6 years ago

As far as I recall, the argument was that a few basic operations would propagate missing but for anything other than that one would manually lift.

pdeffebach commented 6 years ago

As far as I recall, the argument was that a few basic operations would propagate missing

Without the convert and constructor change we might have been able to include parse and convert in our "list of basic functions" that propagate missing. However I think that overloading parse and convert wouldn't propagate seemlessly now, right?

I think lift is a good idea. Given that missing is in base, a ? syntax would be nice.

nalimilan commented 6 years ago

Without the convert and constructor change we might have been able to include parse and convert in our "list of basic functions" that propagate missing. However I think that overloading parse and convert wouldn't propagate seemlessly now, right?

I don't think the removal of the constructor fallback on convert changes anything. convert(Union{T, Missing}, x) should work fine, what we need is a short syntax to write Union{T,Missing}. Regarding parse, we could also define methods so that parse(Union{T,Missing}, x) works, possibly with an argument (keyword?) to specify which string(s) should be considered as missing. See discussion at https://github.com/JuliaLang/julia/pull/25131 and linked issue.

pdeffebach commented 6 years ago

Thanks for the feedback.

Taking Dates as the example here. Overloading parse only works If Date() exclusively calls parse and convert, and not _tryparse or other custom functions. Or we establish best practices where one should always use parse when creating dates. But that seems less than ideal.

But we can't expect to track down every intermediate function that a constructor calls if it isn't just parse all the way down. We would have to use lift or something similar.

schnirz commented 6 years ago

If lift is the way to go, then I suppose only the binary operators in Dates need to be overloaded in order to be consistent with binary operators defined in Base, correct?

Also, +1 for the ? syntax for lift!

nalimilan commented 6 years ago

Taking Dates as the example here. Overloading parse only works If Date() exclusively calls parse and convert, and not _tryparse or other custom functions. Or we establish best practices where one should always use parse when creating dates. But that seems less than ideal.

But we can't expect to track down every intermediate function that a constructor calls if it isn't just parse all the way down. We would have to use lift or something similar.

Yes, my idea is that people will need to use parse(Union{Date,Missing}, x) or lift(Date)(x) (depending on whether missing values are represented as "missing" or missing, respectively), not that the Date constructor would propagate missingness.

schnirz commented 6 years ago

I personally prefer lift(Date)(x) over parse(Union{Date,Missing}, x). I have never used parse for Date orDateTime before. Also, the documentation only discusses parsing Date and DateTime via their respective constructors, so expecting people to parse Date and DateTime using the parse function when missing values are a concern seems odd.

quinnj commented 6 years ago

So are we going to do anything here? Define - and +? Or leave things to lifting?

StefanKarpinski commented 6 years ago

This seems to demand a more general approach to me, rather than adding more and more methods for missing all over the place? When would it end? Never.

JeffBezanson commented 6 years ago

As a matter of fact, the need to lift values was levelled as a complaint against Nullable or Option types and was presented as a motivation for the new missing type!

This is a slight confusion of terms --- lifting values in the case of Nullable or Option types means you can't just use e.g. an Int. You have to use Nullable(2) instead of 2. Lifting functions, on the other hand, refers to wrapping a function so that it propagates missing values. The need for that is common to both Nullable/Option and the Missing approaches.

Since + and - support missing for other types, it seems reasonable to define them for Dates too.

nalimilan commented 6 years ago

I agree, once we have decided that a given operator should propagate missing values, we should follow this behavior for all types that implement it. The question of which operators should propagate missing is a separate one, and indeed something like lift (https://github.com/JuliaLang/julia/pull/26661) would be useful for those which don't propagate.

pdeffebach commented 6 years ago

This seems to demand a more general approach to me, rather than adding more and more methods for missing all over the place? When would it end? Never.

Take a look at an issue here where I (misguidedly) bring up missings in NaNMath.jl. The package maintainer fairly says that they don't anticipate their package being used in a context where missing values would be a concern.

That is to say, the space of "values that should always propagate missings" might be quite small. But of course means we would inevitably have gotchas where we have to say "No, that type doesn't support missings, use lift." The inconsistency would be tiring.

nalimilan commented 6 years ago

AFAICT NaNMath is a different question, as NaNMath.mean is a completely separate function from Statistics.mean. Better discuss that in the relevant issue.

ExpandingMan commented 5 years ago

I've just rather unexpectedly run into the lack of methods for + and - involving TimeType.

Is there any consensus around this? I can make a PR if everyone agrees to support it for + and -. I'm certainly open to arguments about why we should not, but as the lack of these methods came as a surprise to me today, I suspect they'll come as a surprise to many other people as well.

EconometricsBySimulation commented 5 years ago

@ExpandingMan I think there are some good methods using + or -. Only you need to make whatever type you are adding is a date type such as now()+Day(1). I think the reasoning is sound since you might want to add a second or a minute to the date/time. Whether it should be implied what the user is trying to do when you have a date object plus an integer might be a reasonable suggestion. Date(now()) + 1 Of course it is easy to add such methods: Base.:+(x::Date, y::Integer) = x + Day(y)

nilshg commented 4 years ago

I'm in the same boat as @ExpandingMan here - I've actually been using passmissing(i)(x, y) for a while, but it seems uncontroversial (and largely agreed in this thread) that Date(2020, 8, 26) + missing should return missing?

Are there any reasons not to add this?