JuliaLang / Juleps

Julia Enhancement Proposals
Other
67 stars 24 forks source link

[WIP] Nullable Redesign #21

Closed johnmyleswhite closed 7 years ago

johnmyleswhite commented 7 years ago

This draft describes the core ideas behind a redesign of nullables. It will require substantial elaboration before we have a final design, but my goal is to get this out in front of people while we are still debating the remaining points of uncertainty in the design. The primary source of remaining uncertainty are the arguments in favor of Union{T, Void} vs Union{Some{T}, Void}.

andyferris commented 7 years ago

my goal is to get this out in front of people while we are still debating the remaining points of uncertainty in the design. The primary source of remaining uncertainty are the arguments in favor of Union{T, Void} vs Union{Some{T}, Void}.

I'm taking this as requests for feedback/opinions (apologies if I misunderstood).

I would advocate strongly for Union{T, Null{T}} because I feel it would help enormously with writing generic methods. If my method receives a Null{Vector{Int}} at least I know what's the element type of the vector that I was supposed to get, and that might help me to arrive at the correct type of my output. Having Void here doesn't help me predict what the user wants me to do, and it doesn't help me to write different methods of a function that might have slightly different semantics depending on input types.

I also feel that Void and nothing already have a meaning, in terms they are the return value of the thing the compiler doesn't have a return value for - like the result of a for loop, or an empty function body, or an if false statement with no else branch. Having to diagnose whether an unfamiliar function returns nothing on purpose or just because it just happened to be a "null" value for my chosen input would add unnecessary complication.

I see you also mentioned #undef. How would this work? Ideally, I guess we could make the GC aware of how Union works. This could be pretty compelling... but we probably need to solve the below:

There is one thing that remains to confuse me. Currently we can have Nullable{Any}. However Union{Any, Void} = Any (and so-on). I don't think we want to change these properties. So what, for example, would a Nullable{Any}() or an #undef of Any become?

johnmyleswhite commented 7 years ago

Thanks for the comments, Andy!

I should point out that the Union{T, Null{T}} implementation is not really seriously being considered as far as I understand the growing consensus. Instead, that implementation is being documented to demonstrate why it offers few benefits over our current implementation.

On that note, I think you're misunderstanding the way Union{T, Void} would work. That's really helpful for me to see because I'll try harder to ensure that revisions of this Julep communicate this point more clearly.

If we use Union{T, Void}, we'll be abandoning the zero-or-one element container way of thinking of Nullable{T} and instead we'd be using what is essentially a tagged union that always contains exactly one value: either a value of type T or a value of type Void. We can think of this as a container to make the use of broadcast more reasonable, but we're mostly taking about broadcast because it provides minimal syntax for lifting.

Will address your other comments tomorrow morning.

andyferris commented 7 years ago

Thanks @johnmyleswhite - that clarifies things immensely. So we would have something which roughly has this structure:

immutable Nullable{T}
    value::Union{T,Void}
end
size(::Nullable) = ()
getindex(n::Nullable) = n.value
eltype{T}(::Nullable{T}) = Union{T,Void} # ??
isnull(n::Nullable) = isa(n.value, Void)

and then use broadcast, "lifting", and other machinery to push the nullables through the system?

andyferris commented 7 years ago

I should point out that the Union{T, Null{T}} implementation is not really seriously being considered as far as I understand the growing consensus. Instead, that implementation is being documented to demonstrate why it offers few benefits over our current implementation.

Of course this:

immutable Nullable{T}
    value::Union{T,Null{T}}
end

offers few benefits, and if that is what you meant, then I agree entirely. To be clear, I was suggesting to use "bare" nullables with no wrapper type, to replace DataArrays.NA with Base.Null{T}, and to remove Base.Nullable entirely (except maybe as a typealias for a Union).

johnmyleswhite commented 7 years ago

Revised the document. Still a bunch more to do (and a few comments left to respond to), but I'm now almost completely convinced we'd want to make ?T mean Union{Some{T}, Void} since the benefits from having T and ?T represent disjoint sets seem to outweigh other considerations.

tpapp commented 7 years ago

Having run into the "counterfactual return type problem" a few times, I would very much like having a Union{Some{T}, Void}. However, I am wondering if the language features (with the exception of syntactic sugar) and compiler optimizations would generalize so that users would be able to efficiently implement similar constructs and related machinery (lifting etc). To make things concrete, I am thinking of types like Union{Some{T},Err{E}} following result types in Rust. Note that I am not suggesting that Base does this, just that libraries can experiment with such constructs and get efficient code.

andyferris commented 7 years ago

Yes, the plan would be to make all Union{A,B} types as fast as they can be as a generic compiler optimization, so packages should be able to experiment with whatever return types they prefer (of course, it would be nice if the ecosystem would play nicely together).

datnamer commented 7 years ago

Do you think this would happen pre 1.0?

johnmyleswhite commented 7 years ago

If this doesn't happen 1.0 it's hard for me to imagine it changing later on since we'll get stuck with a lot of design decisions. But it's ultimately @vtjnash's call since he has the expertise to make this happen that I lack.

vtjnash commented 7 years ago

I'm hoping to (and working on) getting thr start of this into v0.6

TotalVerb commented 7 years ago

@nalimilan mentioned in #20502 that Union{T, Null}, with special support so that f(::Null) = Null() is automatically defined for every function, may help resolve the difficulties of #20502.

This is not currently discussed here (though possibly it has been discussed in side channels), but is definitely worth consideration.

johnmyleswhite commented 7 years ago

Can we get input from @JeffBezanson on whether or not he's willing to allow a universal function definition of f(x::Null) = Null() for all f?

quinnj commented 7 years ago

I'll add in a quick 2 cents: a while back, @StefanKarpinski and I were discussing this a little and kind of came to the conclusion that this should be ok, since we're basically introducing an additional form of missingness/nullability into the language, apart from nothing::Void. In that way, nothing continues to be nothing, no behavior changed and it remains the "language-level" form of nothingness, with very little behavior defined on it.

null::Null, however, would be come a more substantial form of missingness with more behaviors defined.

Not sure if that actually helps the discussion any, but it seemed to clear things up in my mind in that we're not trying to mess w/ nothing and how it currently works in the language, just introduce an additional form of nothingness w/ more behavior.

TotalVerb commented 7 years ago

Let me add something to ensure that all potential questions have answers.

One difficulty with that approach is distinguishing when an f(x) = ... definition intends to lift over x::Null or otherwise. For instance, we might want string(Null()) == "Null()". Now if someone defines

foo(x) = length(string(x))

is it clear whether this should be Null() or 6 for foo(Null())?

TotalVerb commented 7 years ago

Unrelated to the current matter of discussion; but one thing we should do in 0.6 if at all still possible, is deprecate using the ternary operator without spaces. Otherwise things like

@foo x ?Int :sym

will parse in an unexpected way.

TotalVerb commented 7 years ago

There are also other functions which we must consider Null to not lift over. I wonder if maintaining such a list of functions might be just as much work as maintaining a list of functions it should lift over.

And then of course a list of functions where the lifting behaviour is ambiguous

I am sure this list is incomplete; but it must be something to consider.

ararslan commented 7 years ago

The more I think about it, the more I think we shouldn't lift by default; cases where f(Null) should not be Null seem to be equally prevalent if not more so than those where it should be Null. (Plus, I don't think Jeff has changed his stance on automatic lifting for Base functions.) It's easy enough to determine the subset of Base functions which should provide lifting, and packages can opt to provide functionality for their functions as well. But making a broad assumption on what f(Null) should be seems a little dangerous.

TotalVerb commented 7 years ago

There seems to be some consensus recently of Nullable going back to what it used to be, with NAType, except now with compiler support so it's faster. I wonder if the name should also change to NAType, as this seems in some ways to be moving away from what is called "null" in other programming languages.

ararslan commented 7 years ago

I don't think it matters too much how we spell it so long as the behavior is well defined and well documented.

andyferris commented 7 years ago

I also think "universal" default definitions (lifting) for Null would be undesirable. I think there are many situations where an error is better than continuing to propagate the Null.

I still believe that Null{T} would be a nicer semantic. Some functions support a wide range of signatures and not having the T means it's ambiguous which method to correspond with (and there may exist important semantic differences between the different method signatures). I feel that would be a real problem and become a major pain point when you are doing anything more complex than basic operations like +. The other nice thing about this that you can (safely) define operations on Null{T} such as &(x::Bool, ::Null{Bool}) = x ? Null{Bool}() : false for e.g. 3 valued logic (and users can do the same safely for a variety of types). Without the T then we should assume the Null value might be of Any type, so I wouldn't like to see 3-valued logic implemented with &(::Bool, ::Null) since users are free to overload the behavior of &(::Bool, ::T) for their own types T.

However, to get this semantic of a missing value of type T, we might benefit from some compiler support (specifically, from the type system). Ideally, Null{T} would be covariant in T to make working with non-concrete types easier and more consistent. For example, if Null{Int} represents some missing value of type Int, then I wonder if this missing value should also be considered of type Null{Integer}? That is, given that types represent sets of possible values, for every (concrete?) type T we create the nullable set by adding an extra member of the set Null{T}() which represents a missing value (let's name here the combined set as type Nullable{T} = Union{T, Null{T}}, though I think we should find a new name like ?T or something). We could imagine the set Nullable{S} (where S is an abstract supertype of T) as including the value Null{T}() (as well as Null{S}()?). (Does this make any sense, @JeffBezanson? no it does not)

If we don't get this behavior then a signature like f(::Nullable{Integer}) would fail for Null{Int}() and I do not believe that accurately captures the intentions of the author of f. If we did implement this covariant behavior, we do get that Null (being Null{T} where T) and Null{Any} are more-or-less equivalent (as may be Any and Nullable{Any}...), so in many cases ::Null would still be useful to put in signatures and there shouldn't be undue burden on users/package authors unless they actually require the type specificity to make dispatch consistent.

EDIT: What I said here about covariance of Null{T} is rubbish. Null{<:T} seems to have all the desired properties, e.g. Null{<:Integer} includes Null{Int64}(), Null{Unsigned}(), Null{Integer}(), Null{Union{UInt8, Int8}}(), etc.

ararslan commented 7 years ago

@andyferris I don't see how Null{T} would differ significantly, or in any beneficial way, from the current Nullable{T}.

Some functions support a wide range of signatures and not having the T means it's ambiguous which method to correspond with

That's why you define proper promotions in Base and locally define f(::Null) to be whatever makes sense. If not knowing the type when calling a method on a null value is a problem, I think that's more indicative of a design flaw.

andyferris commented 7 years ago

If not knowing the type when calling a method on a null value is a problem, I think that's more indicative of a design flaw.

Sorry, some functions are overloaded with lots and lots of methods on purpose, e.g. try methods(+). Making use of multiple dispatch is the whole point of Julia, and null values of certain types will follow distinctive propagation rules. If we allow for e.g. 3-valued logic,

&(x::Bool, ::Null) = x ? Null() : false
&(::Null, y::Bool) = y ? Null() : false
&(::Null, ::Null) = Null()

and another user creates a type T which

then they will not ever be able to stop &(false, Null) from being false::Bool instead of some value of T or Null.

I would prefer we didn't add things which constrain multiple dispatch and user's ability to overload methods, whenever they decide they also need to use missing values. In the example above, the user would need to define another &-like function instead of adding a method to &. It seems, to me, to be against the Julian way.

I do agree that for typical, database stype operations, you're not going to run into this esoteric set of conditions. But the fact is that we can foresee circumstances where using the type and dispatch system more strongly will lead to bad interactions with Null.

andyferris commented 7 years ago

In short terms, if a user defines T such that &(::Bool, ::T) -> T, they might expect to be able to enforce that &(::?Bool, ::?T) -> ?T. However, if anyone implements 3-valued logic on Bool, they won't be allowed to do that.

ararslan commented 7 years ago

That doesn't seem like a problem to me; that's just kind of how I expect Unions to work. The behavior is clear when you think in terms of the methods defined for the given function. We just have to make sure the behavior is well documented, then people will know they can't expect to be able to enforce Union{Bool, Null} & Union{T, Null} = Union{T, Null}.

andyferris commented 7 years ago

That doesn't seem like a problem to me; that's just kind of how I expect Unions to work.

Of course, this is exactly how Union{T,Null} would work. I was just suggesting that Union{T,Null{T}} would follow the same Union rules but is more flexible for users.

Do note that the covariance idea means that 99.9% of the time you can just use Null and ?T for dispatch, and any users that want Union{Bool, Null{Bool}} & Union{T, Null{T}} = Union{T, Null{T}} can still get that. It just seems like a more flexible solution, is all. The other problem with untyped Null and ?T = Union{T,Null} is that dispatch would have many ambiguities if users use ::?T in signatures. I can imagine that I'd like to write:

f(x::?T1) = if isnull(x) ? #= do a =# : #= do b =#  # note that branch is elided by compiler!
f(x::?T2) = if isnull(x) ? #= do c =# : #= do d =#
f(Null) # ambiguous

That basically means it's never safe to dispatch on ?T since two users/packages (who might not be collaborating) might both overload Base.f(::?T) for same f and different T, and end up with this ambiguity problem. This seems (to me at least) as a major flaw.

I don't see how Null{T} would differ significantly, or in any beneficial way, from the current Nullable{T}.

I'm arguing that there is (or that there might be) a benefit for Null{T} over Null. I'm wondering what people see as the downsides (especially w.r.t. dispatching on ?T)?

EDIT: please replace "covariance idea" with "using Null{<:T}" when reading the above.

andyferris commented 7 years ago

I don't see how Null{T} would differ significantly, or in any beneficial way, from the current Nullable{T}.

I'm arguing that there is (or that there might be) a benefit for Null{T} over Null. I'm wondering what people see as the downsides (especially w.r.t. dispatching on ?T)?

Or to answer your question another way, the only problem I saw with Nullable was that it is a container. Null{T}() is not a container, it's a value, just as Null() would be.

What are the problems that Null{T}() inherits from Nullable{T}()? Is it that you might need to sometimes use promote_op and so-on? (I actually think you won't need that nearly as much for Null{T} as you would for Nullable{T}, since you can just treat it as a value. Only containers of ?T would use promote_op just as they do for T itself).

johnmyleswhite commented 7 years ago

The relevant problem with Nullable{T} is also the problem with Null{T}: they're parametric types. Being a container or a union is not the problem: the problem is computing the type parameters of a parametric type in the absence of type inference.

andyferris commented 7 years ago

The relevant problem with Nullable{T} is also the problem with Null{T}: they're parametric types. Being a container or a union is not the problem: the problem is computing the type parameters of a parametric type in the absence of type inference.

Actually it's only parametric containers that rely on inference.

You can for instance define +(::Null{Int32}, ::Null{Int64}) = Null{Int64}(). No inference required. Just dispatch.

You could use promote or promote_op(+, ...) to have a generic method for +(::Null{T1}, Null{T2}), but unlike the container situation, you are not forced to, and I would suggest we don't use promote_op (promote might be OK, wherever it is already used?).

johnmyleswhite commented 7 years ago

I give up. I'm done having these debates when no one involved is paying me for my time.

andyferris commented 7 years ago

John - please don't take me the wrong way. We all really appreciate the effort that is going into this - I think it is really fantastic!

(I'm only trying to bounce some ideas around and have a technical discussion about some things which I thought might have been overlooked)

TotalVerb commented 7 years ago

@johnmyleswhite I too greatly appreciate the effort that has gone into this Julep. Thanks for all your hard work; it's truly important for building a strong foundation for 1.0.

@andyferris There is a flaw, in my opinion, with the covariant Null{T} idea. It makes the assumption that every type on its own represents a coherent algebraic structure, to which we can add a unique null type. By making Null{T}() an instance iff T is concrete, we lose the ability to adjoin a null element to algebraic structures which are, for whatever reason, represented by an abstract type.

Consider for example

abstract type Symbolic <: Real end
struct SymInteger <: Symbolic; val::BigInt; end
struct Reciprocal <: Symbolic; of::Symbolic; end

and imagine that, conventionally, Symbolic is used to denote some real number represented symbolically in this fashion. Then, what we desire is a Null{Symbolic}() constant. But with Null{T}() existing iff T concrete, what we instead have are two disjoint Null{SymInteger}() and Null{Reciprocal}() types, which does not seem fun to have to deal with.

andyferris commented 7 years ago

@TotalVerb I think you might be right (I think there was a ? in my post next to the "concrete"). Can we have consistent covariance (Null{SymInteger}()::Null{Symbolic}) as well as a Null{Symbolic}() value?

TotalVerb commented 7 years ago

Hmm, that would be unusual compared to anything else in the language (it would be a type which has an instance, but also has subtypes), but I suppose there is nothing preventing its implementation. Moving on to practical issues though, as mentioned in the Julep, consider

function map(f, x::Nullable{T})
    if !isnull(x)
        return Nullable(f(get(x)))
    else
        return Nullable{Core.Inference.return_type(f, (T, ))}()
    end
end

If we rewrite this with Union{T, Null{T}} we obtain

map(f, x::Any) = f(x)
map{T}(f, x::Null{T}) = Null{Core.Inference.return_type(f, Tuple{T})}()

which not only has an inference dependency, but also has a strange definition of map. It seems like we would be giving up on map entirely with Union{T, Null{T}}. This would be a departure from Scala, Java, Rust, F#, Swift, Kotlin, {Ca|Standard}ML. While I'm not saying that is necessarily a bad thing, we should consider why map is considered useful in such languages, and see if we have a replacement for those use cases.

TotalVerb commented 7 years ago

An instructive exercise, perhaps, is to find some concrete examples of code that uses Nullables in practice, and consider how it would be written using each suggested interpretation. The applications in NullableArrays are well-known and of course well-studied, but we should consider other applications also.

Let me get the ball rolling with some parts of Base Julia that use Nullable or an equivalent type.

I suspect that a lot of these use cases will be relatively independent of which interpretation is chosen, but we should also consider package code that makes use of nullable.

andyferris commented 7 years ago

Hmm, that would be unusual compared to anything else in the language (it would be a type which has an instance, but also has subtypes), but I suppose there is nothing preventing its implementation.

Actually that does seem like a really big departure from the current type system.

If we rewrite this (map) with Union{T, Null{T}} we obtain

Do people need map(f, x::Union{T, Null}) for values? If T is not a container; you just call f(x). Of course, you can do map(sin, 1) and for the null case we can make Null an iterable of length 1, just like Int.

If T is a container such as T <: AbstractArray, say, then we'd need an extra method for map for that, and it might invoke inference on the element types if that is what it does now for empty arrays anyway... seems equivalent.

TotalVerb commented 7 years ago

I think you're right. map is for containers; Union{T, Null{T}} is not a container and so makes it not generally necessary.

However, this goes back to the original issue of when functions should be lifted over null arguments. If I want to compute f.(x::Nullable{T}), whether or not the person who defined function f defined f(::Null{T}), there should still be a way to do that; otherwise we are losing the functionality of map. This is, modulo the inference dependency, I think fairly easy to resolve:

lift(f, x) = f(x)
lift(f, x::Null{T}) = Null{C.I.return_type(f, Tuple{T})}()

However, because there is no more Some{T} tag indicating that something was part of a Nullable, this means that we need to use a new syntax, and can no longer reuse the broadcast dot syntax.

andyferris commented 7 years ago

Thank you very much @TotalVerb for helping me figure out the trade-offs here. It seems true that the Null{T} approach does seem to preclude "lifting" methods (or entire functions) without performing a type inference step.

My general approach here has been to see if sufficiently clever dispatch would mean lifting is not necessary - and demanding that users/package authors provide the necessary method. I will concede that this might not be very user-friendly, especially in the case that the methods are coming from somebody else's package. In any case, I would suggest providing a range of common methods for Null or Null{T} so that lifting is relatively rare. And, if lifting is relatively rare it (speculatively) seems not too ridiculous to invoke inference in these cases? I dunno.

Anyway, it seems that Null is better suited for an explicitly annotated lifting approach and Null{T} is more geared towards dispatch.

andyferris commented 7 years ago

By the way, everything I was saying about the covariance of Null{T} is rubbish. In Julia v0.6 it's dead easy to express Null{<:Integer} which includes Null{Int64}(), Null{Unsigned}(), Null{Integer}(), Null{Union{UInt8, Int8}}(), etc.

Thus I would suggest having ?T = Union{T, Null{<:T}} (or whatever the syntax for the left-hand side here becomes). Then dispatch for f(::?Integer) would be quite sensible. No special compiler/type-system treatment required.

nalimilan commented 7 years ago

My general approach here has been to see if sufficiently clever dispatch would mean lifting is not necessary - and demanding that users/package authors provide the necessary method. I will concede that this might not be very user-friendly, especially in the case that the methods are coming from somebody else's package. In any case, I would suggest providing a range of common methods for Null or Null{T} so that lifting is relatively rare. And, if lifting is relatively rare it (speculatively) seems not too ridiculous to invoke inference in these cases? I dunno.

@andyferris This approach has already been tried with vectorized methods, and it was considered as a design flaw. In 0.6 we've been able to move away from it by using the f.(x) syntax for element-wise operation. So we've followed that route already and we know it won't work. If a user needs to pass a null values to a function provided by a package which doesn't support it, s/he shouldn't need to rewrite the function. I think repeating these discussions once again is one of the reasons John closed the PR.

Also the issue with inference is a deep one. Even if we get rid of map support for nullables, lifting will have by definition to infer the T for Null{T} if we make it parametric. It's not OK for it to fail in some cases, as these cases will behave differently from others if you dispatch on the inferred type. What's the point of opting for Null{T} if we can't rely on it? Note that the case of empty collections is considered as fragile, but we haven't found a better solution. Making it the foundation of nullables isn't the best idea to write a robust system.

That basically means it's never safe to dispatch on ?T since two users/packages (who might not be collaborating) might both overload Base.f(::?T) for same f and different T, and end up with this ambiguity problem. This seems (to me at least) as a major flaw.

Yes, and that shows the only solution in that case is to have a common definition f(::Null) = Null(). That's why I suggest we do this by default for all functions.

(The example of 3VL & and | is not representative at all, since these are the only two functions I know for which we want lifting, but not with the standard lifting semantics. So let's not base the whole system on this special case, for which it could be OK to define methods manually.)

Finally, I'm not sure we need so many functions not to be lifted. Functions for which passing a nullable makes no sense should generally not be lifted, and people shouldn't pass nullables to them. Why would one call lpad(Null()) or run(Null()) and expect not getting Null() as a result?

Anyway, we could start without a default lifting fallback and see how it goes. At least, Union{T, Null} would be as functional as DataArrays or Nullable, but more convenient than the latter since non-null values would not even need unwrapping.

andyferris commented 7 years ago

Thank you so much @nalimilan for your wonderful and very detailed response. I have certainly learned a lot today from this discussion. I had been interested in the development of Nullable and for quite some time have been following the public discussions, issues and this PR as best I can, yet I always was left with (central) one question: "why use lifting with Union-based nullables?" (as opposed to dispatch, which does require extra method definitions and certainly inconvenience, but nothing worse than e.g. C++ would be). I am very glad for receiving a full and direct explanation.

In any case, I feel that having this discussion has been valuable exercise. This certainly is a tricky design problem, and there are clearly rather few available options which satisfy e.g. determinism w.r.t. inference and user-friendliness.

@johnmyleswhite I'd like to apologize to you directly - as I alluded to earlier, I have been very impressed with the leadership shown with this PR and, in fact, have been looking forward to seeing this PR merge. It was never my intention to derail this process. However, I think it's reasonable to say this Julep has become a focus point for the future of Nullable and I assumed it would be an OK place to check my understanding and assumptions about this design, clear up some misunderstandings, float ideas, and in general, to come to a consensus. Thank you for your efforts - and I still believe this Julep process has been, and will continue to be, very worthwhile.

nalimilan commented 7 years ago

I think we should merge this PR and add the missing details later.