JuliaStats / NullableArrays.jl

DEPRECATED Prototype of the new JuliaStats NullableArrays package
Other
35 stars 21 forks source link

Discussion: lifted operators #20

Open davidagold opened 9 years ago

davidagold commented 9 years ago

This issue is devoted to the discussion of lifted operators: both their design and implementation details, as well as the grounding philosophy that ought to inform our treatment of them in this package.

By "lifting" an operator f we mean employing some means of "dispatching" f or a related operator on Nullable objects, possibly even in the case where a method for f that takes Nullable arguments is not explicitly defined.

Slightly more formally, Let f(x::T) be an operator method, and let U = Union{U_1, ..., U_n} denote the range of possible return types of f(x::T). We might say that a lift of f(x::T) is a method that:

  1. has argument signature (x::Nullable{T})
  2. Handles null arguments in some appropriate fashion.
  3. makes the following diagram commutative:
                     lift(f)
Nullable{T}    >  >  >  >  >  >  >    Nullable{U}

^                                          ^
^ Nullable                                 ^ Nullable
^                                          ^
                        f
T              >  >  >  >  >  >  >         U

I think it would be best to restrict the discussion in this issue to lifting core mathematical and comparison operators; I intend open another issue to discuss how we ought to support lifting behavior for user-defined functions. I'm recanting, since I think that discussion of lifting user-defined operators has enough overlap with the present discussion such that it won't be able to help but bleed in. Also, insights with respect to one discussion will probably be valuable with respect to the other.

Some rough and ready guiding questions are:

  1. When to lift?
    a) Which core operators will be lifted?
    b) When will lifted versions be employed (as opposed to, say, throwing a NullException)?
  2. How to lift?
    a) How ought lifted operators to handle null arguments: (i) in the case of arithmetic operators? (ii) in the case of comparison operators?
    b) What ought the user interface for lifted operators to be?

Answers to these questions will reflect/constitute the package's general philosophy concerning the handling of null values and user interaction with null values. It's also worth noting how different approaches to a given issue will affect the space of another. For instance, if lifting occurs automatically via generic dispatch, (i.e. we simply overload, say + to handle Nullable arguments), then a conservative approach may be to lift only a small selection of core operators. However, if lifting must be explicitly indicated the user, (say by calling +? rather than +) then a similarly conservative approach may be to expand the base of lifted operators, since the means of calling lifted operators will require a user to be more conscious of the former's roles in her code.

Implicit in the above considerations is the assumption that it we ought to evaluate proposed lifting schemes according to (among other things) the extent to which users must explicitly designate lifting behaviors in their code. Other fairly straightforward criteria according to which we ought to evaluate lifting schemes are:

  1. Usability/Expressivity -- Does the lifting scheme permit users to accomplish their desired objectives efficiently? Does it provide expressive semantics that allow users to lift when they want to lift and avoid lifting when they don't want to lift?
  2. Performance -- Does the lifting scheme compromise the performance of generic user code? In particular, does it require specialized/idiomatic coding techniques to obtain performance?
  3. Safety -- Does the lifting scheme take care to avoid as much as possible nonsensical results involving Nullable values? Does it encourage users to think critically about their handling of null values?
  4. Extensibility -- Is the lifting scheme easily built upon in packages that provide statistical/visualization/interpretation functionality?
  5. Accessibility -- What will the learning curve for our lifting scheme be like for users coming from other languages (in particular R and Pandas)?

Many of the design decisions relating to lifting will feature tradeoffs between the criteria enumerated above. For instance, we have the option of having Nullable(1) < Nullable(5) return true or Nullable(true). The latter would be consistent with the general definition of a lifted operator provided above, and it would nudge the end user to incorporate null-handling in her code. However, it might also hinder usability, and it would present a learning curve for users coming from other languages. This example also illustrates how the use of Nullables to represent missing values allows for additional flexibility over the use of NA to represent missing values -- something we ought to keep in mind.

EDIT: Some external resources: JuliaLang/julia/pull/9446 this julia-dev mailing list thread

^It's worth repeating a summary of John's assessment from the above thread:

As I see it, progagation can work in a couple of ways: (1) Always off: you can't propagate nullability without explicitly constructing a Nullable object. (2) Opt-in at the call-site: you can propagate nullability by using something like map(f, Nullable). (3) Opt-in near the call-site: you can propagate nullability within a block of code using something like @propagate begin x = Nullable(1) + Nullable(2); sin(x) end (4) Opt-in at the definition site: you can propagate nullability by annotating a function as propagating in the same way that we annotate functions as vectorizable. (5) Always on: you can't prevent propagation without explicitly checking for nullability before calling any function.

(^Permalink to source)

nalimilan commented 9 years ago

Great summary. I guess you're aware of the recent discussion on julia-devel? https://groups.google.com/d/msg/julia-dev/WD7-vQeweJE/yGBFU2E2b2cJ

My position in two words is that we need a very simple way of lifting/propagating, be it because that's the default behaviour, or thanks to a very convenient syntax.

davidagold commented 9 years ago

Thank you, @nalimilan . Actually, I hadn't seen that thread! I'm reading it now. As my undergraduate training was in academic philosophy (and maths), and as I was particularly interested in the trends in 20th century metaphysics/philosophy of language that dealt with counterfactuals and modality, I'm very much digging this.

johnmyleswhite commented 9 years ago

Glad you opened this issue.

My current feelings:

Possibly worth linking to https://github.com/johnmyleswhite/DesignSpace.jl as documentation of the problems with Nullable{Union()}.

I continue to be as torn as ever about the safety/usability tradeoff. I guess the only thing I'm sure of now is that we can choose between safety and usability, but performance always has to come first.

davidagold commented 9 years ago

Thank you both John and Milan for your thoughts. Am I correct in assuming for now that we're going to try to make Nullable serve both the ontological role and the epistemological role that John describes in the e-mail list thread? Maybe it will become apparent that this conflation is actually unworkable, but for now it seems we may as well try to work with what we've got.

I'm on board for lifting a core set of operators that take isbits-true arguments. And I will indeed link to DesignSpace.jl; devising a method that supplies the return type in the case of null arguments is essential to safe lifting behavior, so it will be useful to have examples of the performance costs of defaulting to Nullable{Union{}} on hand.

My intuition, which I can't explain as thoroughly as I'd like, is that the closer a lifting annotation is to a call site, the more the annotation reflects the epistemological sense of Nullable, whereas the closer a lifting annotation is to a definition site, the more the annotation reflects the ontological sense of Nullable. When a user indicates at the definition site that the present function ought to lift she makes this decision amidst a number of others about the broader architecture of her program. When she lifts a function directly at the call site the decision is more likely in the service of some ad-hoc operation.

I suspect that lifting annotations more towards the "middle" between the call-site and the definition site will be most error prone and probably not as useful as one may initially think. For example, it might seem preferable at first to lift through an entire block of code with one @propagate annotation. But what if one wants to work with both non-nullable values within the block? It seems to me that getting a single @propogate macro to wrap the following if block, for instance, and to behave as desired will require a considerable amount of metaprogramming:

# y, z are Nullable; g, h takes non-Nullable arguments
# x is not Nullable; f may or may not take Nullable arguments
if f(x) > 10
    return g(y, z)
else
    return h(y, z)
end

One could require passing the names of functions to be lifted to the macro, but this has the disadvantages: (1) it becomes more verbose and possibly more work than annotating individual function calls; (2) there's no easy way to indicate that you want to lift one method of g at a particular call-site but not a different method of g at a different call-site. I'm not saying that smart propagation by-the-block is not possible, but it seems like edge cases will be common.

I agree with Milan that lifting must be readily accessible. My ideal for this would be a one- or two- character annotation that can attend directly at the call-site. I've been working on a macro @^ that takes a function call as an argument and performs an unsafe (for isbits-false Nullable arguments) ad-hoc call-site lift:

julia> macroexpand(:( @^ f(x, y) ))
:(Nullable(f(x.value, y.value), x.isnull || y.isnull))

I've also implemented two (relatively) safe versions: one that takes a type argument T and returns Nullable{T}() when a null argument is passed to f, and one that takes T to be the promoted eltype of the Nullable arguments:

julia> macroexpand(:( @lift1 f(x, y) Int ))
quote  # /Users/David/.julia/v0.4/NullableArrays/src/lift.jl, line 52:
    if x.isnull || y.isnull # line 53:
        Nullable{Int}()
    else  # line 55:
        Nullable(f(x.value, y.value))
    end
end

julia> macroexpand(:( @lift2 f(x, y) ))
quote  # /Users/David/.julia/v0.4/NullableArrays/src/lift.jl, line 77:
    if x.isnull || y.isnull # line 78:
        Nullable{promote_type((eltype)(x), (eltype)(y))}()
    else  # line 80:
        Nullable(f(x.value, y.value))
    end
end

Some preliminary tests show good performance:

srand(1)
N = 5_000_000
f(x, y) = x * y

function Base.(:+){S1, S2}(x::Nullable{S1}, y::Nullable{S2})
    return Nullable(Base.(:+)(x.value, y.value), x.isnull | y.isnull)
end

X = NullableArray(rand(N))
Y = NullableArray(rand(N))
Xn = NullableArray(rand(N), rand(Bool, N))
Yn = NullableArray(rand(N), rand(Bool, N))

function tracedot_unsafe(X, Y)
    res = Nullable(0.0)
    for i in 1:length(X)
         res += @^ f(X[i], Y[i])
    end
    return res
end

function tracedot_safe1(X, Y)
    res = Nullable(0.0)
    for i in 1:length(X)
         res += @lift1 f(X[i], Y[i]) Float64
    end
    return res
end

function tracedot_safe2(X, Y)
    res = Nullable(0.0)
    for i in 1:length(X)
         res += @lift2 f(X[i], Y[i])
    end
    return res
end

tracedot_unsafe(X, Y)
tracedot_safe1(X, Y)
tracedot_safe2(X, Y)
tracedot_unsafe(Xn, Yn)
tracedot_safe1(Xn, Yn)
tracedot_safe2(Xn, Yn)

@time(tracedot_unsafe(X, Y))
@time(tracedot_safe1(X, Y))
@time(tracedot_safe2(X, Y))
@time(tracedot_unsafe(Xn, Yn))
@time(tracedot_safe1(Xn, Yn))
@time(tracedot_safe2(Xn, Yn))

yields times

julia> include("/Users/David/.julia/v0.4/NullableArrays/src/lift.jl")
  25.452 milliseconds (5 allocations: 176 bytes) # @^, no null entries
  29.641 milliseconds (5 allocations: 176 bytes) # @lift1, no null entries
  31.367 milliseconds (5 allocations: 176 bytes) # @lift2, no null entries
  66.856 milliseconds (5 allocations: 176 bytes) # @^, approx half random null entries
  50.772 milliseconds (5 allocations: 176 bytes) # @lift1, approx half random null entries
  70.085 milliseconds (5 allocations: 176 bytes) # @lift2, approx half random null entries

Those times are fairly consistent across trials. The two "safe" macros can sometimes be used on isbits-false Nullable types:

A = NullableArray([ rand(10) for i in 1:10 ], rand(Bool, 10))
B = NullableArray([ rand(5_000) for i in 1:10_000 ], rand(Bool, 10_000))

function trace_means(A)
    x = Nullable(0.0)
    for i in 1:length(A)
        x += @lift1 mean(A[i]) Float64
    end
    return x
end

trace_means(A)
@time(trace_means(B))

julia> include("/Users/David/.julia/v0.4/NullableArrays/src/lift.jl")
  75.190 milliseconds (40123 allocations: 1175 KB)
Nullable{Float64}()

I say "sometimes" because whether or not @lift2 is actually safe will depend on the which notion of "type-stability" is applicable to the given situation. Since mean takes an Array and gives a Float64, using @lift2 would not be appropriate in this situation.

I'd like to offer @^ as the "main" lifting macro since it's syntactically lightweight, but I think it should be given one of the safe implementations. Here we can observe another interesting tradeoffs: @lift2 requires fewer arguments but is also less flexible than @lift1. Or, perhaps we oughtn't to advertise a single lifting macro as the "go-to" macro, but rather promote two or three lifting macros and give thorough explanations of the different situations for which each is especially suited. Supplying a type annotation with @lift1 may not even be so bad with the use of well-defined typealiases.

Of course, when functions are overhauled, we may then have the resources actually to provide a single lifting macro that is sufficiently flexible and safe to cover the majority of end user cases. Or, perhaps by that time there will be other means for universally quantifying over Nullable arguments. I've intended the foregoing to be more of a proof-of concept that simple call-site lifts have the potential to be (mostly) safe and performant, and perhaps provide some useful precedents for the future time when folks consider integrating lifting behavior into Julia Base.

Finally, if type annotations could be supplied easily and without cluttering code, then the tradeoff between safety and usability may become less of a tradeoff than we currently see it. In particular, it's possible that the user-end coding behaviors that make code dealing with Nullables safe actually overlap with the behaviors that make such code fast. Perhaps, then, users will not mind an API that requires them to work a bit harder to ensure their code is null-safe if doing so also nudges them to write more performant code the first time around. "Safe code is fast code."