JuliaStats / NullableArrays.jl

DEPRECATED Prototype of the new JuliaStats NullableArrays package
Other
35 stars 21 forks source link

[RFC] Fix [map]reduce with no non-null and single non-null element #158

Open alyst opened 8 years ago

alyst commented 8 years ago

Also test [map]reduce() with N=2 to expose this corner case.

alyst commented 8 years ago

0.4 and 0.5 ok, failing on nightlies in broadcast.jl, I'm not sure this is related to the PR.

codecov-io commented 8 years ago

Codecov Report

Merging #158 into master will decrease coverage by -27.77%.

@@             Coverage Diff             @@
##           master     #158       +/-   ##
===========================================
- Coverage   85.89%   58.12%   -27.77%     
===========================================
  Files          14       13        -1     
  Lines         865      855       -10     
===========================================
- Hits          743      497      -246     
- Misses        122      358      +236
Impacted Files Coverage Δ
src/reduce.jl 83.58% <55.55%> (+0.72%) :white_check_mark:
src/map0_4.jl 0% <ø> (-100%) :x:
src/subarray.jl 18.51% <ø> (-52.92%) :x:
src/show.jl 0% <ø> (-51.88%) :x:
src/operators.jl 52.5% <ø> (-42.63%) :x:
src/broadcast.jl 56.32% <ø> (-41.38%) :x:
src/indexing.jl 78.04% <ø> (-21.96%) :x:
src/nullablevector.jl 86.84% <ø> (-12.29%) :x:
src/map.jl 87.87% <ø> (-3.04%) :x:
src/primitives.jl 98.8% <ø> (-1.2%) :x:
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update b7b50c3...80f62d9. Read the comment docs.

alyst commented 8 years ago

Also fixed the return type of mapreduce() when the non-empty array contains no non-null elements (was returning T, should return Nullable{T}) and added the tests.

Actually, what should be the behaviour of mapreduce(skipnull=true) in the situations, where the are no non-null elements and the reduction over an empty array is not defined (no matching Base.mr_empty())? Right now it throws ArgumentError (actually thrown by the default Base.mr_empty()), but maybe it should return Nullable{T}() instead?

cc @davidagold, @nalimilan

alyst commented 8 years ago

Now fails on nightlies and also on 0.4 (all nulls test) I would like to fix that, but we should clarify the correct behaviour first.

nalimilan commented 8 years ago

Actually, what should be the behaviour of mapreduce(skipnull=true) in the situations, where the are no non-null elements and the reduction over an empty array is not defined (no matching Base.mr_empty())? Right now it throws ArgumentError (actually thrown by the default Base.mr_empty()), but maybe it should return Nullable{T}() instead?

We should do the same thing as what Base does on empty arrays. Apparently it throws an error, so let's continue doing that. If that's annoying, we should file an issue in Julia.

alyst commented 8 years ago

@nalimilan I suppose one of the Nullable roles is to silence exceptions like this. Otherwise one always has to check for such corner cases in advance, instead of doing simple post-hoc isnull(). And that's not fun for by(df, ...). I think it could be fixed by defining the default mr_empty()

Base.mr_empty{T}(f, op, Nullable{T}) = Nullable{T}()
nalimilan commented 8 years ago

Otherwise one always has to check for such corner cases in advance, instead of doing simple post-hoc isnull().

OK, that's an argument. Yet I guess Base has a strong reason for throwing an error instead of guessing the return type. I'm not sure which one, since in theory we should be able to infer the return type even when no element is present as long as we know the element type. But the definition you give doesn't do that: it assumes f and op will preserve the type, which isn't necessarily the case and will introduce type instabilities in some cases.

I think we need 1) to understand why Base does this, and 2) see how we can use Base.return_type to infer a correct type.

alyst commented 8 years ago

@nalimilan You're right, this definition is not correct, it should be something like

function Base.mr_empty{T<:DataType}(f, op, ::Type{Nullable{T}})
    f_T = Core.Inference.return_type(f, Tuple{T})
    op_T = Core.Inference.return_type(Base.r_promote, Tuple{typeof(op), f_T})
    return Nullable{op_T}()
end

Base defines the result of reduction over an empty array when "mathematically" it has some sense: sum([]) = 0.0, prod([]) = 1.0, maximum(abs, []) = 0.0 and few similar cases (I guess there are some algorithms relying on maximum(abs, []) == 0.0). By default it throws an error. So I think it would make "mathematical" sense to return Nullable by default instead of throwing an error. Trying to get the value would result in an exception anyway.

nalimilan commented 8 years ago

Shouldn't the function call Base.r_promote instead of calling return_type on it?

Anyway, I wonder whether we couldn't get rid of the mapreduce method for NullableArray and use the generic AbstractArray version. See https://github.com/JuliaStats/NullableArrays.jl/issues/153 for the same discussion about broadcast. For example, this already works with Array{Nullable}:

x = [Nullable{Int}()]
mapreduce(x->x*Nullable(2), +, x) # -> Nullable{Int64}()

So I'm not sure why we would need our custom method.

alyst commented 8 years ago

Shouldn't the function call Base.r_promote instead of calling return_type on it?

The problem is that there is no valid element of type T to promote.

Anyway, I wonder whether we couldn't get rid of the mapreduce method for NullableArray and use the generic AbstractArray version. See #153 for the same discussion about broadcast. For example, this already works with Array{Nullable}:

x = [Nullable{Int}()] mapreduce(x->x*Nullable(2), +, x) # -> Nullable{Int64}() So I'm not sure why we would need our custom method.

Since Base.mapreduce() doesn't have skipnull, this PR is more about these cases (on 0.5):

julia> x = Nullable{Int}[]
# 0-element Array{Nullable{Int64},1}

julia> mapreduce(x->x*Nullable(2), min, x)
# Nullable{Any}()

julia> mapreduce(x->x*Nullable(2), sum, x)
# Nullable{Any}()

julia> reduce(min, x)
# Nullable{Int64}()
nalimilan commented 8 years ago

The problem is that there is no valid element of type T to promote.

Use Base.r_promote_type then.

But I'd really like to know why Base doesn't does the same when the type can be inferred. There must be an issue about it.

alyst commented 8 years ago

The problem is that there is no valid element of type T to promote. Use Base.r_promote_type then.

I cannot find r_promote_type() in the Base.

But I'd really like to know why Base doesn't does the same when the type can be inferred. There must be an issue about it.

I think it's just "an issue" to be created. There's also e.g.

julia> sum(Nullable{Float64}[])
ERROR: MethodError: no method matching zero(::Type{Nullable{Float64}})
Closest candidates are:
  zero(::Type{Base.LibGit2.Oid}) at libgit2/oid.jl:88
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuildItem}) at pkg/resolve/versionweight.jl:80
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuild}) at pkg/resolve/versionweight.jl:120
  ...
 in _mapreduce(::Base.#identity, ::Base.#+, ::Base.LinearFast, ::Array{Nullable{Float64},1}) at ./reduce.jl:148
 in sum(::Array{Nullable{Float64},1}) at ./reduce.jl:229
 in eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:64
 in macro expansion at ./REPL.jl:95 [inlined]
 in (::Base.REPL.##3#4{Base.REPL.REPLBackend})() at ./event.jl:68
nalimilan commented 8 years ago

Funny, it's been added two days ago in master. So I guess we need to use the less pretty form for now. Would be nice to have a VERSION branch so that we can easily clean it up later.

The issue of reductions on empty collections is hardly new. See the links at: https://github.com/JuliaLang/julia/issues/18073#issuecomment-240501236

Though it might not have been discussed again since the introduction of Base.return_type. You could try filing an issue in Julia to get some thoughts.

As regard the sum(Nullable{Int}[]) failure, maybe we'll have to define zero and one for Nullable, as they are quite basic functions (just like operators). Doesn't sound great, but lifting cannot help us here...

alyst commented 8 years ago

As regard the sum(Nullable{Int}[]) failure, maybe we'll have to define zero and one for Nullable, as they are quite basic functions (just like operators). Doesn't sound great, but lifting cannot help us here...

Nullable support for zero() and one() would be nice, but in general there should be a mechanism to make [map]reduce([f,] op, Nullable{T}[]) consistent with [map]reduce([f,] op, T[]), i.e. returning Nullable{T}(), when the original version throws ArgumentError and Nullable{T}(v) when the original version returns v.

davidagold commented 8 years ago

What alternative is there to throwing an error when one reduces over an empty array? Since reduce does not, in general, return a container, you can't use the inferred return type, say U to somehow inform whatever such a reduction yields -- that is, one can't just return an empty Container{U}.

The case of NullableArrays is different. Using the generic (AbstractArray) reduce along with a higher-order lift should (unless I'm missing something) give the following behavior for X::NullableArray{T}:

I'd argue that this is almost exactly what we want. I kind of want to have a discussion about whether dropnull should return an Array or a NullableArray, which would affect the fourth behavior mentioned above. But in any case, I think that trying to reduce over missing data is not the same as trying to reduce over no data, and I'm happy if the former returns Nullable while the latter errors.

I think that, in general, we should try to rely extant frameworks in Base as much as possible. The main reasons not to are (1) performance concerns and (2) usability concerns. The community seems to be heading towards addressing (2) in packages that are closer to (tabular) data structures built on top of NullableArrays. As for (1), I think that, at this point, we should really be working with the Base language to improve compiler support for Nullable rather than hacking it onto NullableArrays. This would mean less maintenance work for us in the short term and would probably be better for both the language and code that deals with Nullables elsewhere in the long term.

alyst commented 8 years ago

@davidagold Thanks for your thorough considerations.

I totally agree that the Nullable behaviour should be defined by the Base as much possible. Actually, this PR started as a simple fix for the typo in 1-non-null case (it throws an error), which was not covered by the tests, and then 0-non-null return type. So while the behaviour fixes should be probably done in Base, I guess there should be some immediate fix for the bug.

Re the behaviour. It could work that the downstream packages (e.g. DataFrames) do the preliminary non-null checks during aggregation to prevent NullableArray throwing an exception in all-nulls skipnulls=true case, or catch that exception (whatever creates smaller overhead). I hope there would be just a very places where such check have to be injected, and the code for the check and Nullable{U}() type inference would be compact. OTOH, if DataFrames aggregation is the main/typical use case of reduction over nullables, it would make sense to make non-throwing nullable-returning behaviour the default and implement it in Base. The third option is to add strict=true/false kw argument for mapreduce() dealing with nullables (strict throws an error, non-strict returns null).

davidagold commented 8 years ago

@alyst hmm, you're definitely right that there's a fifth behavior we want, namely non-throwing over non-empty but all-null X::NullableArray. What do you think about the following proposal: we let reduce(op, X) fall back on the generic implementation, but implement lift such that lift(reduce)(op, X) has the following behavior:

nalimilan commented 8 years ago

lift(reduce)(op, X) sounds a bit surprising to me. Couldn't we use reduce(lift(op), X) for this? Or do you plan to give it different semantics?

davidagold commented 8 years ago

The difference between lift(reduce)(op, X) and reduce(lift(op), X) as I've delineated them above presents in the case where X has some null, some non-null values. In this case, lift(reduce)(op, X) will return reduce(op, dropnull(X)) which is non-null, whereas reduce(lift(op), X) will return (typically, assuming standard lifting semantics for op), an empty Nullable. This makes a difference in the case Alexey was talking about, aggregating over grouped data. If you want to handle aggregation with the same function over all groups, then reduce(lift(op), col) will return an empty Nullable both when a group has all null values and when a group has only some non-null values. I assume that's not what we want?

We could handle this by requiring/lowering to reduce(lift(op), dropnull(col)) and making reduction over empty NullableArrays (that is, NullableArrays with no elements, rather than an all-null NullableArray) not throw an error. Turns out this is what R does (y'all probably knew this already), but with a warning:

> x <- c()
> mean(x)
[1] NA
Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA

Though it doesn't seem to be returning NA because x is empty per se. I'd be fine with breaking from the AbstractArray generic behavior and returning an empty Nullable with a warning rather than throwing an exception. Requiring dropnull may be a bit annoying to users, though from a statistical perspective making that step explicit is definitely preferable. In any case, including dropnull could be done automatically in a querying macro if we decide later that the usability factor is a big concern.

alyst commented 8 years ago

Actually, R doesn't warn if it's an empty logical or numeric, c() is something different:

> x <- c()
> x
NULL
> x <- as.integer()
> x
integer(0)
> mean(x)
[1] NaN
> mean(x, na.rm=TRUE)
[1] NaN
davidagold commented 8 years ago

Good point. Looks like Julia implements the same behavior for mean, at least for empty vectors with numeric eltypes.

nalimilan commented 8 years ago

The difference between lift(reduce)(op, X) and reduce(lift(op), X) as I've delineated them above presents in the case where X has some null, some non-null values. In this case, lift(reduce)(op, X) will return reduce(op, dropnull(X)) which is non-null, whereas reduce(lift(op), X) will return (typically, assuming standard lifting semantics for op), an empty Nullable.

Got it, thanks. Indeed returning null whenever the input contains a null isn't really useful for a type which is designed precisely to support missing values. I've always argued that this apparent correctness wouldn't offer any additional safety to users for who there always are nulls in the data. But as you say this can be handled by convenience macros.

nalimilan commented 7 years ago

For consistency with https://github.com/JuliaStats/NullableArrays.jl/pull/166, I think we should have (map)reduce automatically lift. To implement this, we could do the same as for broadcast, i.e. call mapreduce(x -> lift(op, x...), ...). That would return null if the input contains a null. Then we could have an alternative version of lift which would skip nulls, i.e. lift_skipnull(op, x::Nullable, Nullable()) -> x. The choice between these two could be made via skipnull=true as currently.

This approach avoids calling dropnull, which creates a copy.