alyst opened this pull request 8 years ago
0.4 and 0.5 are OK; it's failing on nightlies in `broadcast.jl`. I'm not sure this is related to the PR.
```diff
@@            Coverage Diff             @@
##           master     #158      +/-   ##
===========================================
- Coverage   85.89%   58.12%   -27.77%
===========================================
  Files          14       13        -1
  Lines         865      855       -10
===========================================
- Hits          743      497      -246
- Misses        122      358      +236
```

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/reduce.jl | 83.58% <55.55%> (+0.72%) | :white_check_mark: |
| src/map0_4.jl | 0% <ø> (-100%) | :x: |
| src/subarray.jl | 18.51% <ø> (-52.92%) | :x: |
| src/show.jl | 0% <ø> (-51.88%) | :x: |
| src/operators.jl | 52.5% <ø> (-42.63%) | :x: |
| src/broadcast.jl | 56.32% <ø> (-41.38%) | :x: |
| src/indexing.jl | 78.04% <ø> (-21.96%) | :x: |
| src/nullablevector.jl | 86.84% <ø> (-12.29%) | :x: |
| src/map.jl | 87.87% <ø> (-3.04%) | :x: |
| src/primitives.jl | 98.8% <ø> (-1.2%) | :x: |
| ... and 3 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Powered by Codecov. Last update b7b50c3...80f62d9.
Also fixed the return type of `mapreduce()` when a non-empty array contains no non-null elements (it was returning `T`, but should return `Nullable{T}`), and added tests.
Actually, what should the behaviour of `mapreduce(skipnull=true)` be in situations where there are no non-null elements and the reduction over an empty array is not defined (no matching `Base.mr_empty()`)? Right now it throws an `ArgumentError` (actually thrown by the default `Base.mr_empty()`), but maybe it should return `Nullable{T}()` instead?
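To make the two options concrete, here is a hypothetical sketch (Julia 0.5-era syntax; `Nullable` was removed from Base in 1.0, `skipnull` comes from NullableArrays, and the "proposed" result is illustrative, not implemented):

```julia
using NullableArrays

# A non-empty array whose elements are all null:
X = NullableArray([1, 2, 3], [true, true, true])

# Current behaviour: skipnull=true leaves nothing to reduce over, and min
# has no matching Base.mr_empty(), so the default one throws:
# mapreduce(identity, min, X, skipnull=true)  # throws ArgumentError

# Proposed behaviour: return an empty Nullable of the element type instead:
# mapreduce(identity, min, X, skipnull=true)  # -> Nullable{Int}()
```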
cc @davidagold, @nalimilan
Now it fails on nightlies and also on 0.4 (the all-nulls test). I would like to fix that, but we should clarify the correct behaviour first.
> Actually, what should the behaviour of `mapreduce(skipnull=true)` be in situations where there are no non-null elements and the reduction over an empty array is not defined (no matching `Base.mr_empty()`)? Right now it throws an `ArgumentError` (actually thrown by the default `Base.mr_empty()`), but maybe it should return `Nullable{T}()` instead?
We should do the same thing as Base does on empty arrays. Apparently it throws an error, so let's continue doing that. If that's annoying, we should file an issue against Julia.
@nalimilan I suppose one of `Nullable`'s roles is to silence exceptions like this. Otherwise one always has to check for such corner cases in advance, instead of doing a simple post-hoc `isnull()`. And that's not fun for `by(df, ...)`. I think it could be fixed by defining the default `mr_empty()`:

```julia
Base.mr_empty{T}(f, op, Nullable{T}) = Nullable{T}()
```
> Otherwise one always has to check for such corner cases in advance, instead of doing a simple post-hoc `isnull()`.
OK, that's an argument. Yet I guess Base has a strong reason for throwing an error instead of guessing the return type. I'm not sure which one, since in theory we should be able to infer the return type even when no element is present, as long as we know the element type. But the definition you give doesn't do that: it assumes `f` and `op` will preserve the type, which isn't necessarily the case and will introduce type instabilities in some cases.

I think we need 1) to understand why Base does this, and 2) to see how we can use `Base.return_type` to infer a correct type.
@nalimilan You're right, this definition is not correct; it should be something like:

```julia
function Base.mr_empty{T<:DataType}(f, op, ::Type{Nullable{T}})
    f_T = Core.Inference.return_type(f, Tuple{T})
    op_T = Core.Inference.return_type(Base.r_promote, Tuple{typeof(op), f_T})
    return Nullable{op_T}()
end
```
Base defines the result of a reduction over an empty array when it makes some "mathematical" sense: `sum([]) == 0.0`, `prod([]) == 1.0`, `maximum(abs, []) == 0.0`, and a few similar cases (I guess there are some algorithms relying on `maximum(abs, []) == 0.0`). By default it throws an error. So I think it would make "mathematical" sense to return a null `Nullable` by default instead of throwing an error. Trying to get its value would result in an exception anyway.
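For reference, a quick sketch of the empty-array cases mentioned above (the `maximum(abs, ...)` identity was 0.5-era behaviour and may differ in later Julia versions):

```julia
sum(Float64[])   # -> 0.0, the additive identity
prod(Float64[])  # -> 1.0, the multiplicative identity
# maximum(abs, Float64[]) returned 0.0 on 0.5 via a special Base.mr_empty method;
# minimum(Float64[]) has no identity element and throws ArgumentError.
```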
Shouldn't the function call `Base.r_promote` instead of calling `return_type` on it?
Anyway, I wonder whether we couldn't get rid of the `mapreduce` method for `NullableArray` and use the generic `AbstractArray` version. See https://github.com/JuliaStats/NullableArrays.jl/issues/153 for the same discussion about `broadcast`. For example, this already works with `Array{Nullable}`:

```julia
x = [Nullable{Int}()]
mapreduce(x -> x * Nullable(2), +, x) # -> Nullable{Int64}()
```

So I'm not sure why we would need our custom method.
> Shouldn't the function call `Base.r_promote` instead of calling `return_type` on it?
The problem is that there is no valid element of type `T` to promote.
> Anyway, I wonder whether we couldn't get rid of the `mapreduce` method for `NullableArray` and use the generic `AbstractArray` version. See #153 for the same discussion about `broadcast`. For example, this already works with `Array{Nullable}`:
>
> ```julia
> x = [Nullable{Int}()]
> mapreduce(x -> x * Nullable(2), +, x) # -> Nullable{Int64}()
> ```
>
> So I'm not sure why we would need our custom method.
Since `Base.mapreduce()` doesn't have `skipnull`, this PR is more about these cases (on 0.5):

```julia
julia> x = Nullable{Int}[]
0-element Array{Nullable{Int64},1}

julia> mapreduce(x -> x * Nullable(2), min, x)
Nullable{Any}()

julia> mapreduce(x -> x * Nullable(2), sum, x)
Nullable{Any}()

julia> reduce(min, x)
Nullable{Int64}()
```
> The problem is that there is no valid element of type `T` to promote.

Use `Base.r_promote_type` then.

But I'd really like to know why Base doesn't do the same when the type can be inferred. There must be an issue about it.
> The problem is that there is no valid element of type `T` to promote. Use `Base.r_promote_type` then.

I cannot find `r_promote_type()` in Base.
> But I'd really like to know why Base doesn't do the same when the type can be inferred. There must be an issue about it.

I think it's just "an issue" to be created. There's also e.g.:

```julia
julia> sum(Nullable{Float64}[])
ERROR: MethodError: no method matching zero(::Type{Nullable{Float64}})
Closest candidates are:
  zero(::Type{Base.LibGit2.Oid}) at libgit2/oid.jl:88
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuildItem}) at pkg/resolve/versionweight.jl:80
  zero(::Type{Base.Pkg.Resolve.VersionWeights.VWPreBuild}) at pkg/resolve/versionweight.jl:120
  ...
 in _mapreduce(::Base.#identity, ::Base.#+, ::Base.LinearFast, ::Array{Nullable{Float64},1}) at ./reduce.jl:148
 in sum(::Array{Nullable{Float64},1}) at ./reduce.jl:229
 in eval_user_input(::Any, ::Base.REPL.REPLBackend) at ./REPL.jl:64
 in macro expansion at ./REPL.jl:95 [inlined]
 in (::Base.REPL.##3#4{Base.REPL.REPLBackend})() at ./event.jl:68
```
Funny, it was added to master two days ago. So I guess we need to use the less pretty form for now. It would be nice to have a `VERSION` branch so that we can easily clean it up later.
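One way such a branch could look (a sketch; using `isdefined` avoids hard-coding the exact dev version at which `r_promote_type` appeared, and `_r_promote_type` is a hypothetical helper name):

```julia
# Use Base.r_promote_type where it exists; otherwise fall back on inferring
# the result of Base.r_promote (the "less pretty form"):
if isdefined(Base, :r_promote_type)
    _r_promote_type(op, T::Type) = Base.r_promote_type(op, T)
else
    _r_promote_type(op, T::Type) =
        Core.Inference.return_type(Base.r_promote, Tuple{typeof(op), T})
end
```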
The issue of reductions over empty collections is hardly new. See the links at https://github.com/JuliaLang/julia/issues/18073#issuecomment-240501236. Though it might not have been discussed again since the introduction of `Base.return_type`. You could try filing an issue in Julia to get some thoughts.
As regards the `sum(Nullable{Int}[])` failure, maybe we'll have to define `zero` and `one` for `Nullable`, as they are quite basic functions (just like operators). Doesn't sound great, but lifting cannot help us here...
> As regards the `sum(Nullable{Int}[])` failure, maybe we'll have to define `zero` and `one` for `Nullable`, as they are quite basic functions (just like operators). Doesn't sound great, but lifting cannot help us here...
`Nullable` support for `zero()` and `one()` would be nice, but in general there should be a mechanism to make `[map]reduce([f,] op, Nullable{T}[])` consistent with `[map]reduce([f,] op, T[])`, i.e. returning `Nullable{T}()` when the original version throws an `ArgumentError`, and `Nullable{T}(v)` when the original version returns `v`.
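A minimal sketch of that consistency rule, assuming a hypothetical helper name (`reduce_nullable` is not part of any API) and 0.5-era parametric-method syntax:

```julia
# Mirror reduce over T[]: wrap its result in Nullable{T}, and map its
# ArgumentError (undefined empty reduction) to an empty Nullable{T}.
function reduce_nullable{T}(op, X::AbstractArray{Nullable{T}})
    vals = T[get(x) for x in X if !isnull(x)]
    try
        return Nullable{T}(reduce(op, vals))
    catch e
        isa(e, ArgumentError) || rethrow()
        return Nullable{T}()  # reduce(op, T[]) would have thrown here
    end
end
```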
What alternative is there to throwing an error when one reduces over an empty array? Since `reduce` does not, in general, return a container, you can't use the inferred return type, say `U`, to somehow inform whatever such a reduction yields -- that is, one can't just return an empty `Container{U}`.
The case of `NullableArray`s is different. Using the generic (`AbstractArray`) `reduce` along with a higher-order `lift` should (unless I'm missing something) give the following behavior for `X::NullableArray{T}`:

1. If `X` is non-empty but has no non-null values, then `reduce(lift(op), X)` returns an empty `Nullable{U}`, where `U` is determined by type inference conducted in `lift(op, x::T, y::T)`.
2. If `X` is empty, then `reduce(lift(op), X)` errors.
3. If `X` is non-empty but has no non-null values, then `reduce(op, dropnull(X))` errors.
4. If `X` is non-empty and has non-null values, then `reduce(op, dropnull(X))` returns the reduction applied to an `Array` containing only the non-null values of `X`.

I'd argue that this is almost exactly what we want. I kind of want to have a discussion about whether `dropnull` should return an `Array` or a `NullableArray`, which would affect the fourth behavior mentioned above. But in any case, I think that trying to reduce over missing data is not the same as trying to reduce over no data, and I'm happy if the former returns `Nullable` while the latter errors.
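The higher-order `lift` mentioned above could be sketched roughly as follows (a hypothetical implementation, not NullableArrays code; type inference via the 0.5-era `Core.Inference.return_type`, with standard null-propagating semantics):

```julia
_eltype{T}(::Nullable{T}) = T  # helper: element type of a Nullable

# lift(op) returns a binary function on Nullables that propagates nulls and
# computes its result type by inference, so a non-empty all-null input still
# yields a concretely typed empty Nullable{U} rather than Nullable{Any}().
function lift(op)
    return function (x::Nullable, y::Nullable)
        U = Core.Inference.return_type(op, Tuple{_eltype(x), _eltype(y)})
        (isnull(x) || isnull(y)) && return Nullable{U}()
        return Nullable{U}(op(get(x), get(y)))
    end
end

# reduce(lift(+), X) then returns an empty Nullable{U} for non-empty all-null X,
# and errors for empty X, as the generic AbstractArray reduce does.
```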
I think that, in general, we should try to rely on extant frameworks in Base as much as possible. The main reasons not to are (1) performance concerns and (2) usability concerns. The community seems to be heading towards addressing (2) in packages closer to (tabular) data structures built on top of `NullableArray`s. As for (1), I think that, at this point, we should really be working with the Base language to improve compiler support for `Nullable` rather than hacking it onto `NullableArray`s. This would mean less maintenance work for us in the short term and would probably be better for both the language and code that deals with `Nullable`s elsewhere in the long term.
@davidagold Thanks for your thorough considerations.
I totally agree that the `Nullable` behaviour should be defined by Base as much as possible. Actually, this PR started as a simple fix for the typo in the 1-non-null case (it threw an error), which was not covered by the tests, and then for the 0-non-null return type. So while the behaviour fixes should probably be done in Base, I guess there should be some immediate fix for the bug.
Re the behaviour: it could work for the downstream packages (e.g. `DataFrames`) to do preliminary non-null checks during aggregation to prevent `NullableArray` from throwing an exception in the all-nulls `skipnull=true` case, or to catch that exception (whichever creates smaller overhead). I hope there would be just a very few places where such checks have to be injected, and the code for the check and the `Nullable{U}()` type inference would be compact. OTOH, if `DataFrames` aggregation is the main/typical use case of reduction over nullables, it would make sense to make the non-throwing, null-returning behaviour the default and implement it in Base. The third option is to add a `strict=true/false` keyword argument to `mapreduce()` dealing with nullables (strict throws an error, non-strict returns null).
@alyst hmm, you're definitely right that there's a fifth behavior we want, namely non-throwing over a non-empty but all-null `X::NullableArray`. What do you think about the following proposal: we let `reduce(op, X)` fall back on the generic implementation, but implement `lift` such that `lift(reduce)(op, X)` has the following behavior:

1. If `X` is empty, throw an error.
2. If `X` is non-empty but all-null, return `Nullable{U}()`, where `U` is chosen via type inference (with some fall-back for non-concrete inferred types, if this makes sense).
3. If `X` is non-empty and not all null, return `Nullable(reduce(op, dropnull(X)))` (or `reduce(lift(op), dropnull(X))` if we decide that `dropnull` should actually return a `NullableArray`).

`lift(reduce)(op, X)` sounds a bit surprising to me. Couldn't we use `reduce(lift(op), X)` for this? Or do you plan to give it different semantics?
The difference between `lift(reduce)(op, X)` and `reduce(lift(op), X)` as I've delineated them above presents in the case where `X` has some null and some non-null values. In this case, `lift(reduce)(op, X)` will return `reduce(op, dropnull(X))`, which is non-null, whereas `reduce(lift(op), X)` will return (typically, assuming standard lifting semantics for `op`) an empty `Nullable`. This makes a difference in the case Alexey was talking about, aggregating over grouped data. If you want to handle aggregation with the same function over all groups, then `reduce(lift(op), col)` will return an empty `Nullable` both when a group has all null values and when a group has only some non-null values. I assume that's not what we want?
We could handle this by requiring/lowering to `reduce(lift(op), dropnull(col))` and making reduction over empty `NullableArray`s (that is, `NullableArray`s with no elements, rather than all-null `NullableArray`s) not throw an error. Turns out this is what R does (y'all probably knew this already), but with a warning:

```r
> x <- c()
> mean(x)
[1] NA
Warning message:
In mean.default(x) : argument is not numeric or logical: returning NA
```
Though it doesn't seem to be returning `NA` because `x` is empty per se. I'd be fine with breaking from the `AbstractArray` generic behavior and returning an empty `Nullable` with a warning rather than throwing an exception. Requiring `dropnull` may be a bit annoying to users, though from a statistical perspective making that step explicit is definitely preferable. In any case, including `dropnull` could be done automatically in a querying macro if we decide later that the usability factor is a big concern.
Actually, R doesn't warn if it's an empty logical or numeric vector; `c()` is something different:

```r
> x <- c()
> x
NULL
> x <- as.integer()
> x
integer(0)
> mean(x)
[1] NaN
> mean(x, na.rm=TRUE)
[1] NaN
```
Good point. Looks like Julia implements the same behavior for `mean`, at least for empty vectors with numeric `eltype`s.
> The difference between `lift(reduce)(op, X)` and `reduce(lift(op), X)` as I've delineated them above presents in the case where `X` has some null and some non-null values. In this case, `lift(reduce)(op, X)` will return `reduce(op, dropnull(X))`, which is non-null, whereas `reduce(lift(op), X)` will return (typically, assuming standard lifting semantics for `op`) an empty `Nullable`.
Got it, thanks. Indeed, returning null whenever the input contains a null isn't really useful for a type which is designed precisely to support missing values. I've always argued that this apparent correctness wouldn't offer any additional safety to users, for whom there are always nulls in the data. But as you say, this can be handled by convenience macros.
For consistency with https://github.com/JuliaStats/NullableArrays.jl/pull/166, I think we should have `(map)reduce` automatically lift. To implement this, we could do the same as for `broadcast`, i.e. call `mapreduce(x -> lift(op, x...), ...)`. That would return null if the input contains a null. Then we could have an alternative version of `lift` which would skip nulls, i.e. `lift_skipnull(op, x::Nullable, Nullable()) -> x`. The choice between these two could be made via `skipnull=true`, as currently.

This approach avoids calling `dropnull`, which creates a copy.
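A sketch of what that skipping variant could look like (hypothetical; `lift_skipnull` exists only in this discussion):

```julia
# Skip nulls instead of propagating them: a null argument acts as the
# identity for the reduction, so reduce never needs a dropnull copy.
lift_skipnull(op, x::Nullable, y::Nullable) =
    isnull(x) ? y :
    isnull(y) ? x :
    Nullable(op(get(x), get(y)))

# reduce((x, y) -> lift_skipnull(+, x, y), X) then returns
#   Nullable(<sum of the non-null values>) if X has any non-null element, and
#   an empty Nullable if X is non-empty and all null.
```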
Also test `[map]reduce()` with N=2 to expose this corner case.