Open non-Jedi opened 2 years ago
Makes sense. Would you be willing to make a pull request?
Surely mean(skipmissing(x); init=missing)
would just always return missing
? But yes, this seems like a nice addition for other uses.
Ah good point. @non-Jedi that wouldn't work for your use case then.
For both reduce
and sum
, the docstrings indicate that whether init
is used or not is unspecified:
If provided, the initial value init must be a neutral element for op that will be returned for empty collections. It is unspecified whether init is used for non-empty collections.
The value returned for empty itr can be specified by init. It must be the additive identity (i.e. zero) as it is unspecified whether init is used for non-empty collections.
I'd argue that we should just go ahead and specify that init
is only used for empty collections for these two functions since there's by definition no truly neutral element. Is there another, better way to spell "return the mean or this value for empty collections"? I'm not attached to using init
specifically for this.
I think this is unspecified because for technical (performance) reasons sometimes we need to use init
for non-empty collections. And that's the case in simple examples like this:
julia> sum([1], init=missing)
missing
julia> sum([1], init=100)
101
Is there another, better way to spell "return the mean or this value for empty collections"? I'm not attached to using
init
specifically for this.
AFAICT this problem essentially happens with skipmissing
, so we could add a convenience function in Missings.jl. For example, nan2missing(mean(skipmissing(x))
with nan2missing(x) = isnan(x) ? missing : x
would do the trick. Or is your problem that mean(skipmissing(x))
throws when x
contains only missing values due to its eltype (e.g. mean(skipmissing([missing]))
)? If so we could add something like skipmissingnotempty(x) = isempty(x) ? Iterators.repeated(missing, 1) : skipmissing(x)
(with a better name and implementation).
I think this is unspecified because for technical (performance) reasons sometimes we need to use
init
for non-empty collections. And that's the case in simple examples like this:
To be clear, I wasn't suggesting we change and specify the meaning of init
for any of the extant reduction functions in Base
only that it wouldn't be an awful inconsistency if mean
and median
were explicit that init
isn't used for non-empty collections.
AFAICT this problem essentially happens with
skipmissing
, so we could add a convenience function in Missings.jl. For example,nan2missing(mean(skipmissing(x))
withnan2missing(x) = isnan(x) ? missing : x
would do the trick. Or is your problem thatmean(skipmissing(x))
throws whenx
contains only missing values due to its eltype (e.g.mean(skipmissing([missing]))
)? If so we could add something likeskipmissingnotempty(x) = isempty(x) ? Iterators.repeated(missing, 1) : skipmissing(x)
(with a better name and implementation).
I think the problem is more the general awkwardness in julia of working with possibly empty collections--especially non-materialized ones--and not specific to working with missing
(several years ago I created a PR to the main julia repo trying to solve a similar issue for Iterators.first
and Iterators.last
). Adding any of the methods you suggest wouldn't help when the collection is potentially empty due to a call to e.g. Iterators.filter
. And I can't just look for NaN
as a sentinel value since it's also returned by e.g. mean([NaN])
.
In my ideal world, I think mean(::Float64)
would return either Union{Float64,Nothing}
or Union{Some{Float64},Nothing}
with nothing
acting as the sentinel value for an empty collection. Then I could just use something
for this.
Note that just adding an init
kwarg with the same semantics as Base.reduce
doesn't help even with my non-missing use-cases: I want a concise way to express taking the mean of a collection such that mean([]) == 0
and that mean([5]) != 2.5
; current semantics of Base.reduce
would suggest that whether mean([5]; init=0)
is equal to 5 or 2.5 is unspecified.
I think the problem is more the general awkwardness in julia of working with possibly empty collections--especially non-materialized ones--and not specific to working with missing
See AccessorsExtra for a composable approach to this problem:
julia> using AccessorsExtra
julia> maybe(first)([1,2,3])
1
julia> maybe(first)([])
# nothing
julia> maybe(@o first(_).a)([(a=1,)])
1
julia> maybe(@o first(_).a)([(b=1,)])
# nothing
julia> maybe(@o first(_).a)([])
# nothing
# convenience functions using the same machinery:
julia> x = [1,2,3];
julia> @oget only(x) NaN
NaN
julia> x = [1];
julia> @oget only(x) NaN
1
julia> x = [];
julia> @oget only(x) NaN
NaN
It fundamentally uses nothing
as a sentinel, as elsewhere in Julia. Doesn't work with mean()
as-is because of its semantics (mean
returns NaN for empty, not nothing nor exception).
Some details still not fleshed out, but the general design is there.
(#132 is a related issue with related discussion and possibly duplicate)
I'd argue that we should just go ahead and specify that
init
is only used for empty collections for these two functions since there's by definition no truly neutral element. Is there another, better way to spell "return the mean or this value for empty collections"? I'm not attached to usinginit
specifically for this.
Funny, that's precisely how I ended up deep down this rabbit hole. See https://github.com/JuliaLang/julia/pull/53945 for where I think the consensus is pointing us.
Thanks @mbauman. I think after reading through some of the interlinked issues and PRs that I'm convinced adding an init
kwarg to mean/median is definitely not the right solution to this problem. Particularly helpful was mikmoore's comment here. mean
and median
are even less like foldl
/foldr
than reduce
is, so let's not introduce the same pun/confusion.
From @mbauman's post immediately following, point 2 is not a concern since op
is not user-specified, so calling this ifempty
or similar is more appropriate.
Possible solutions I see right now:
NaN
s will require materializing the collection first.ifempty
kwarg to mean
/median
.nothing
for empty collections.nothing
for empty collections and increment the major version number of Statistics.The discussion in #132 seemed to heavily prefer solution 5 over solution 2, and I'm not sure I really understand why. @aplavin could you comment on this since you were a proponent of solution 5?
The discussion in #132 seemed to heavily prefer solution 5 over solution 2, and I'm not sure I really understand why. @aplavin could you comment on this since you were a proponent of solution 5?
The advantage of a wrapper function is that it immediately works with any function (mean
, median
or any custom user function), and it doesn't make the implementation of these functions more complex. So it ensures consistency across the ecosystem, without having to ensure that similar functions in all packages support the same keyword argument with the same semantics. (That's true at least if you consider that the default implementation calling isempty(x)
is fast enough. Otherwise you need to override it for specific functions too.)
@aplavin could you comment on this since you were a proponent of solution 5?
Basically what @nalimilan says, to be able to use it with any kind of function. Mean is nothing special here, and it would be much cleaner to add a single wrapper/postprocessing function than provide options to handle empty collections in each aggregation.
It doesn't preclude having the most commonly used options added directly to the most commonly used functions, like mean. The same story as with the dims
argument: some reductions support it natively, but there's a fully generic eachslice
mechanism to apply any aggregation over any dimension.
The absence of an
init
kwarg like is available withreduce
andsum
makesmean
compose less well with e.g.skipmissing
. I'm finding myself usinglet y=collect(skipmissing(x)); isempty(y) ? missing : mean(y) end
when I should be able to just saymean(skipmissing(x); init=missing)
.There may be other reductions than
mean
andmedian
in Statistics.jl that also should haveinit
added, but those are the two I've needed (median isn't quite a reduction technically, but it's close enough that aninit
kwarg would be useful).