JuliaStats / Statistics.jl

The Statistics stdlib that ships with Julia.
https://juliastats.org/Statistics.jl/dev/
Other
72 stars 40 forks source link

Add init kwarg to mean/median to match other reductions in Base #105

Open non-Jedi opened 2 years ago

non-Jedi commented 2 years ago

The absence of an init kwarg like is available with reduce and sum makes mean compose less well with e.g. skipmissing. I'm finding myself using let y=collect(skipmissing(x)); isempty(y) ? missing : mean(y) end when I should be able to just say mean(skipmissing(x); init=missing).

There may be other reductions than mean and median in Statistics.jl that also should have init added, but those are the two I've needed (median isn't quite a reduction technically, but it's close enough that an init kwarg would be useful).

nalimilan commented 2 years ago

Makes sense. Would you be willing to make a pull request?

mbauman commented 2 months ago

Surely mean(skipmissing(x); init=missing) would just always return missing? But yes, this seems like a nice addition for other uses.

nalimilan commented 2 months ago

Ah good point. @non-Jedi that wouldn't work for your use case then.

non-Jedi commented 2 months ago

For both reduce and sum, the docstrings indicate that whether init is used or not is unspecified:

If provided, the initial value init must be a neutral element for op that will be returned for empty collections. It is unspecified whether init is used for non-empty collections.

The value returned for empty itr can be specified by init. It must be the additive identity (i.e. zero) as it is unspecified whether init is used for non-empty collections.

I'd argue that we should just go ahead and specify that init is only used for empty collections for these two functions since there's by definition no truly neutral element. Is there another, better way to spell "return the mean or this value for empty collections"? I'm not attached to using init specifically for this.

nalimilan commented 2 months ago

I think this is unspecified because for technical (performance) reasons sometimes we need to use init for non-empty collections. And that's the case in simple examples like this:

julia> sum([1], init=missing)
missing

julia> sum([1], init=100)
101

Is there another, better way to spell "return the mean or this value for empty collections"? I'm not attached to using init specifically for this.

AFAICT this problem essentially happens with skipmissing, so we could add a convenience function in Missings.jl. For example, nan2missing(mean(skipmissing(x)) with nan2missing(x) = isnan(x) ? missing : x would do the trick. Or is your problem that mean(skipmissing(x)) throws when x contains only missing values due to its eltype (e.g. mean(skipmissing([missing])))? If so we could add something like skipmissingnotempty(x) = isempty(x) ? Iterators.repeated(missing, 1) : skipmissing(x) (with a better name and implementation).

non-Jedi commented 2 months ago

I think this is unspecified because for technical (performance) reasons sometimes we need to use init for non-empty collections. And that's the case in simple examples like this:

To be clear, I wasn't suggesting we change and specify the meaning of init for any of the extant reduction functions in Base only that it wouldn't be an awful inconsistency if mean and median were explicit that init isn't used for non-empty collections.

AFAICT this problem essentially happens with skipmissing, so we could add a convenience function in Missings.jl. For example, nan2missing(mean(skipmissing(x)) with nan2missing(x) = isnan(x) ? missing : x would do the trick. Or is your problem that mean(skipmissing(x)) throws when x contains only missing values due to its eltype (e.g. mean(skipmissing([missing])))? If so we could add something like skipmissingnotempty(x) = isempty(x) ? Iterators.repeated(missing, 1) : skipmissing(x) (with a better name and implementation).

I think the problem is more the general awkwardness in julia of working with possibly empty collections--especially non-materialized ones--and not specific to working with missing (several years ago I created a PR to the main julia repo trying to solve a similar issue for Iterators.first and Iterators.last). Adding any of the methods you suggest wouldn't help when the collection is potentially empty due to a call to e.g. Iterators.filter. And I can't just look for NaN as a sentinel value since it's also returned by e.g. mean([NaN]).

In my ideal world, I think mean(::Float64) would return either Union{Float64,Nothing} or Union{Some{Float64},Nothing} with nothing acting as the sentinel value for an empty collection. Then I could just use something for this.

Note that just adding an init kwarg with the same semantics as Base.reduce doesn't help even with my non-missing use-cases: I want a concise way to express taking the mean of a collection such that mean([]) == 0 and that mean([5]) != 2.5; current semantics of Base.reduce would suggest that whether mean([5]; init=0) is equal to 5 or 2.5 is unspecified.

aplavin commented 2 months ago

I think the problem is more the general awkwardness in julia of working with possibly empty collections--especially non-materialized ones--and not specific to working with missing

See AccessorsExtra for a composable approach to this problem:

julia> using AccessorsExtra

julia> maybe(first)([1,2,3])
1
julia> maybe(first)([])
# nothing

julia> maybe(@o first(_).a)([(a=1,)])
1
julia> maybe(@o first(_).a)([(b=1,)])
# nothing
julia> maybe(@o first(_).a)([])
# nothing

# convenience functions using the same machinery:
julia> x = [1,2,3];
julia> @oget only(x) NaN
NaN
julia> x = [1];
julia> @oget only(x) NaN
1
julia> x = [];
julia> @oget only(x) NaN
NaN

It fundamentally uses nothing as a sentinel, as elsewhere in Julia. Doesn't work with mean() as-is because of its semantics (mean returns NaN for empty, not nothing nor exception). Some details still not fleshed out, but the general design is there.

non-Jedi commented 2 months ago

(#132 is a related issue with related discussion and possibly duplicate)

mbauman commented 2 months ago

I'd argue that we should just go ahead and specify that init is only used for empty collections for these two functions since there's by definition no truly neutral element. Is there another, better way to spell "return the mean or this value for empty collections"? I'm not attached to using init specifically for this.

Funny, that's precisely how I ended up deep down this rabbit hole. See https://github.com/JuliaLang/julia/pull/53945 for where I think the consensus is pointing us.

non-Jedi commented 2 months ago

Thanks @mbauman. I think after reading through some of the interlinked issues and PRs that I'm convinced adding an init kwarg to mean/median is definitely not the right solution to this problem. Particularly helpful was mikmoore's comment here. mean and median are even less like foldl/foldr than reduce is, so let's not introduce the same pun/confusion.

From @mbauman's post immediately following, point 2 is not a concern since op is not user-specified, so calling this ifempty or similar is more appropriate.

Possible solutions I see right now:

  1. Accept that differentiating between the mean of an empty collection and the mean of a collection containing NaNs will require materializing the collection first.
  2. Add an ifempty kwarg to mean/median.
  3. Introduce a new function (not sure what to call it) that takes the mean of a collection but returns nothing for empty collections.
  4. Change current implementations to return nothing for empty collections and increment the major version number of Statistics.
  5. Add wrapper function that allows specifying default value if empty as discussed in #132.

The discussion in #132 seemed to heavily prefer solution 5 over solution 2, and I'm not sure I really understand why. @aplavin could you comment on this since you were a proponent of solution 5?

nalimilan commented 2 months ago

The discussion in #132 seemed to heavily prefer solution 5 over solution 2, and I'm not sure I really understand why. @aplavin could you comment on this since you were a proponent of solution 5?

The advantage of a wrapper function is that it immediately works with any function (mean, median or any custom user function), and it doesn't make the implementation of these functions more complex. So it ensures consistency across the ecosystem, without having to ensure that similar functions in all packages support the same keyword argument with the same semantics. (That's true at least if you consider that the default implementation calling isempty(x) is fast enough. Otherwise you need to override it for specific functions too.)

aplavin commented 2 months ago

@aplavin could you comment on this since you were a proponent of solution 5?

Basically what @nalimilan says, to be able to use it with any kind of function. Mean is nothing special here, and it would be much cleaner to add a single wrapper/postprocessing function than provide options to handle empty collections in each aggregation.

It doesn't preclude having the most commonly used options added directly to the most commonly used functions, like mean. The same story as with the dims argument: some reductions support it natively, but there's a fully generic eachslice mechanism to apply any aggregation over any dimension.