JuliaStats / StatsBase.jl

Basic statistics for Julia
Other
585 stars 191 forks source link

Inf and NaN in weights #904

Open aplavin opened 10 months ago

aplavin commented 10 months ago

Currently, weights throws an error whenever it encounters non-finite values: ArgumentError: weights cannot contain Inf or NaN values. That's highly surprising IMO, basically all other operations propagate NaNs and allow Infs – both basic arithmetics and aggregations.

# totally sensible and unambiguous results:
julia> median([1,2,3,Inf])
2.5
julia> mean([1,2,3,Inf])
Inf
julia> mean([1,2,3,Inf], weights([1,1,1,1]))
Inf

julia> mean([1,2,3,NaN])
NaN
julia> median([1,2,3,NaN])
NaN

# and suddenly:
julia> mean([1,2,3,4], weights([1,1,Inf,1]))
ERROR: ArgumentError: weights cannot contain Inf or NaN values
julia> mean([1,2,3,4], weights([1,1,NaN,1]))
ERROR: ArgumentError: weights cannot contain Inf or NaN values

I looked up as found that this error was introduced after the issue https://github.com/JuliaStats/StatsBase.jl/issues/671. I think throwing this exception denies perfectly sensible behavior, so that weighted aggregations worked exactly as their unweighted counterparts – by propagating NaNs. And the throw in weights() should be removed, maybe adding it in specific aggregations where Infs/NaNs really do not make any sense (eg, mean and median should work).

Numpy does propagate, of course:

In [2]: np.average([1,2,3], weights=[1,np.nan,3])
Out[2]: nan
aplavin commented 9 months ago

Bump...

aplavin commented 8 months ago

Gentle bump... I understand that developer time is always scarce, so: would a PR fixing this be accepted?

devmotion commented 8 months ago

I wasn't involved in the previous discussions in #671 and https://github.com/JuliaStats/StatsBase.jl/pull/814, but my impression was that the errors are intentional and the previous behaviour (no errors) was considered a bug. I think I tend to agree with this view since weights are supposed to sum to 1 (or at least it should be able to normalize them to such weights), which is not possible if a weight is NaN or Inf.

aplavin commented 8 months ago

The whole point of NaN is that it propagates, and plain aggregations play nicely with that – mean of an array with NaN is NaN. This is natural and convenient when doing many aggregations, with a few of them having NaNs: they just appear in the correct locations in the results, instead of requiring try/catch or manual isnan checks everywhere. It's surprising and inconvenient that as soon as one needs weighted aggregations, suddenly there's no way around – manual trycatch/isnan checks are required.

devmotion commented 8 months ago

I don't see a clear inconsistency, there's no restriction on the values and Inf/NaN propagate in the same way for weighted operations:

julia> using StatsBase

julia> mean([1,2,3,Inf])
Inf

julia> mean([1,2,3,Inf], weights([1,1,1,1]))
Inf

julia> mean([1,2,3,NaN])
NaN

julia> mean([1,2,3,NaN], weights([1,1,1,1]))
NaN

The restriction only applies to the weights themselves because they have to sum up to 1.

aplavin commented 8 months ago

The restriction only applies to the weights themselves because they have to sum up to 1.

And if they don't because there's a NaN – then the result is... Not a Number :) Exactly the same as with mean([1, NaN]) that already returns NaN because there's no actual number that represents the result. I don't really see any difference in these situations, and the same motivation works in both cases.

Consider two scenarios:

Have an array of arrays, and want to compute mean of each inner array:

mean.(A)

and NaN are correctly propagated.

Have an array of arrays of values, and array of arrays of weights, and want to compute weighted means?

mean.(A, weight.(W))

doesn't work as soon as there are any NaNs inside W, and requires manual handling.

aplavin commented 8 months ago

For those stumbling across the same issue: AbstractWeights and many functions that take them actually support NaNs. Defining another type MyWeights <: AbstractWeights without NaN checks in the constructor will work just fine!

Or, even easier – just overload the Weights type constructor by putting this code anywhere you like:

using StatsBase
@generated Weights{S,T,V}(values, sum) where {S<:Real, T<:Real, V<:AbstractVector{T}} = Expr(:new, Weights{S,T,V}, :values, :sum)

That's what I personally do.

It doesn't change any existing behavior, just adds support for non-finite values in weight vectors. Then:

julia> mean([1,2], weights([1,NaN]))
NaN