Open aplavin opened 10 months ago
Bump...
Gentle bump... I understand that developer time is always scarce, so: would a PR fixing this be accepted?
I wasn't involved in the previous discussions in #671 and https://github.com/JuliaStats/StatsBase.jl/pull/814, but my impression was that the errors are intentional and the previous behaviour (no errors) was considered a bug. I think I tend to agree with this view since weights are supposed to sum to 1 (or at least it should be able to normalize them to such weights), which is not possible if a weight is NaN
or Inf
.
The whole point of NaN
is that it propagates, and plain aggregations play nicely with that – mean
of an array with NaN
is NaN
. This is natural and convenient when doing many aggregations, with a few of them having NaNs: they just appear in the correct locations in the results, instead of requiring try/catch
or manual isnan
checks everywhere.
It's surprising and inconvenient that as soon as one needs weighted aggregations, suddenly there's no way around – manual trycatch/isnan checks are required.
I don't see a clear inconsistency, there's no restriction on the values and Inf
/NaN
propagate in the same way for weighted operations:
julia> using StatsBase
julia> mean([1,2,3,Inf])
Inf
julia> mean([1,2,3,Inf], weights([1,1,1,1]))
Inf
julia> mean([1,2,3,NaN])
NaN
julia> mean([1,2,3,NaN], weights([1,1,1,1]))
NaN
The restriction only applies to the weights themselves because they have to sum up to 1.
The restriction only applies to the weights themselves because they have to sum up to 1.
And if they don't because there's a NaN
– then the result is... Not a Number :)
Exactly the same as with mean([1, NaN])
that already returns NaN
because there's no actual number that represents the result. I don't really see any difference in these situations, and the same motivation works in both cases.
Consider two scenarios:
Have an array of arrays, and want to compute mean
of each inner array:
mean.(A)
and NaN
are correctly propagated.
Have an array of arrays of values, and array of arrays of weights, and want to compute weighted mean
s?
mean.(A, weight.(W))
doesn't work as soon as there are any NaN
s inside W
, and requires manual handling.
For those stumbling across the same issue: AbstractWeights
and many functions that take them actually support NaNs. Defining another type MyWeights <: AbstractWeights
without NaN checks in the constructor will work just fine!
Or, even easier – just overload the Weights
type constructor by putting this code anywhere you like:
using StatsBase
@generated Weights{S,T,V}(values, sum) where {S<:Real, T<:Real, V<:AbstractVector{T}} = Expr(:new, Weights{S,T,V}, :values, :sum)
That's what I personally do.
It doesn't change any existing behavior, just adds support for non-finite values in weight vectors. Then:
julia> mean([1,2], weights([1,NaN]))
NaN
Currently,
weights
throws an error whenever it encounters non-finite values:ArgumentError: weights cannot contain Inf or NaN values
. That's highly surprising IMO, basically all other operations propagate NaNs and allow Infs – both basic arithmetics and aggregations.I looked up as found that this error was introduced after the issue https://github.com/JuliaStats/StatsBase.jl/issues/671. I think throwing this exception denies perfectly sensible behavior, so that weighted aggregations worked exactly as their unweighted counterparts – by propagating NaNs. And the throw in
weights()
should be removed, maybe adding it in specific aggregations where Infs/NaNs really do not make any sense (eg, mean and median should work).Numpy does propagate, of course: