wmedian: the first weight is ignored when ProbabilityWeights is used

perrette commented 4 months ago

Hi, I clearly lack subtelty about the various definitions of weighted quantiles (and I passed quickly over the above discussion as a result), but I thought I'd share another, more obvious example of what what the current implementation is doing:

The frequency weights seems to do what I'd expect:

julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1000, 1, 1, 1]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1000, 1, 1]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1000, 1]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.FrequencyWeights([1, 1, 1, 1000]))
4.0

unlike the ProbabilityWeights:

julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1000, 1, 1, 1]/1003))
2.500000000000056
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1000, 1, 1]/1003))
1.501
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1000, 1]/1003))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1, 1, 1, 1000]/1003))
3.499

which seems to:

ignore the first weight purely and simply (isn't that a bug ?? -- it's hard to believe it comes down to definition, unless it's a definition for continuous analytics applied bluntly to discrete arrays -- I ask the author(s) to please excuse me, I mean no offense)
takes the mean between whatever two numbers the (weighted) median falls into, which can never be exactly reached

Worse, it is numerically inaccurate:

julia> eps = 1e-50 ;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
4.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
1.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
2.5
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
3.5

whereas setting epsilon to exactly zero yields the same (and for me, expected) result as FrequencyWeights

julia> eps = 0.;
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([1-3*eps, eps, eps, eps]))
1.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, 1-3*eps, eps, eps]))
2.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, 1-3*eps, eps]))
3.0
julia> StatsBase.wmedian([1, 2, 3, 4], StatsBase.ProbabilityWeights([eps, eps, eps, 1-3*eps]))
4.0

IMO the above shows surprising results that may go beyond the difference between various definitions (especially that the first weight is ignored, and possibly the discontinuity at the limit when the weights tend toward being concentrated on one element, though OK, discontinuities are parts of mathematics -- but they often don't help when analyzing real data).

Anyway, for me the workaround will be to multiply my weights by a large number, convert to integers, and use FrequencyWeights instead of ProbabilityWeights.

Originally posted by @perrette in https://github.com/JuliaStats/StatsBase.jl/issues/435#issuecomment-2162412722

matthieugomez commented 3 months ago

I am the author of the function agree with all these issues — they all come down to the arbitrariness of the first weight. Additionally, the part of the weighted quantile function where one removes the zero is probably wrong as it means that true zeroes and things equal to zero or close to zero will give discontinuous results.

That being said, I cannot think of a good fix. I guess I just really can't see how the quantile definition used by Julia can be naturally extended to weighted vector without implying weird edge cases. If you can find a more natural way, feel free to do a PR. Multiplying all weights by a large number so that they become integers would not really solve your issues since replacing a 0 weight by eps() would lead to very different results too.

perrette commented 3 months ago

Hi @matthieugomez, thanks for replying. I am not familiar with Julia 's quantile function, but if that can advance in any way this discussion, I use the following function in python:

import numpy as np

def weighted_quantiles(values, weights, quantiles, interpolate=False):
    """
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), 0.5)
    2 
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), 0.5, interpolate=True)
    2.5
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1000, 1, 1, 1]), 0.5)
    1
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1000, 1, 1]), 0.5)
    2
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1000, 1]), 0.5)
    3
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1, 1000]), 0.5)    
    4
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1000, 1, 1, 1]), 0.5, interpolate=True)
    1.002997002997003
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1000, 1, 1]), 0.5, interpolate=True)
    2.000999000999001
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1000, 1]), 0.5, interpolate=True)
    2.999000999000999
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1, 1000]), 0.5, interpolate=True)    
    3.9970029970029968
    """
    i = values.argsort()
    sorted_weights = weights[i]
    sorted_values = values[i]
    Sn = sorted_weights.cumsum()

    if interpolate:
        Pn = (Sn - sorted_weights/2 ) / Sn[-1]
        return np.interp(quantiles, Pn, sorted_values)
    else:
        return sorted_values[np.searchsorted(Sn, quantiles * Sn[-1])]

After a discussion here

The interpolated=True version serves my practical purpose well (I also have an equivalent, badly implemented Julia function that I do not dare share here).

JuliaStats / StatsBase.jl

wmedian: the first weight is ignored when ProbabilityWeights is used #933