perrette opened this issue 4 months ago
I am the author of the function and I agree with all these issues: they all come down to the arbitrariness of the first weight. Additionally, the part of the weighted quantile function that removes zero weights is probably wrong, as it means that weights exactly equal to zero versus merely close to zero will give discontinuous results.
That being said, I cannot think of a good fix. I just cannot see how the quantile definition used by Julia can be naturally extended to weighted vectors without implying weird edge cases. If you can find a more natural way, feel free to open a PR. Multiplying all weights by a large number so that they become integers would not really solve your issues either, since replacing a 0 weight by eps() would still lead to very different results.
Hi @matthieugomez, thanks for replying. I am not familiar with Julia's quantile function, but in case it advances this discussion in any way, I use the following function in Python:
```python
import numpy as np

def weighted_quantiles(values, weights, quantiles, interpolate=False):
    """
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), 0.5)
    2
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.ones(4), 0.5, interpolate=True)
    2.5
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1000, 1, 1, 1]), 0.5)
    1
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1000, 1, 1]), 0.5)
    2
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1000, 1]), 0.5)
    3
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1, 1000]), 0.5)
    4
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1000, 1, 1, 1]), 0.5, interpolate=True)
    1.002997002997003
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1000, 1, 1]), 0.5, interpolate=True)
    2.000999000999001
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1000, 1]), 0.5, interpolate=True)
    2.999000999000999
    >>> weighted_quantiles(np.array([1, 2, 3, 4]), np.array([1, 1, 1, 1000]), 0.5, interpolate=True)
    3.9970029970029968
    """
    # Sort the values and carry the weights along.
    i = values.argsort()
    sorted_weights = weights[i]
    sorted_values = values[i]
    # Cumulative weight up to and including each sorted value.
    Sn = sorted_weights.cumsum()
    if interpolate:
        # Place each value at the midpoint of its weight interval,
        # normalized to [0, 1], then interpolate linearly.
        Pn = (Sn - sorted_weights / 2) / Sn[-1]
        return np.interp(quantiles, Pn, sorted_values)
    else:
        # Step definition: first value whose cumulative weight reaches
        # the requested fraction of the total weight.
        return sorted_values[np.searchsorted(Sn, quantiles * Sn[-1])]
```
After a discussion here, the interpolate=True version serves my practical purpose well (I also have an equivalent, badly implemented Julia function that I do not dare share here).
Hi, I clearly lack subtlety about the various definitions of weighted quantiles (and I passed quickly over the above discussion as a result), but I thought I'd share another, more obvious example of what the current implementation is doing.
FrequencyWeights seems to do what I'd expect:
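Something along these lines (a minimal sketch of the kind of call at issue, assuming StatsBase's `quantile(v, w, p)` method; the output noted in the comment is what I'd expect):

```julia
using StatsBase

# Nearly all of the mass sits on the first element, so the weighted
# median is pulled down to it:
quantile([1, 2, 3, 4], FrequencyWeights([1000, 1, 1, 1]), 0.5)  # 1.0
```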
unlike ProbabilityWeights, which seems to ignore the first weight:
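For instance (again a sketch, with the output I believe the current implementation produces):

```julia
using StatsBase

# Same data and weights, but as probability weights: the huge first
# weight effectively drops out of the interpolation formula, so the
# result lands near the unweighted median of the remaining points.
quantile([1, 2, 3, 4], ProbabilityWeights([1000, 1, 1, 1]), 0.5)  # ≈ 2.5, not ≈ 1
```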
Worse, it is numerically inaccurate: a tiny but nonzero weight gives a very different answer, whereas setting it to exactly zero yields the same (and, for me, expected) result as FrequencyWeights:
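A sketch of that contrast (the exact output in the first call depends on floating-point rounding, which is precisely the problem):

```julia
using StatsBase

# Tiny but nonzero weights: the result lands far from 1.0 and is
# sensitive to rounding in the cumulative weights.
quantile([1, 2, 3, 4], ProbabilityWeights([1.0, eps(), eps(), eps()]), 0.5)

# Exactly-zero weights are dropped internally, so this recovers the
# FrequencyWeights answer of 1.0:
quantile([1, 2, 3, 4], ProbabilityWeights([1.0, 0.0, 0.0, 0.0]), 0.5)
```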
IMO the above shows surprising results that may go beyond the difference between various definitions: especially that the first weight is ignored, and possibly the discontinuity in the limit where the weights become concentrated on one element (though OK, discontinuities are part of mathematics, but they often don't help when analyzing real data).
Anyway, for me the workaround will be to multiply my weights by a large number, convert them to integers, and use FrequencyWeights instead of ProbabilityWeights.
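A sketch of that workaround (the weights and the scale factor 10^6 here are made up for illustration):

```julia
using StatsBase

pw = [0.997, 0.001, 0.001, 0.001]               # probability-style weights
fw = FrequencyWeights(round.(Int, pw .* 10^6))  # rescale and round to integers
quantile([1, 2, 3, 4], fw, 0.5)                 # behaves like the FrequencyWeights example above
```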
Originally posted by @perrette in https://github.com/JuliaStats/StatsBase.jl/issues/435#issuecomment-2162412722