Closed yurivish closed 1 year ago
Thanks. The problem seems to be due to this line: https://github.com/JuliaStats/Statistics.jl/blob/f13706e623a7a5e99eade66b9dd1233b2255d490/src/Statistics.jl#L1014
E.g. with `quantile(Float64[-1e20, 100], 1)`:
julia> (a, b, γ) = (-1.0e20, 100.0, 1)
(-1.0e20, 100.0, 1)
julia> a + γ*(b-a)
0.0
julia> (1-γ)*a + γ*b
100.0
We could change it to use `(1-γ)*a + γ*b` instead. But the current strategy was added by https://github.com/JuliaLang/julia/pull/16572 to fix the fact that in some cases quantiles were not increasing as they should with `p`. Test cases added by that PR don't pass with that change.
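The `0.0` result comes from the subtraction rather than from the final addition. A small sketch of why, using only the values from the example above: near `1e20` adjacent `Float64` values are 16384 apart, so `b - a` cannot register the `100` and rounds back to `1.0e20`.

```julia
a, b, γ = -1.0e20, 100.0, 1

# Spacing between adjacent Float64 values near 1e20
@show eps(1.0e20)

# The exact difference 1e20 + 100 is not representable; it rounds to 1e20
@show b - a == 1.0e20

# Adding the rounded difference back to a therefore cancels exactly
@show a + γ * (b - a)

# The convex-combination form never computes b - a
@show (1 - γ) * a + γ * b
```

The second form avoids the cancellation here, but as noted it is the one that fails the monotonicity tests from the earlier PR.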
Adding a small quantity like `4eps()` to `aleph` at this line seems to fix the problem (this is what R does):
https://github.com/JuliaStats/Statistics.jl/blob/f13706e623a7a5e99eade66b9dd1233b2255d490/src/Statistics.jl#L1001
@andreasnoack What do you think?
I tried to read through the old issue. It's not clear to me why the original version causes unsorted quantiles.
Actually it's not the original/current version which causes unsorted quantiles, it's the one I tested (`(1-γ)*a + γ*b`) to fix this issue.
Here's what happens using the test case from https://github.com/JuliaLang/julia/pull/16572 and making `quantile` print the values (the returned quantile is the one computed with the `(1-γ)*a + γ*b` formula, i.e. the buggy one). It appears that `a` and `b` are both equal to `0.41662034698690303`, and `γ` changes from `0.2137500000000001` for quantile 6 to `0.6425000000000001` for quantile 7.
julia> y = [0.40003674665581906,0.4085630862624367,0.41662034698690303,0.41662034698690303,0.42189053966652057,0.42189053966652057,0.42553514344518345,0.43985732442991354]
julia> quantile(y, range(0.01, 0.99, length=17)[6])
(a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.2137500000000001)
0.4166203469869031
julia> (a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.2137500000000001)
(0.41662034698690303, 0.41662034698690303, 0.2137500000000001)
julia> a + γ*(b-a)
0.41662034698690303
julia> (1-γ)*a + γ*b
0.4166203469869031
julia> quantile(y, range(0.01, 0.99, length=17)[7])
(a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.6425000000000001)
0.416620346986903
julia> (a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.6425000000000001)
(0.41662034698690303, 0.41662034698690303, 0.6425000000000001)
julia> a + γ*(b-a)
0.41662034698690303
julia> (1-γ)*a + γ*b
0.416620346986903
I'm not sure what can be done about this. We could easily check whether `a == b` and return one of them in that case. But the same issue can happen with very close numbers, like this (taking the same numbers as before):
julia> (a, γ) = (0.41662034698690303, 0.2137500000000001)
(0.41662034698690303, 0.2137500000000001)
julia> (1-γ)*nextfloat(a) + γ*a
0.4166203469869031
julia> (a, γ) = (0.41662034698690303, 0.6425000000000001)
(0.41662034698690303, 0.6425000000000001)
julia> (1-γ)*nextfloat(a) + γ*a
0.41662034698690303
Maybe we should check whether the result is approximately equal to `a` or `b`, or whether `a` and `b` are approximately equal, and if so return `a + γ*(b-a)` like before (since with almost equal numbers there's no risk of precision loss due to subtraction)? This is almost equivalent to always returning `a` (or `b`...) but slightly cleaner.
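A minimal sketch of that idea, to make the proposal concrete. The helper name `interpolate_hybrid` is hypothetical and is not part of Statistics.jl, and using `isapprox` with its default relative tolerance as the "approximately equal" test is an assumption made purely for illustration.

```julia
# Hypothetical helper, not part of Statistics.jl.
# Interpolates between a and b with weight γ in [0, 1].
function interpolate_hybrid(a::Real, b::Real, γ::Real)
    if isapprox(a, b)
        # Nearly equal endpoints: b - a loses little or nothing,
        # and this form is monotone in γ.
        return a + γ * (b - a)
    else
        # Endpoints far apart: avoid forming b - a, which can
        # swamp the smaller endpoint when magnitudes differ widely.
        return (1 - γ) * a + γ * b
    end
end

# Widely separated endpoints from the first example
@show interpolate_hybrid(-1.0e20, 100.0, 1)

# Equal endpoints from the monotonicity test case
a = 0.41662034698690303
@show interpolate_hybrid(a, a, 0.2137500000000001)
@show interpolate_hybrid(a, a, 0.6425000000000001)
```

Whether switching formulas at a tolerance threshold preserves monotonicity for every input is not established by this sketch; it only covers the two cases discussed in this thread.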
Say we have an array with a large negative number and a smaller positive number:
Both of these numbers are represented exactly in `Float16`:
Julia will return the wrong answer for quantile queries over this array:
If we make the large number bigger, then quantile queries over `Float32` and `Float64` arrays are also incorrect:
This can happen when both numbers are small:
And it can happen when the array contains more than two numbers:
For integers there’s the interesting twist – the quantiles can exceed the representable values of the integer type:
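A minimal sketch of how a result can leave the integer range, assuming the interpolation step `a + γ*(b-a)` from the discussion above is evaluated while `a` and `b` are still values of the integer element type. The specific values are illustrative and are not taken from the report.

```julia
# Extrema of Int8, chosen for illustration
a, b = Int8(-128), Int8(127)
γ = 0.5

# Int8 subtraction wraps around: the true difference 255 does not fit
@show b - a

# The wrapped difference pushes the result below typemin(Int8)
@show a + γ * (b - a)

# Converting to Float64 before interpolating avoids the wraparound
@show Float64(a) + γ * (Float64(b) - Float64(a))
```

Here the wrapped difference is `-1`, so the interpolated value is `-128.5`, while the computation carried out in `Float64` gives `-0.5`.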
In some cases every quantile other than the 0th percentile is incorrect. Interestingly, the values decrease as we query successively higher percentiles:
If the numbers are big enough, comparable results can be found for `Int32` and `Int64`:
Randomized testing suggests that for `Int32` this behavior occurs more frequently for short integer arrays. Based on a million samples at each length, approximately …
For `Int64` it looks like the error rate may not go down as array size decreases, and approximately …
The function I used to compute these estimates:
```julia
using Random: rand!
using Statistics: quantile

"""
Estimate the probability that incorrect quantiles will be produced
for a random array of a particular size and element type.
Correctness is determined using a very naïve method where we only
consider a quantile result to be incorrect if it lies outside the
extrema of the array.
"""
function estimate_wrong_int_arrays(T, n, n_samples)
    count = 0
    percentiles = 0:0.01:1
    arr = zeros(T, n)
    for _ in 1:n_samples
        rand!(arr)
        # Since quantiles are returned in Float64 precision,
        # convert the extrema to that type for comparisons
        lo, hi = Float64.(extrema(arr))
        for p in percentiles
            result = quantile(arr, p)
            quantile_in_bounds = lo ≤ result ≤ hi
            if !quantile_in_bounds
                count += 1
                break
            end
        end
    end
    count / n_samples
end
```
This behavior reproduces on the current LTS release, Julia 1.6, as well as Julia 1.9.1, the current release as of 2023-06-07.