JuliaStats / Statistics.jl

The Statistics stdlib that ships with Julia.
https://juliastats.org/Statistics.jl/dev/

Incorrect quantiles for floating-point and integer arrays #144

Closed yurivish closed 1 year ago

yurivish commented 1 year ago

Say we have an array with a large negative number and a smaller positive number:

array = Float16[-9000, 100]

Both of these numbers are represented exactly in Float16:

julia> Float16(-9000) == -9_000
true

julia> Float16(100) == 100
true

Julia will return the wrong answer for quantile queries over this array:

julia> using Statistics

julia> quantile(Float16[-9000, 100], 1.0)
104.0 # should be 100, not 104

julia> quantile(Float16[-9000, 100], 0.999999999)
103.99999089599987 # should be approximately 100, not 104
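The 104 here is consistent with the difference b - a being rounded in Float16 before interpolating: the true difference 9100 is not representable in Float16 and rounds to 9104. A quick check of that rounding (sketched in Python, whose struct module supports the IEEE-754 half-precision format; the arithmetic is the same as Julia's Float16):

```python
import struct

def to_f16(x):
    # round a double to the nearest IEEE-754 half-precision value
    return struct.unpack('<e', struct.pack('<e', x))[0]

a, b = to_f16(-9000.0), to_f16(100.0)  # both exactly representable
diff = to_f16(b - a)   # true difference 9100 rounds to 9104 in Float16
print(diff)            # 9104.0
print(a + 1.0 * diff)  # 104.0, matching the buggy result above
```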

If we make the large number bigger, then quantile queries over Float32 and Float64 arrays are also incorrect:

julia> quantile(Float32[-1e9, 100], 1)
128.0f0 # should be 100, not 128

julia> quantile(Float32[-1e9, 100], 0.9999999999)
127.89999997615814 # should be approximately 100, not 128

julia> quantile(Float64[-1e20, 100], 1)
0.0 # should be 100, not 0

This can happen when both numbers are small:

julia> quantile(Float16[-255, 0.58], 1)
Float16(0.625) # should be 0.58, not 0.625

And it can happen when the array contains more than two numbers:

julia> quantile(Float32[-1e15, -1e14, -1e13, -1e12, -1e11, -1e10, -1e9, 100], 1)
128.0f0 # should be 100, not 128

For integers there is an interesting twist: the quantiles can fall outside the representable range of the integer type:

julia> quantile(Int8[-68, 60], 0.5)
-132.0 # should be -4, not less than typemin(Int8)
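The -132 is consistent with the difference b - a overflowing in Int8 arithmetic before the interpolation is applied: the true difference 128 wraps around to -128. A sketch of that wraparound (in Python, which has no fixed-width integers, so the two's-complement wrap is emulated by hand):

```python
def wrap_int8(x):
    # emulate Int8 two's-complement wraparound
    return ((x + 128) % 256) - 128

a, b = -68, 60
diff = wrap_int8(b - a)  # true difference is 128; Int8 wraps it to -128
print(diff)              # -128
print(a + 0.5 * diff)    # -132.0, matching the buggy result above
```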

In some cases every quantile other than the 0th percentile is incorrect. Interestingly, the values decrease as we query successively higher percentiles:

julia> for percentile in 0:0.1:1
          arr = Int16[-30000, 30000]
          result = quantile(arr, percentile)
          println("p$(rpad(Int(100*percentile), 3)) = $result")
       end
p0   = -30000.0 # correct!
p10  = -30553.600000000002 # wrong
p20  = -31107.2 # wrong
p30  = -31660.8 # wrong
p40  = -32214.4 # wrong
p50  = -32768.0 # wrong
p60  = -33321.6 # wrong
p70  = -33875.2 # wrong
p80  = -34428.8 # wrong
p90  = -34982.4 # wrong
p100 = -35536.0 # wrong

If the numbers are big enough, comparable results can be found for Int32 and Int64:

julia> quantile(Int32[-1e9, 2e9], 1.0) < typemin(Int32)
true # should be false

julia> quantile(Int[-5e18, -2e18, 9e18], 1.0) < typemin(Int)
true # should be false

Randomized testing suggests that for Int32 this behavior occurs more frequently for short integer arrays. For Int64 it looks like the error rate may not go down with array size. [Per-length error-rate estimates, based on a million samples at each length, omitted.]

The function I used to compute these estimates:

```julia
using Statistics
using Random: rand!

"""
Estimate the probability that incorrect quantiles will be produced for a
random array of a particular size and element type. Correctness is
determined using a very naïve method where we only consider a quantile
result to be incorrect if it lies outside the extrema of the array.
"""
function estimate_wrong_int_arrays(T, n, n_samples)
    count = 0
    percentiles = 0:0.01:1
    arr = zeros(T, n)
    for _ in 1:n_samples
        rand!(arr)
        # Since quantiles are returned in Float64 precision,
        # convert the extrema to that type for comparisons
        lo, hi = Float64.(extrema(arr))
        for p in percentiles
            result = quantile(arr, p)
            quantile_in_bounds = lo ≤ result ≤ hi
            if !quantile_in_bounds
                count += 1
                break
            end
        end
    end
    count / n_samples
end
```

This behavior reproduces on the current LTS release, Julia 1.6, as well as Julia 1.9.1, the current release as of 2023-06-07.

nalimilan commented 1 year ago

Thanks. The problem seems to be due to this line: https://github.com/JuliaStats/Statistics.jl/blob/f13706e623a7a5e99eade66b9dd1233b2255d490/src/Statistics.jl#L1014

E.g. with quantile(Float64[-1e20, 100], 1):

julia> (a, b, γ) = (-1.0e20, 100.0, 1)
(-1.0e20, 100.0, 1)

julia> a + γ*(b-a)
0.0

julia> (1-γ)*a + γ*b
100.0

We could change it to use (1-γ)*a + γ*b instead. But the current strategy was added by https://github.com/JuliaLang/julia/pull/16572 to fix the fact that in some cases quantiles were not increasing as they should with p. Test cases added by that PR don't pass with that change.
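The tradeoff between the two formulas is easy to see in isolation (sketched here in Python; the IEEE-754 double arithmetic is the same): a + γ*(b-a) suffers catastrophic cancellation when a and b differ greatly in magnitude, while (1-γ)*a + γ*b is exact at the endpoints γ = 0 and γ = 1.

```python
a, b, g = -1.0e20, 100.0, 1.0

# Current formula: b - a rounds to 1.0e20 (100 is far below its ulp),
# so a + g*(b - a) cancels to exactly 0.0.
lerp_current = a + g * (b - a)

# Alternative formula: exact at the endpoints g == 0 and g == 1.
lerp_alt = (1 - g) * a + g * b

print(lerp_current)  # 0.0
print(lerp_alt)      # 100.0
```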

Adding a small quantity like 4eps() to aleph at this line seems to fix the problem (this is what R does): https://github.com/JuliaStats/Statistics.jl/blob/f13706e623a7a5e99eade66b9dd1233b2255d490/src/Statistics.jl#L1001

@andreasnoack What do you think?

andreasnoack commented 1 year ago

I tried to read through the old issue. It's not clear to me why the original version causes unsorted quantiles.

nalimilan commented 1 year ago

Actually it's not the original/current version that causes unsorted quantiles; it's the one I tested to fix this issue, (1-γ)*a + γ*b.

Here's what happens using the test case from https://github.com/JuliaLang/julia/pull/16572, with quantile modified to print the intermediate values (the returned quantile uses the (1-γ)*a + γ*b formula, i.e. the buggy one). It appears that a and b are both equal to 0.41662034698690303, and γ changes from 0.2137500000000001 for the 6th requested quantile to 0.6425000000000001 for the 7th.

julia> y = [0.40003674665581906,0.4085630862624367,0.41662034698690303,0.41662034698690303,0.42189053966652057,0.42189053966652057,0.42553514344518345,0.43985732442991354]

julia> quantile(y, range(0.01, 0.99, length=17)[6])
(a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.2137500000000001)
0.4166203469869031

julia> (a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.2137500000000001)
(0.41662034698690303, 0.41662034698690303, 0.2137500000000001)

julia> a + γ*(b-a)
0.41662034698690303

julia> (1-γ)*a + γ*b
0.4166203469869031

julia> quantile(y, range(0.01, 0.99, length=17)[7])
(a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.6425000000000001)
0.416620346986903

julia> (a, b, γ) = (0.41662034698690303, 0.41662034698690303, 0.6425000000000001)
(0.41662034698690303, 0.41662034698690303, 0.6425000000000001)

julia> a + γ*(b-a)
0.41662034698690303

julia> (1-γ)*a + γ*b
0.416620346986903
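The rounding above can be reproduced without the quantile machinery (again sketched in Python, where the double arithmetic is identical): with a == b, the formula (1-γ)*a + γ*b rounds differently for the two γ values, giving the larger result for the smaller γ, while a + γ*(b-a) returns a exactly because b - a == 0.

```python
a = b = 0.41662034698690303
g6, g7 = 0.2137500000000001, 0.6425000000000001

r6 = (1 - g6) * a + g6 * b
r7 = (1 - g7) * a + g7 * b
print(r6, r7)  # two different values, with r6 > r7: not monotone in γ

# The current formula is exact here, since b - a == 0.0.
print(a + g6 * (b - a) == a, a + g7 * (b - a) == a)  # True True
```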

I'm not sure what can be done about this. We could easily check whether a == b and return one of them in that case. But the same issue can happen with very close numbers, like this (taking the same numbers as before):

julia> (a, γ) = (0.41662034698690303, 0.2137500000000001)
(0.41662034698690303, 0.2137500000000001)

julia> (1-γ)*nextfloat(a) + γ*a
0.4166203469869031

julia> (a, γ) = (0.41662034698690303, 0.6425000000000001)
(0.41662034698690303, 0.6425000000000001)

julia> (1-γ)*nextfloat(a) + γ*a
0.41662034698690303

Maybe we should check whether the result is approximately equal to a or b, or whether a and b are approximately equal, and if so return a + γ*(b-a) like before (since with almost equal numbers there's no risk of precision loss due to subtraction)? This is almost equivalent to always returning a (or b...) but slightly cleaner.
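A sketch of that proposal (in Python, with math.isclose's default tolerance standing in for whatever "approximately equal" test would actually be used; the tolerance choice is the open question, and this is not the actual Statistics.jl implementation):

```python
import math

def lerp_quantile(a, b, g):
    # Hypothetical interpolation combining the two formulas discussed
    # above.
    if math.isclose(a, b):
        # Nearly equal endpoints: no cancellation risk in b - a, and
        # a + g*(b - a) returns a exactly when a == b.
        return a + g * (b - a)
    # Otherwise use the formula that is exact at g == 0 and g == 1.
    return (1 - g) * a + g * b

print(lerp_quantile(-1.0e20, 100.0, 1.0))            # 100.0, not 0.0
a = 0.41662034698690303
print(lerp_quantile(a, a, 0.2137500000000001) == a)  # True
```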