JuliaStats / Statistics.jl

The Statistics stdlib that ships with Julia.
https://juliastats.org/Statistics.jl/dev/
Other
71 stars 40 forks source link

The `quantile` function can return incorrect results for integer arrays (Int8, Int16, Int32) #119

Closed yurivish closed 11 months ago

yurivish commented 2 years ago

For example, using quantile to compute the maximum of the array [-128, 127] incorrectly returns -129 if the array consists of Int8 values:

julia> using Statistics

julia> quantile([-128, 127], 1)     # right
127

julia> quantile(Int8[-128, 127], 1) # wrong
-129

This happens because of the following code in _quantile:

https://github.com/JuliaStats/Statistics.jl/blob/0588f2cf9e43f9f72af5802feaf0af4b652c3257/src/Statistics.jl#L1010

For the above array, that line computes the following expression, causing the incorrect result:

julia> Int8(-128) + Int64(Int8(127) - Int8(-128))
-129

This behavior occurs with arrays of Int8, Int16, and Int32:

julia> function test(type)
           A = [typemin(type), typemax(type)]
           result = quantile(A, 1)
           correct_result = quantile(map(Int, A), 1)
           (; result, correct_result)
       end;

julia> test(Int8)
(result = -129, correct_result = 127)

julia> test(Int16)
(result = -32769, correct_result = 32767)

julia> test(Int32)
(result = -2147483649, correct_result = 2147483647)

Using quantile to compute the median also returns an answer that is not correct:

julia> quantile(Int8[-128, 127], .5)
-128.5

The equivalent function in NumPy, np.quantile, returns the correct result in all cases:

julia> using PyCall

julia> np = pyimport("numpy");

julia> a = np.array(Int8[-128, 127], dtype=np.int8);

julia> np.quantile(a, 1)
127

julia> function test_python()
    quantiles = (1, 0.5)
    types = (
        Int8 => np.int8,
        Int16 => np.int16,
        Int32 => np.int32
    )
    for q in quantiles, (type, dtype) in types
        A = np.array([typemin(type), typemax(type)]; dtype)
        result = np.quantile(A, q)
        correct_result = quantile(map(Int, A), q)
        println((; q, type, result, correct_result))
    end
end;

julia> test_python()
(q = 1, type = Int8, result = 127, correct_result = 127)
(q = 1, type = Int16, result = 32767, correct_result = 32767)
(q = 1, type = Int32, result = 2147483647, correct_result = 2147483647)
(q = 0.5, type = Int8, result = -0.5, correct_result = -0.5)
(q = 0.5, type = Int16, result = -0.5, correct_result = -0.5)
(q = 0.5, type = Int32, result = -0.5, correct_result = -0.5)
nalimilan commented 11 months ago

Fixed by https://github.com/JuliaStats/Statistics.jl/pull/145.