JuliaStats / StatsFuns.jl

Mathematical functions related to statistics.
Other
235 stars 40 forks source link

Wrong quantiles for the binomial distribution #170

Open Martin-exp-z2 opened 1 month ago

Martin-exp-z2 commented 1 month ago

Hello everyone,

I have noticed that the call quantile(Binomial(5000, 0.998575), 0.0005) returns 5000, which is wrong. Consider:

cdf(Binomial(5000, 0.998575), 4982)
## 0.0004383433701278433
cdf(Binomial(5000, 0.998575), 4983)
## 0.001143721045677184

which return correct values (virtually the same as the corresponding calls of scipy.stats.binom.cdf in Python). Thus, the above-mentioned call of quantile() should return 4983.

In addition, in my opinion, it would be more natural and useful if the function cquantile() returned the upper quantile. In particular, this would allow the user to get the middle median by calling (quantile(dist, 0.5) + cquantile(dist, 0.5))/2.

I am using Distributions v0.25.111 in Julia v1.10.5.

Thank you very much in advance for considering the issue!

andreasnoack commented 1 month ago

Thanks for reporting this. The issue is actually in StatsFuns so I will move the issue.

julia> binominvcdf(5000.0, 0.998575, 0.0005)
5000.0

However, for the quantile function, we actually rely on Rmath so it is a bit surprising and might be an issue with the version of Rmath that we use.

andreasnoack commented 1 month ago

Looked a bit into this and it appears that it is actually a bug in R that was fixed relatively recently, see https://bugs.r-project.org/show_bug.cgi?id=18711. So we should upgrade Rmath-julia and Rmath_jll. I'm initiating the process.

andreasnoack commented 1 month ago

Once https://github.com/JuliaPackaging/Yggdrasil/pull/9413 has been merged, this issue should be resolved. It would be good to add a test case, though. @Martin-exp-z2 would you be able to prepare a PR with a test?

andreasnoack commented 1 month ago

I can confirm that this now works

(StatsFuns) pkg> up
    Updating registry at `~/.julia/registries/General.toml`

   Installed Rmath ───── v0.8.0
   Installed Rmath_jll ─ v0.5.1+0
  Downloaded artifact: Rmath
    Updating `~/.julia/dev/StatsFuns/Project.toml`
  [79098fc4] ↑ Rmath v0.7.1 ⇒ v0.8.0
    Updating `~/.julia/dev/StatsFuns/Manifest.toml`
  [79098fc4] ↑ Rmath v0.7.1 ⇒ v0.8.0
  [f50d1b31] ↑ Rmath_jll v0.4.3+0 ⇒ v0.5.1+0
Precompiling project...
  3 dependencies successfully precompiled in 4 seconds. 12 already precompiled.

julia> using StatsFuns

julia> binominvcdf(5000.0, 0.998575, 0.0005)
4983.0
Martin-exp-z2 commented 1 month ago

@andreasnoack , thank you very much indeed for fixing the issue! I was not aware that the computation of quantiles in Julia actually relies on R routines. In future, it might be a good idea to write all the computation in Julia itself: I presume that this would make it even more efficient. However, I am aware that this requires considerable effort, which might be better spent elsewhere.

I am ready to prepare a program which would test the correctness of the computation of quantiles of instances of the binomial distribution suitably chosen at random (both parameters as well as quantile levels). Would that be OK? However, I am quite busy in general. Should the program not be ready in a few days, please remind me.

andreasnoack commented 1 month ago

In future, it might be a good idea to write all the computation in Julia itself

We are working towards that, see https://github.com/JuliaStats/StatsFuns.jl/issues/37, but indeed it takes time to implement these functions. We try to compose these out if functions from SpecialFunctions.jl and HypergeometricFunctions.jl to avoid doing too much of the heavy lifting here. For the binomial distribution, we have everything except for quantile functions (plural because of the complementary and log versions). Whenever, we don't have native implementations, we fall back to using Rmath, see e.g. https://github.com/JuliaStats/StatsFuns.jl/blob/5d0bf33f370ffcab335b46683903acd51ad08135/src/distrs/binom.jl#L4-L14

Regarding the testing then we mostly rely on testing against values from R since R is generally considered battle tested and we test a relatively small number of combinations. See e.g. https://github.com/JuliaStats/StatsFuns.jl/blob/5d0bf33f370ffcab335b46683903acd51ad08135/test/rmath.jl#L219-L231. Hence, my suggestion was actually just to append the values I linked to with the values that you used when reporting the issue. A more elaborate test setup would also be useful but it could easily take some time to complete so I think it is fine with the smaller test for now.

Martin-exp-z2 commented 1 month ago

@andreasnoack , could you please help me update to the newest version of Rmath? I get:

(@v1.10) pkg> st Rmath
Status `~/.julia/environments/v1.10/Project.toml`
⌅ [79098fc4] Rmath v0.7.1
Info Packages marked with ⌅ have new versions available but compatibility constraints restrict them from upgrading. To see why use `status --outdated`

(@v1.10) pkg> status --outdated
Status `~/.julia/environments/v1.10/Project.toml`
⌅ [79098fc4] Rmath v0.7.1 (<v0.8.0): StatsFuns

(@v1.10) pkg> up
    Updating registry at `~/.julia/registries/General.toml`
  No Changes to `~/.julia/environments/v1.10/Project.toml`
  No Changes to `~/.julia/environments/v1.10/Manifest.toml`

Many thanks!

andreasnoack commented 1 month ago

I just realized that a new version of StatsFuns is needed here because of the minor version bump of Rmath. Once https://github.com/JuliaRegistries/General/pull/115053 is available, you should be able to upgrade.

Martin-exp-z2 commented 1 month ago

@andreasnoack , great! Now it works on my computer, too! Please find a testing program below (I was unable to attach it as a file, but maybe it is meant to be this way). After testing 2M randomly selected instances of the binomial distribution, no error so far.

#!/usr/bin/env julia

#=
Tests the correctness of quantiles of randomly selected instances of the
binomial distribution.

The number of trials is chosen according to the geometric distribution.
For each selected number of trials, two success probabilities are selected:
one according the uniform distribution and the other one according to a
beta distribution concentrated near a randomly selected endpoint 0 or 1.

Martin Raič (GitHub: Martin-exp-z2), September 2024
=#

using Distributions
using StatsFuns
using Printf

### SETTINGS

# Choice of instances of the binomial family

"2 N instances are being tested."
N = 1000000

"Expected number of trials"
n_e = 5000

"""
Typically, the expected distance of the success probability to the boundary
multiplied by the number of trials.
Refers to the success probability chosen according to the beta distribution.
"""
p_e = 20

"Testing is roughly restricted to levels between tol/2 and 1 - tol/2"
tol = 1e-6

"The function which actually computes quantiles of the binomial distrubutions"
# q_binom(n::Integer, p::Real, l::Real) = quantile(Binomial(n, p), l)
q_binom = binominvcdf

# Display settings

"""
Possible values:
  :DETAILED : each instance of the binomial distribution being tested is displayed
  :PROGRESS : progress is displayed
"""
output_mode = :PROGRESS

"Determines how often the progress is refreshed."
progress_perc_span = 1

progress_perc_format(p::Real) = @sprintf("%.0f", p)

### MAIN FUNCTIONS

"""
Tests a quantile of Binomial(n, p) at a random level,
so chosen that the lower quantile at this level equals k.
For X ~ Binomial(n, p), there should be
cp0 = P(X < k) and cp1 = P(X <= k).
Thus, the latter two parameters are meant to be pre-computed.
The level l is chosen so that cp0 < l <= cp1.
"""
function q_binom_test(n::Integer, p::Real, k::Integer, cp0::Real, cp1::Real)
  l = cp1 - rand()*(cp1 - cp0)
  q = q_binom(n, p, l)
  if q != k
    println("""

    The quantile of binomial($n, $p) at level $l does not match:
    the returned value is $q, while the correct value is $k.
    """)
    exit(-1)
  end
end

"""
Tests quantiles of Binomial(n, p).
For each value k with P(X <= k), P(X >= k) >= tol/2,
q_binom_test(n, p, k, cp0, cp1) is called with appropriate cp0 and cp1.
"""
function q_binom_test(n::Integer, p::Real)
  if output_mode == :DETAILED
    print("Testing quantiles of Binomial($n, $p) ... ")
  end
  kk = round(Int, n*p)
  cp = cdf(Binomial(n, p), kk)
  k = kk
  cp1 = cp
  while cp1 >= tol/2
    cp0 = cdf(Binomial(n, p), k - 1)
    q_binom_test(n, p, k, cp0, cp1)
    k -= 1
    cp1 = cp0
  end
  ccp0 = 1 - cp
  k = kk + 1
  while ccp0 >= tol/2
    ccp1 = ccdf(Binomial(n, p), k)
    q_binom_test(n, p, k, 1 - ccp0, 1 - ccp1)
    k += 1
    ccp0 = ccp1
  end
  if output_mode == :DETAILED
    println("OK")
  end
end

### PROGRESS DISPLAYING UTILITIES

let

  global display_progress

  len = 0
  last_span = -1

  function display_progress(it::Integer)
    if output_mode == :PROGRESS
      perc = 100*it/N
      span = round(Int, perc/progress_perc_span)
      if span > last_span
        last_span = span
        update = true
      else
        update = (it == N)
      end
      if update
        perc_s = progress_perc_format(perc)
        len = max(len, length(perc_s))
        print("Progress: $(rpad(perc_s, len)) %\r")
      end
    end
  end

end

### MAIN PART

println("Testing quantiles of the binomial distribution ...")
display_progress(0)
for it in range(1, N)
  n = rand(Geometric(1/n_e)) + 1
  q_binom_test(n, rand())
  b = max(n/p_e - 1, 1)  # Typically, n/p_e - 1 .
  q_binom_test(n, rand(rand((false, true)) ? Beta(b, 1) : Beta(1, b)))
  display_progress(it)
end
print("\nAll tested quantiles match!\n")