JuliaStats / Distributions.jl

A Julia package for probability distributions and associated functions.
Other
1.1k stars 414 forks source link

Issue with support of Beta distribution when alpha < 1 or beta < 1 #1003

Open MFairley opened 4 years ago

MFairley commented 4 years ago

I am having a numerical issue with sampling from a Beta distribution when alpha or beta is less than 1. E.g. sometimes a sample will come back as exactly 1.0 but the support of the Beta distribution does not include 1 when beta is less than 1 and the pdf evaluates to Inf due to division by 0.

devmotion commented 4 years ago

I just ran into the same issue:

using Distributions
using Random
using Test

Random.seed!(1234)
v = rand(Beta(0.01, 1.01), 10_000)
@test_broken !any(iszero, v)

The standard implementation in R does not suffer from the same problem:

set.seed(1234)
v <- rbeta(n=10000, shape1=0.01, shape2=1.01)
0 %in% v && stop("incorrect samples")

It seems the standard implementation in R uses the algorithms BB and BC from R. C. H. Cheng (1978). Generating beta variates with nonintegral shape parameters. Communications of the ACM, 21, 317–322. In general, it might be interesting to have a look at Ying-Chao Hung, Narayanaswamy Balakrishnan & Yi-Te Lin (2009). Evaluation of Beta Generation Algorithms. Communications in Statistics - Simulation and Computation, 38:4, 750-770 for a comparison of different algorithms for sampling from the Beta distribution.

devmotion commented 4 years ago

I think the main issue with the current implementation is that the Gamma sampler returns incorrect values:

using Distributions
using Random
using Test

Random.seed!(1234)
v = rand(Gamma(0.01, 1.0), 10_000)
@test_broken !any(iszero, v)
andreasnoack commented 4 years ago

I think the main issue with the current implementation is that the Gamma sampler returns incorrect values:

Why do you conclude that the values are incorrect? The examples seems similar to

julia> any(iszero, rand(10000))
false

There is zero probability at zero. The probability in a small neighborhood looks okay

julia> x = rand(Gamma(0.01, 1.0), 10000);

julia> mean(t -> t < eps(), x)
0.7067

julia> cdf(Gamma(0.01, 1.0), eps())
0.7013514054165827
devmotion commented 4 years ago

The problem is that the sampler returns values of 0, which are not in the support of the Gamma distribution. In the case of the Beta distribution, for the parameter values mentioned above, the current sampling algorithm uses the Gamma sampler; if that one returns a value of 0 outside of the support of the Gamma distribution, the Beta sampler returns also a value of 0, which is not in the support of the Beta distribution.

andreasnoack commented 4 years ago

Doh. I misread your test. Indeed, it doesn't look right. Would be good to figure out why that happens.

andreasnoack commented 4 years ago

The problem is that the probability of observing a value smaller than the smallest (positive) Float64 is pretty large.

julia> cdf(Gamma(0.01, 1.0), floatmin(Float64))
0.0008432274068068664

so I'm not sure if there is much we can do for the Gamma. Maybe we can switch to a different algorithm for the Beta for these parameter values.

devmotion commented 4 years ago

According to some quick debugging, the problem seems to be https://github.com/JuliaStats/Distributions.jl/blob/a2304a8811f42a7203c76950ddf5f3a5bceb4533/src/samplers/gamma.jl#L210 Here x is sampled from Gamma(a + 1, theta), s.nia = -1 / a, and e = randexp(). For a = 0.01 sampling of x seems to be fine, but exp(s.nia * e) evaluates to 0 from time to time.

andreasnoack commented 4 years ago

Looks like this carries over to the Beta

julia> cdf(Beta(0.01, 1.01), floatmin(Float64))
0.0008385787538914535

so I'm not really sure what an be done.

devmotion commented 4 years ago

Hmm I see the problem but I still think that a sampler should only return valid samples (or throw an error), but not yield samples that are not in the support of the distribution.

devmotion commented 4 years ago

In addition, depending on the use case, one could sample real-valued X with exp(X) ~ Gamma(0.01, 1) directly, i.e., without sampling from Gamma(0.01, 1) (by, e.g., sampling log(x) + s.nia * e in the example above). This would then allow to sample real-valued Y with logistic(Y) ~ Beta(0.01, 1.01) by sampling X1 with exp(X1) ~ Gamma(0.01, 1) and X2 with exp(X2) ~ Gamma(1.01, 1) and setting Y = X1 - X2.

Of course, that would only be helpful if the downstream calculations are based on the logarithm of the samples from the Gamma distribution or the logit of the samples from the Beta distribution (which they were in my particular use case).

devmotion commented 4 years ago

The algorithm in https://arxiv.org/pdf/1302.1884.pdf seems to be an efficient method for sampling X with exp(X) ~ Gamma(a, 1) for small shape parameter a directly, without sampling exp(X) first. Hence X would be guaranteed to be finite even for very small shape parameters. According to the publication, there is a cut-off at a = 0.3 where the acceptance rate falls below the one of algorithm 3 by Kundu and Gupta.

ParadaCarleton commented 1 year ago

Hmm I see the problem but I still think that a sampler should only return valid samples (or throw an error), but not yield samples that are not in the support of the distribution.

Seems like we could just replace 0 with the next-largest float. Is there a problem with this approach?

devmotion commented 1 year ago

This would correspond to a truncated Beta distribution. As mentioned e.g. in https://github.com/JuliaStats/Distributions.jl/issues/1003#issuecomment-555992757, the mass between 0 and floatmin(Float64) can be non-negligible.

devmotion commented 1 year ago

Maybe we should just support sampling of inv-logistic(X) as discussed in https://github.com/JuliaStats/Distributions.jl/issues/1003#issuecomment-556978582 and point users to it for the problematic parameter choices.

ParadaCarleton commented 1 year ago

As mentioned e.g. in #1003 (comment), the mass between 0 and floatmin(Float64) can be non-negligible.

Right, but is there anything we can do about that mass? We can't return numbers we can't represent.

The two options I can see are:

  1. Rounding everything that's equal to zero(type) to eps(zero(type))
  2. Throw an error whenever we have zero(type), so users know there's potential numerical inaccuracies.

(Unless you're saying the current method is inaccurate for subnormal numbers, and are proposing we return more accurate subnormal numbers?)