JuliaStats / Distributions.jl

A Julia package for probability distributions and associated functions.
Other
1.11k stars 418 forks source link

Inconsistent handling of endpoints in truncated distributions #632

Closed eford closed 3 years ago

eford commented 7 years ago

@dch216 found the following bug...

julia>   using Distributions
julia>   d = Truncated(Poisson(0.1),0.0,10.0)
Truncated(Distributions.Poisson{Float64}(λ=0.1), range=(0.0, 10.0))
julia> rand(d,10)
10-element Array{Int64,1}:
 1
 1
 1
 1
 1
 1
 1
 1
 1
 1

When of course, it should include lots of zeros like:

julia> rand(Poisson(0.1),10)
10-element Array{Int64,1}:
 0
 0
 0
 0
 1
 0
 0
 0
 0
 0

It looks like the problem is that

julia> d.lcdf
0.9048374180359595

when it should be zero.
This bug may affect lower truncation of all discretet distrubtions.

johnmyleswhite commented 7 years ago

Indeed, it seems like the truncation code is not handling points exactly at the boundary correctly. I'm not even sure what the semantics for discrete distributions were supposed to be. Are endpoints meant to be included or not? If so, the CDF calls could be replaced with CDF + PDF calls to account for the points with non-zero probability mass.

johnmyleswhite commented 7 years ago

Should we standardize on semi-open intervals like in the Wikipedia article on truncated distributions? IIUC, this is what Mathematica does:

andreasnoack commented 7 years ago

I agree that this is because of inconsistencies of how we handle the end points. The behavior of rand might actually be okay if the left endpoint is assumed to be excluded. I think that would be fine if we just document it better. However, pdf assumes that the left endpoint is included so we have the nonsense

julia> d = Truncated(Poisson(0.1),0.0,10.0)
Truncated(Distributions.Poisson{Float64}(λ=0.1), range=(0.0, 10.0))

julia> pdf(d, 0)
9.508331944775044

whereas cdf excludes the both endpoints

julia> cdf(d, 0)
0.0
eford commented 7 years ago

Yes, you could define the problem away. But I think that's still going to lead to unnecessary bugs where users expect a certain behavior and don't read the documentation word for word. I might suggest that the best option would be to have multiple versions of Truncated that clearly specify their behavior at boundaries. For exampe: TruncatedDistribution could require a Continuous Distribution to avoid unnecessary verbage and perserve compatability. For Discrete Distributions, you could require users to choose from TruncatedDistributionClosedClosed, TruncatedDistributionOpenOpen, TruncatedDistributionClosedOpen, and TruncatedDistributionOpenClosed.

That's slightly verbose, but I think it would minimize risk of bugs.

If you wanted to unify the code base for all four, then you could implement the above types via two extra type parameters like TruncatedDistribution{D<:DiscreteDistribution,B1<:BoundaryType,B2<:BoundaryType}

Thanks!

On Wed, Jul 5, 2017 at 9:48 AM, Andreas Noack notifications@github.com wrote:

I agree that this is because of inconsistencies of how we handle the end points. The behavior of rand might actually be okay if the left endpoint is assumed to be excluded. I think that would be fine if we just document it better. However, pdf assumes that the left endpoint is included so we have the nonsense

julia> d = Truncated(Poisson(0.1),0.0,10.0) Truncated(Distributions.Poisson{Float64}(λ=0.1), range=(0.0, 10.0))

julia> pdf(d, 0) 9.508331944775044

whereas cdf excludes the both endpoints

julia> cdf(d, 0) 0.0

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaStats/Distributions.jl/issues/632#issuecomment-313108037, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQywniTbLi1SsUAx-WPGj-Mg7-JeuI9ks5sK5QVgaJpZM4OMdDD .

-- Eric Ford Professor of Astronomy & Astrophysics Center for Exoplanets & Habitable Worlds Center for Astrostatistics Institute for CyberScience Penn State Astrobiology Research Center Pennsylvania State University

andreasnoack commented 7 years ago

I think that is overly verbose without adding any functionality. I'd prefer simplicity and some good examples in the docstring. @johnmyleswhite's references suggest that it is more common to exclude the left endpoint. The main problem right now is that different functions in Distributions make different assumptions about the endpoints.

eford commented 7 years ago

I think that is overly verbose without adding any functionality.

It would add the functionality of allowing people to easily specify whether they want an open or close boundary for their truncation interval.

I'd prefer simplicity and some good examples in the docstring.

I'd prefer to minimize risk of bugs due to non-standard behavior.

@johnmyleswhite https://github.com/johnmyleswhite's references suggest that it is more common to exclude the left endpoint. The main problem right now is that different functions in Distributions make different assumptions about the endpoints.

Sorry, but wikipedia is not a good authority for mathematical conventions. Like it or not, R is still the de facto standard for statistical software. Looking at R's truncdist package, I see that it assumes that the lower bound is included in the truncated distribution.

install.packages('truncdist') library(truncdist) rtrunc(10,spec="pois",a=1,b=10,0.01) [1] 2 2 2 2 2 2 2 2 2 2

I think if there's going to be default behavior, it should match R's.

Cheers, Eric

On Wed, Jul 5, 2017 at 2:49 PM, Andreas Noack notifications@github.com wrote:

I think that is overly verbose without adding any functionality. I'd prefer simplicity and some good examples in the docstring. @johnmyleswhite https://github.com/johnmyleswhite's references suggest that it is more common to exclude the left endpoint. The main problem right now is that different functions in Distributions make different assumptions about the endpoints.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaStats/Distributions.jl/issues/632#issuecomment-313192901, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQywk6Cmq1TzVrhi6h9g6W_yw2Eyg9Kks5sK9rTgaJpZM4OMdDD .

-- Eric Ford Professor of Astronomy & Astrophysics Center for Exoplanets & Habitable Worlds Center for Astrostatistics Institute for CyberScience Penn State Astrobiology Research Center Pennsylvania State University

andreasnoack commented 7 years ago

It would add the functionality of allowing people to easily specify whether they want an open or close boundary for their truncation interval.

I think you know what I meant. A single open/closed default will allow you to specify all the truncated distributions you'd like. Generally, we try to avoid redundancy but, of course, there can sometimes be a tradeoff.

I'd prefer to minimize risk of bugs due to non-standard behavior.

I'm not convinced that excluding the left endpoint is more surprising than including it. See also below.

I think if there's going to be default behavior, it should match R's.

It looks like R also excludes the left endpoint so I'm not sure what you are arguing for here.

eford commented 7 years ago

Whoops. I was trying to reply quickly before having to deal with meetings and such.
First, you're right the R behavior is also weird and what I consider to be unexpected.
As you noted, it's treating the lower truncation point as an open boundary.
But it's treating the upper truncation point as a closed boundary.

rtrunc(10,spec="pois",a=1,b=10,100.0) [1] 10 10 10 10 9 10 10 10 10 9

Complicating things further, in the paper that truncdist cites (https://www.jstatsoft.org/article/view/v016c02 ), Eqn 1 specifies a closed interval on both sides.

On one hand, I can see some value in matching R's behavior by default, even though that seems very weird and counter-intuitive to me. On the otherhand, I think R's default behavior also invites bugs.
FWIW, Based on those to factors, I still like the idea of forcing users to be explicit when dealing with discrete distributions.

bdeonovic commented 7 years ago

I agree that we should match R, and I don't think the left open, right closed standard is weird and un-intuitive largely because of the standard for CDFs is: The probability that X lies in the semi-closed interval (a, b], where a < b, is P(a< X <= b)=F(b)-F(a).

devmotion commented 3 years ago

The example in the OP was fixed by https://github.com/JuliaStats/Distributions.jl/pull/1386.