Closed eford closed 3 years ago
Indeed, it seems like the truncation code is not handling points exactly at the boundary correctly. I'm not even sure what the semantics for discrete distributions were supposed to be. Are endpoints meant to be included or not? If so, the CDF calls could be replaced with CDF + PDF calls to account for the points with non-zero probability mass.
Should we standardize on semi-open intervals like in the Wikipedia article on truncated distributions? IIUC, this is what Mathematica does:
I agree that this is because of inconsistencies of how we handle the end points. The behavior of rand
might actually be okay if the left endpoint is assumed to be excluded. I think that would be fine if we just document it better. However, pdf
assumes that the left endpoint is included so we have the nonsense
julia> d = Truncated(Poisson(0.1),0.0,10.0)
Truncated(Distributions.Poisson{Float64}(λ=0.1), range=(0.0, 10.0))
julia> pdf(d, 0)
9.508331944775044
whereas cdf
excludes the both endpoints
julia> cdf(d, 0)
0.0
Yes, you could define the problem away. But I think that's still going to lead to unnecessary bugs where users expect a certain behavior and don't read the documentation word for word. I might suggest that the best option would be to have multiple versions of Truncated that clearly specify their behavior at boundaries. For exampe: TruncatedDistribution could require a Continuous Distribution to avoid unnecessary verbage and perserve compatability. For Discrete Distributions, you could require users to choose from TruncatedDistributionClosedClosed, TruncatedDistributionOpenOpen, TruncatedDistributionClosedOpen, and TruncatedDistributionOpenClosed.
That's slightly verbose, but I think it would minimize risk of bugs.
If you wanted to unify the code base for all four, then you could implement the above types via two extra type parameters like TruncatedDistribution{D<:DiscreteDistribution,B1<:BoundaryType,B2<:BoundaryType}
Thanks!
On Wed, Jul 5, 2017 at 9:48 AM, Andreas Noack notifications@github.com wrote:
I agree that this is because of inconsistencies of how we handle the end points. The behavior of rand might actually be okay if the left endpoint is assumed to be excluded. I think that would be fine if we just document it better. However, pdf assumes that the left endpoint is included so we have the nonsense
julia> d = Truncated(Poisson(0.1),0.0,10.0) Truncated(Distributions.Poisson{Float64}(λ=0.1), range=(0.0, 10.0))
julia> pdf(d, 0) 9.508331944775044
whereas cdf excludes the both endpoints
julia> cdf(d, 0) 0.0
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaStats/Distributions.jl/issues/632#issuecomment-313108037, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQywniTbLi1SsUAx-WPGj-Mg7-JeuI9ks5sK5QVgaJpZM4OMdDD .
-- Eric Ford Professor of Astronomy & Astrophysics Center for Exoplanets & Habitable Worlds Center for Astrostatistics Institute for CyberScience Penn State Astrobiology Research Center Pennsylvania State University
I think that is overly verbose without adding any functionality. I'd prefer simplicity and some good examples in the docstring. @johnmyleswhite's references suggest that it is more common to exclude the left endpoint. The main problem right now is that different functions in Distributions
make different assumptions about the endpoints.
I think that is overly verbose without adding any functionality.
It would add the functionality of allowing people to easily specify whether they want an open or close boundary for their truncation interval.
I'd prefer simplicity and some good examples in the docstring.
I'd prefer to minimize risk of bugs due to non-standard behavior.
@johnmyleswhite https://github.com/johnmyleswhite's references suggest that it is more common to exclude the left endpoint. The main problem right now is that different functions in Distributions make different assumptions about the endpoints.
Sorry, but wikipedia is not a good authority for mathematical conventions. Like it or not, R is still the de facto standard for statistical software. Looking at R's truncdist package, I see that it assumes that the lower bound is included in the truncated distribution.
install.packages('truncdist') library(truncdist) rtrunc(10,spec="pois",a=1,b=10,0.01) [1] 2 2 2 2 2 2 2 2 2 2
I think if there's going to be default behavior, it should match R's.
Cheers, Eric
On Wed, Jul 5, 2017 at 2:49 PM, Andreas Noack notifications@github.com wrote:
I think that is overly verbose without adding any functionality. I'd prefer simplicity and some good examples in the docstring. @johnmyleswhite https://github.com/johnmyleswhite's references suggest that it is more common to exclude the left endpoint. The main problem right now is that different functions in Distributions make different assumptions about the endpoints.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/JuliaStats/Distributions.jl/issues/632#issuecomment-313192901, or mute the thread https://github.com/notifications/unsubscribe-auth/ABQywk6Cmq1TzVrhi6h9g6W_yw2Eyg9Kks5sK9rTgaJpZM4OMdDD .
-- Eric Ford Professor of Astronomy & Astrophysics Center for Exoplanets & Habitable Worlds Center for Astrostatistics Institute for CyberScience Penn State Astrobiology Research Center Pennsylvania State University
It would add the functionality of allowing people to easily specify whether they want an open or close boundary for their truncation interval.
I think you know what I meant. A single open/closed default will allow you to specify all the truncated distributions you'd like. Generally, we try to avoid redundancy but, of course, there can sometimes be a tradeoff.
I'd prefer to minimize risk of bugs due to non-standard behavior.
I'm not convinced that excluding the left endpoint is more surprising than including it. See also below.
I think if there's going to be default behavior, it should match R's.
It looks like R also excludes the left endpoint so I'm not sure what you are arguing for here.
Whoops. I was trying to reply quickly before having to deal with meetings and such.
First, you're right the R behavior is also weird and what I consider to be unexpected.
As you noted, it's treating the lower truncation point as an open boundary.
But it's treating the upper truncation point as a closed boundary.
rtrunc(10,spec="pois",a=1,b=10,100.0) [1] 10 10 10 10 9 10 10 10 10 9
Complicating things further, in the paper that truncdist cites (https://www.jstatsoft.org/article/view/v016c02 ), Eqn 1 specifies a closed interval on both sides.
On one hand, I can see some value in matching R's behavior by default, even though that seems very weird and counter-intuitive to me. On the otherhand, I think R's default behavior also invites bugs.
FWIW, Based on those to factors, I still like the idea of forcing users to be explicit when dealing with discrete distributions.
I agree that we should match R, and I don't think the left open, right closed standard is weird and un-intuitive largely because of the standard for CDFs is: The probability that X lies in the semi-closed interval (a, b]
, where a < b
, is P(a< X <= b)=F(b)-F(a)
.
The example in the OP was fixed by https://github.com/JuliaStats/Distributions.jl/pull/1386.
@dch216 found the following bug...
When of course, it should include lots of zeros like:
It looks like the problem is that
when it should be zero.
This bug may affect lower truncation of all discretet distrubtions.