Closed ViralBShah closed 12 years ago
One open question is whether to bother making distributions generic to all real types or just stick with floats for continuous and ints for discrete. I went generic here, but it's pretty easy to argue that this is an over-generalization. Especially because functions like exp
, sqrt
and erf
are pretty much exclusively going to be floating-point.
Given our type system, I am tempted to say that we should stick with Ints for discrete. The only reason not to do so would be that mixing ints with other floating point code ends up being slower than otherwise.
Regarding the generic type issue, would SymmetricRealDistribution{T} be an abstract parameterized type? I imagine such things are defined in various base/*.jl files but to save me doing the grepping, could one of you block out the abstract type declarations starting at Distribution?
@dmbates: I was just sketching at this point. None of that exists. I was thinking that it would be parametric, but maybe it needn't be — continuous distributions could always just yield Float64
values for samples.
I think that treating all continuous distributions as using Float64's and all discrete integer distributions as Int64's isn't likely to a problem. (Of course, we do need to keep in mind that a distribution like the Poisson may have continuous parameters, but only be defined over a discrete set.)
In general, I think the definition of distribution should be large enough to handle things like categorial distributions defined over an arbitrary set. It would be nice to be able to run something like empirical_distribution(x)
over arbitrary input data and have it generate a distribution type variable that can be used with pdf()
, cdf()
, quantile()
and sample()
.
On Tue, May 22, 2012 at 3:51 PM, John Myles White reply@reply.github.com wrote:
I think that treating all continuous distributions as using Float64's and all discrete integer distributions as Int64's isn't likely to a problem. (Of course, we do need to keep in mind that a distribution like the Poisson may have continuous parameters, but only be defined over a discrete set.)
In general, I think the definition of distribution should be large enough to handle things like categorial distributions defined over an arbitrary set. It would be nice to be able to run something like
empirical_distribution(x)
over arbitrary input data and have it generate a distribution type variable that can be used withpdf()
,cdf()
,quantile()
andsample()
.
Actually quantile is already defined for Array's of Number's in base/statistics.jl
True. But the other functions like pdf are missing, so it might be good to make a general purpose tool that includes quantiles of an array of Number's just for completeness.
I think that treating all continuous distributions as using Float64's and all discrete integer distributions as Int64's isn't likely to a problem.
I'm kind of thinking about things like "What if we wanted to define symbolic distributions?" Then you could have a RealSymbol type but have the distribution represented symbolically. I think that's way too much generality for a first cut though, so I'm pretty convinced to just stick with Float and Integer types.
(Of course, we do need to keep in mind that a distribution like the Poisson may have continuous parameters, but only be defined over a discrete set.)
Of course — no reason that the parameter types and the distribution type need to be the same.
I wonder if it would be possible for distribution objects to be iterables. It would simplify sampling. like
for N in Normal(a,b,100)
do something
end
Would be nice to allow distribution parameters to be distributions themselves. Good for conditioning:
dist = Normal(Gamma(2,2), 3)
pdf(dist)
rand(dist)
for i in dist
bla bla
end
I don't see why this can't be done, and it would in fact be really cool as well. Don't we just need to implement start(), next(), and done()?
I'm still trying to think about methods for generics like pdf, pmf, lpdf, cdf, ccdf (complementary cdf, i.e. 1-cdf but calculated without round-off), lcdf, lccdf, ... and type for discrete or continuous distributions. For many distributions there are restrictions on the possible parameter values and my approach is to check for legitimate values in the constructor and declare a specific type for the fields in the composite type.
type Normal <: ContinuousDistribution
mean::Float64
stddev::Float64
Normal(mu, sd) = sd >= 0 ? new(float64(mu), float64(sd)) : error("stddev must be non-negative")
end
Normal(mu) = Normal(mu, 1)
Normal() = Normal(0,1)
const Gaussian = Normal
type Binomial <: DiscreteDistribution
size::Int
prob::Float64
Binomial(n, p) = n <= 0 ? error("size must be positive") : (0. <= p <= 1. ? new(int(n), float64(p)) : error("prob must be in [0,1]"))
end
I then code methods as in
mean(d::Normal) = d.mean
variance(d::Normal) = d.stddev^2
rand(d::Normal) = d.mean + d.stddev * randn()
function pdf(d::Normal, x::Real)
ccall(dlsym(_jl_libRmath, :dnorm4), Float64, (Float64, Float64, Float64, Int32),
x, d.mean, d.stddev, 0)
end
function lpdf(d::Normal, x::Real)
ccall(dlsym(_jl_libRmath, :dnorm4), Float64, (Float64, Float64, Float64, Int32),
x, d.mean, d.stddev, 1)
end
function cdf(d::Normal, q::Real)
ccall(dlsym(_jl_libRmath, :pnorm5), Float64, (Float64, Float64, Float64, Int32, Int32),
q, d.mean, d.stddev, 1, 0)
end
function lcdf(d::Normal, q::Real)
ccall(dlsym(_jl_libRmath, :pnorm5), Float64, (Float64, Float64, Float64, Int32, Int32),
q, d.mean, d.stddev, 1, 1)
end
function ccdf(d::Normal, q::Real)
ccall(dlsym(_jl_libRmath, :pnorm5), Float64, (Float64, Float64, Float64, Int32, Int32),
q, d.mean, d.stddev, 0, 0)
end
function lccdf(d::Normal, q::Real)
ccall(dlsym(_jl_libRmath, :pnorm5), Float64, (Float64, Float64, Float64, Int32, Int32),
q, d.mean, d.stddev, 0, 1)
end
function quantile(d::Normal, p::Real)
ccall(dlsym(_jl_libRmath, :qnorm5), Float64, (Float64, Float64, Float64, Int32, Int32),
p, d.mean, d.stddev, 1, 0)
end
function cquantile(d::Normal, p::Real)
ccall(dlsym(_jl_libRmath, :qnorm5), Float64, (Float64, Float64, Float64, Int32, Int32),
p, d.mean, d.stddev, 0, 0)
end
function cdf{T<:Real}(d::Distribution, x::AbstractArray{T})
reshape([cdf(d, e) for e in x], size(x))
end
If this seems like a reasonable approach then I will code up as much as can be expressed with libRmath functions. If later it seems that the types inheriting from Distribution should be changed I don't think the code will need to be modified all that much.
As I said in the discussion on julia-dev, I don't think there is that much distinction between a discrete and a continuous distribution except in terms of pmf or pdf distribution and in the type returned from the rand method.
Isn't ddf (decumulative distribution function) better than ccdf? It's a pretty standard name.
@dmbates: that's exactly it. Super straightforward, no? I did add type parameters to the distribution types. This is really only so that you can have different integer versions and both Float32 and Float64, but if the underlying Rmath code only does Float64, then we may as well ditch all of that for now anyway.
I wonder if it would be possible for distribution objects to be iterables. It would simplify sampling.
Not for iterating the distributions themselves since Normal(a,b,100)
doesn't mean anything — the Normal
distribution only has two parameters, which can optionally be reduced to one or zero. We could provide a Sample
iterator, however and write this as
d = Normal(a,b)
for x in Sample(d,100)
# do something
end
Would be nice to allow distribution parameters to be distributions themselves. Good for conditioning:
That's a cool idea and I suspect that Mathematica can do this sort of thing, but they spent millions of dollars on R&D to produce Mathematica 6 & 7, which completely blew my mind with what they could do for stats stuff. I guess in cases where we happen to know what kind of distribution is produced, we could maybe do that.
Then there's the issue that the resulting distribution may not be of the type indicated by the constructor call. That is, although Normal(Gamma(2,2), 3)
is almost certainly a Normal
distribution, that's not going to be true for other distributions, e.g., Exponential(Gamma(2,2),3)
.
If anyone hasn't seen this yet, here's my initial cut on defining distributions:
https://github.com/JuliaLang/julia/blob/master/base/distributions.jl
Isn't ddf (decumulative distribution function) better than ccdf? It's a pretty standard name.
I've never head this term before and it doesn't even pull up the CDF/CCDF wikipedia page when I google it. The top hit is some blog post on a site called hedgefundwriter.com. All of this leads me to believe it's not a terribly standard term.
@vitoshka Sure, happy to change the name to ddf, especially at this point. I will stay with lcdf (log cdf) and lddf (log ddf) unless there are better suggestions. Any suggestions about the complement of the quantile (this is defined so that you can call cquantile(Normal(), 1e-100) to get some point way out in the right hand tail. There is also versions for the quantiles given the log-probability. For the time being I am using quantilel and cquantilel for those generics. I doubt whether they will be widely used but it is always good to have them for the hard-core numeric geeks.
General question: Should the rand method for a discrete distribution return an integer type? Technically a discrete distribution does not need to have integer support. You could, for example, consider a random sample from a binomial on the scale of the proportion of successes. I suppose we could define an abstract type for discrete distributions with integer support and have the rand methods for them return an integer type.
Why not just logcdf, logddf? It is still short enough, and more readable.
-viral
On 23-May-2012, at 11:20 PM, dmbates wrote:
@vitoshka Sure, happy to change the name to ddf, especially at this point. I will stay with lcdf (log cdf) and lddf (log ddf) unless there are better suggestions. Any suggestions about the complement of the quantile (this is defined so that you can call cquantile(Normal(), 1e-100) to get some point way out in the right hand tail. There is also versions for the quantiles given the log-probability. For the time being I am using quantilel and cquantilel for those generics. I doubt whether they will be widely used but it is always good to have them for the hard-core numeric geeks.
General question: Should the rand method for a discrete distribution return an integer type? Technically a discrete distribution does not need to have integer support. You could, for example, consider a random sample from a binomial on the scale of the proportion of successes. I suppose we could define an abstract type for discrete distributions with integer support and have the rand methods for them return an integer type.
Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/859#issuecomment-5879953
I'm against the name ddf
. Let's just consistently prefix complementary versions of functions with a single "c". I would also prefer to spell out logcdf
— it's not that common to use and will be much more readable and not much harder to type. Similarly I'd favor logccdf
, cquantile
. Not sure about quantilel
and cquantilel
. Do we really need those names? Are they going to give any different answer than just applying log
and exp
in various places? In some cases the values can be computed in a better, non-obvious fashion. In others it seems like names for their own sake when just writing what you want with functions would be just as easy and clearer.
For the quantilel
and cquantilel
it may be clearer if we call them invlogcdf
and invlogccdf
since they're only dubiously related to quantiles anymore and the really what you want is the function that inverts what you computed in the other direction.
We could provide a
Sample
iterator,
Not bad. Though I personally like to think of random variables (which Normal(a, b) presumably is) as random number generators. And it's good for implementation conditional random simulators -- child distribution can just ask the parent for a next value of the parameter.
That's a cool idea and I suspect that Mathematica can do this sort of thing, but they spent millions of dollars on R&D to produce Mathematica 6 & 7, which completely blew my mind with what they could do for stats stuff. I guess in cases where we happen to know what kind of distribution is produced, we could maybe do that.
In order to be useful Julia need not have "symbolic" pdf or cdf for things like conditioning Dist1(Dist2, ), compositions (Dist1 o Dist2) , and results of math operations (Dist1 + Dist2, Dist1 * Dist2, etc). There will be very few cases when this is possible analytically (even in Mathematica 8 ;-) ).
The crucial thing is to be able to draw from these composite distributions, as this is of the paramount practical importance. Pdf, cdf etc could be computed as empirical distributions or smoothed kernel densities.
Stefan Karpinski reply@reply.github.com on Wed, 23 May 2012 10:48:21 -0700 wrote:
Isn't ddf (decumulative distribution function) better than ccdf? It's a pretty standard name. I've never head this term before and it doesn't even pull up the CDF/CCDF wikipedia page when I google it.
You haven't heard about it because it's not in probability textbooks. The reason is, it is just not need there. It is not extremely useful in statistics either.
The terms is extremely common in applied fields especially in finance, insurance an decision theory. It comes form the theory of non-additive probability and integration (because there it really place a crucial role). Wikipedia is written by first year bachelors, I would not go for that as an ultimate reference. Search on google scholar to get an idea of the usage.
It sounds natural and unambiguous. Compare: Complementary Cumulative Distribution Function (ccdf) Decumulative Distribution function (ddf)
As to consistency, "quantile" is also a bad name. I always have to think twice what it means. invcdf and invddf are a way more intuitive and graphic.
At the end you will have cdf, ddf, logcdf, logddf, invcdf, invddf loginvcdf, loginvddf. What can be more consistent?
I've used CCDFs extensively. They're extremely common when checking if empirical distributions fit certain theoretical distributions. So, I'm not sold on the ddf
name, but I'm with you on invcdf
and logcdf
. It should be invlogcdf
not loginvcdf
because you don't apply the log
function to the inverse of the cdf
— you apply the exp
function. I think we should have quantile
be an alias for invcdf
and otherwise just use the inv
prefix consistently.
Google Scholar hits:
Complementary Cumulative Distribution Function: 160,000 Decumulative Distribution Function: 570
Edit: sorry, forgot to put the search terms in quotes.
"Complementary Cumulative Distribution Function" : 4,770 "Decumulative Distribution Function" : 227
Hahahahaha. So good.
I will chime in more later. Busy packing. Moving to Seattle tomorrow.
-- John
On May 23, 2012, at 3:36 PM, Mike Nolta wrote:
Google Scholar hits:
Complementary Cumulative Distribution Function: 160,000 Decumulative Distribution Function: 570
Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/859#issuecomment-5882467
Google Scholar hits:
Complementary Cumulative Distribution Function: 160,000 Decumulative Distribution Function: 570
That seems pretty decisive to me. Similarly, on plain old Google:
Complementary Cumulative Distribution Function: 79,800 Decumulative Distribution Function: 3,580
Not quite as extreme a difference, but it seems like ccdf
is an overwhelmingly more common term.
Mike Nolta reply@reply.github.com on Wed, 23 May 2012 12:36:29 -0700 wrote:
Google Scholar hits: Complementary Cumulative Distribution Function: 160,000 Decumulative Distribution Function: 570
Complementary is an English word, and decumulative is not. Try
Try "Complementary Cumulative Distribution Function" to get 4.700. Quite a difference, isn't it? Decumulative is a new name, people started using it in 1990s with the advent of non-additive integration.
In probability and statistics everything could be done equally well with CDF or 1-CDF; it doesn't make much difference. It's all about more intuitive and shorter name.
History is full of misnomers just because the first guy didn't think much about the name. At the end you guys have to choose to go with more meaningful new names or with older classical once.
I will add decumulative to wikipedia tomorrow, to reinforce the new standards :)
Decumulative is a new name, people started using it in 1990s with the advent of non-additive integration.
This is precisely why this is a lousy choice — it's not an actual word. People don't know what it means and they don't use it. Googling decumulative returns no definitions or wikipedia pages. Googling "non-additive integration" returns some scholarly results, but I've never heard of it before and I've done a lot of measure and integration theory. I'm sorry, this is really not a mainstream thing.
Repeat myself here, complementary cdf was used for a century. Decumulative for a decade and many people haven't even heard about it. Let's talk in another 10 years to see how things evolve. It's just a better name.
The latest standards in measure theory are http://www.amazon.com/Handbook-Measure-Theory-two-volumes/dp/0444502637 and http://books.google.nl/books?id=WfGJcgAACAAJ&dq=non+additive+integral&hl=en&sa=X&ei=2UO9T52FAams0QX66rE9&sqi=2&ved=0CDMQ6AEwAA
I don't know how these books use the term but that should be the standard, whatever it is.
Stefan Karpinski reply@reply.github.com on Wed, 23 May 2012 12:47:28 -0700 wrote:
That seems pretty decisive to me. Similarly, on plain old Google:
Complementary Cumulative Distribution Function: 79,800 Decumulative Distribution Function: 3,580
Not quite as extreme a difference, but it seems like
ccdf
is an overwhelmingly more common term.
The Handbook of Measure Theory does not contain the term "decumulative" anywhere.
The Handbook of Measure Theory does not contain the term "decumulative" anywhere.
How do you search? Amazon contains only 10 pages contents preview. I will try to find the book tomorrow, to have my piece of mind.
Not a big deal anyways. It's just a name. Both are fine.
I used amazon and google book searches. I was under the impression that these both search the entire contents and show you snippets — you just can't view the entire book. It's also not in the book's index. Regardless of which phrase is better, ccdf
will cause less confusion and mean something to more people without them having to look it up.
Right now I am using the name ccdf for the generic but it can be easily changed if I ever get the macros working. I prefer to avoid error-prone cut-and-paste programming and would like to define methods like
function pdf(d::Logistic, x::Real)
ccall(dlsym(_jl_libRmath, :dlogis), Float64, (Float64,Float64,Float64,Int32), x, d.location, d.scale, 0)
end
within a macro. However, I am having difficulty generating an expression for the first argument. Here is my most recent unsuccessful attempt
macro _jl_cont_dist_2par(typ, base, p1, p2)
ddsym = dd = symbol(strcat("d", string(base)))
tt = strcat("d::", string(typ))
if (string(base) == "norm")
ddsym = :dnorm4
end
quote
function pdf($tt, x::Real)
ccall(dlsym(_jl_libRmath, $string(ddsym)),
Float64, (Float64, Float64, Float64, Int32),
x, $p1, $p2, 0)
end
end
end
In actual use the macro will go on to define several other methods but the patterns will be similar.
Suggestions would be welcome.
You can't just build a string — our expressions are real ASTs, not just strings that get parsed. You could build a string and then parse it to get an AST, but that's a bit crazy. Let me play around a bit and come up with something helpful...
This works:
julia> T = Logistic
Logistic{T}
Methods for generic function Logistic
Logistic{T}(T,T)
julia> f = :(:dlogis)
"dlogis"
julia> p1 = :(:mean)
:mean
julia> p2 = :(:scale)
:scale
julia> @eval function pdf(d::($T), x::Real)
ccall(dlsym(_jl_libRmath, $f),
Float64, (Float64, Float64, Float64, Int32),
x, d.($p1), d.($p2), 0)
end
julia> pdf(Logistic(1.0,2.0), 1.5)
0.12306704136879916
The double-quoting business is because if you want the expression for creating a symbol in the final AST, not just the symbol itself. So you'd probably want to write the macro like this:
macro _jl_cont_dist_2par(T, f, p1, p2)
f = expr(:quote,f)
p1 = expr(:quote,p1)
p2 = expr(:quote,p2)
quote
function pdf(d::($T), x::Real)
ccall(dlsym(_jl_libRmath, $f),
Float64, (Float64, Float64, Float64, Int32),
x, d.($p1), d.($p2), 0)
end
end
end
and then use it like this:
@_jl_cont_dist_2par Logistic dlogis mean scale
Basically, you want to specify the function name and parameter names as symbols but then doubly quote them.
For some distributions I have added methods for the mean, median and variance generics. The std generic from base/statistics.jl has a single method for a Distribution that takes the square root of the variance. I just saw that I should have used the var generic for variance and not introduced a new generic.
There are many more methods for mean, median, var, kurtosis, etc. that could be added using formulas from, say, wikipedia.org pages for the distributions. I ran out of energy.
I think that using the same generic functions for var
, mean
, median
, etc. is the right way to go. The generic function embodies "give me the mean of this thing" — if that thing is a distribution, this is what it should do; the the thing is a vector of values, then it should give you the sample mean.
This issue has achieved its goal for kicking off the new RNG interface design and implementation. Is it worthwhile to leave it open for further discussion, or should we close it?
I think the issue could be closed. It seems to me that further discussion, if needed, could take place under newly created issues.
Based on the discussion on the mailing list, the following interface is proposed for RNGs.
https://groups.google.com/forum/?fromgroups#!topic/julia-dev/dO5Uai1h56o