JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.41k stars 5.45k forks source link

RNG interface redesign #859

Closed ViralBShah closed 12 years ago

ViralBShah commented 12 years ago

Based on the discussion on the mailing list, the following interface is proposed for RNGs.

https://groups.google.com/forum/?fromgroups#!topic/julia-dev/dO5Uai1h56o

type Gaussian{T<:Real} <: SymmetricRealDistribution{T}
    mean::T
    stddev::T
end
Gaussian(m) = Gaussian(m,one(m))
Gaussian() = Gaussian(0.0,1.0)

pdf(d::Gaussian, x::Real) = exp(-(x-d.mean)^2/(2d.stddev^2))/(d.stddev*sqrt(2pi)))
cdf(d::Gaussian, x::Real) = (1+erf(x-d.mean)/(d.stddev*sqrt(2)))/2
StefanKarpinski commented 12 years ago

One open question is whether to bother making distributions generic to all real types or just stick with floats for continuous and ints for discrete. I went generic here, but it's pretty easy to argue that this is an over-generalization. Especially because functions like exp, sqrt and erf are pretty much exclusively going to be floating-point.

ViralBShah commented 12 years ago

Given our type system, I am tempted to say that we should stick with Ints for discrete. The only reason not to do so would be that mixing ints with other floating point code ends up being slower than otherwise.

dmbates commented 12 years ago

Regarding the generic type issue, would SymmetricRealDistribution{T} be an abstract parameterized type? I imagine such things are defined in various base/*.jl files but to save me doing the grepping, could one of you block out the abstract type declarations starting at Distribution?

StefanKarpinski commented 12 years ago

@dmbates: I was just sketching at this point. None of that exists. I was thinking that it would be parametric, but maybe it needn't be — continuous distributions could always just yield Float64 values for samples.

johnmyleswhite commented 12 years ago

I think that treating all continuous distributions as using Float64's and all discrete integer distributions as Int64's isn't likely to a problem. (Of course, we do need to keep in mind that a distribution like the Poisson may have continuous parameters, but only be defined over a discrete set.)

In general, I think the definition of distribution should be large enough to handle things like categorial distributions defined over an arbitrary set. It would be nice to be able to run something like empirical_distribution(x) over arbitrary input data and have it generate a distribution type variable that can be used with pdf(), cdf(), quantile() and sample().

dmbates commented 12 years ago

On Tue, May 22, 2012 at 3:51 PM, John Myles White reply@reply.github.com wrote:

I think that treating all continuous distributions as using Float64's and all discrete integer distributions as Int64's isn't likely to a problem. (Of course, we do need to keep in mind that a distribution like the Poisson may have continuous parameters, but only be defined over a discrete set.)

In general, I think the definition of distribution should be large enough to handle things like categorial distributions defined over an arbitrary set. It would be nice to be able to run something like empirical_distribution(x) over arbitrary input data and have it generate a distribution type variable that can be used with pdf(), cdf(), quantile() and sample().

Actually quantile is already defined for Array's of Number's in base/statistics.jl

johnmyleswhite commented 12 years ago

True. But the other functions like pdf are missing, so it might be good to make a general purpose tool that includes quantiles of an array of Number's just for completeness.

StefanKarpinski commented 12 years ago

I think that treating all continuous distributions as using Float64's and all discrete integer distributions as Int64's isn't likely to a problem.

I'm kind of thinking about things like "What if we wanted to define symbolic distributions?" Then you could have a RealSymbol type but have the distribution represented symbolically. I think that's way too much generality for a first cut though, so I'm pretty convinced to just stick with Float and Integer types.

(Of course, we do need to keep in mind that a distribution like the Poisson may have continuous parameters, but only be defined over a discrete set.)

Of course — no reason that the parameter types and the distribution type need to be the same.

vspinu commented 12 years ago

I wonder if it would be possible for distribution objects to be iterables. It would simplify sampling. like

for N in Normal(a,b,100)
  do something
end

Would be nice to allow distribution parameters to be distributions themselves. Good for conditioning:

dist = Normal(Gamma(2,2), 3)
pdf(dist)
rand(dist)
for i in dist
  bla bla
end
ViralBShah commented 12 years ago

I don't see why this can't be done, and it would in fact be really cool as well. Don't we just need to implement start(), next(), and done()?

dmbates commented 12 years ago

I'm still trying to think about methods for generics like pdf, pmf, lpdf, cdf, ccdf (complementary cdf, i.e. 1-cdf but calculated without round-off), lcdf, lccdf, ... and type for discrete or continuous distributions. For many distributions there are restrictions on the possible parameter values and my approach is to check for legitimate values in the constructor and declare a specific type for the fields in the composite type.

type Normal <: ContinuousDistribution
    mean::Float64
    stddev::Float64
    Normal(mu, sd) = sd >= 0 ? new(float64(mu), float64(sd)) : error("stddev must be non-negative")
end
Normal(mu) = Normal(mu, 1)
Normal() = Normal(0,1)
const Gaussian = Normal

type Binomial <: DiscreteDistribution
    size::Int
    prob::Float64

    Binomial(n, p) = n <= 0 ?  error("size must be positive") : (0. <= p <= 1. ? new(int(n), float64(p)) : error("prob must be in [0,1]"))
end

I then code methods as in


mean(d::Normal) = d.mean
variance(d::Normal) = d.stddev^2
rand(d::Normal) = d.mean + d.stddev * randn()
function pdf(d::Normal, x::Real)
    ccall(dlsym(_jl_libRmath, :dnorm4), Float64, (Float64, Float64, Float64, Int32),
          x, d.mean, d.stddev, 0)
end
function lpdf(d::Normal, x::Real)
    ccall(dlsym(_jl_libRmath, :dnorm4), Float64, (Float64, Float64, Float64, Int32),
          x, d.mean, d.stddev, 1)
end
function cdf(d::Normal, q::Real)
    ccall(dlsym(_jl_libRmath, :pnorm5), Float64, (Float64, Float64, Float64, Int32, Int32),
          q, d.mean, d.stddev, 1, 0)
end
function lcdf(d::Normal, q::Real)
    ccall(dlsym(_jl_libRmath, :pnorm5), Float64, (Float64, Float64, Float64, Int32, Int32),
          q, d.mean, d.stddev, 1, 1)
end
function ccdf(d::Normal, q::Real)
    ccall(dlsym(_jl_libRmath, :pnorm5), Float64, (Float64, Float64, Float64, Int32, Int32),
          q, d.mean, d.stddev, 0, 0)
end
function lccdf(d::Normal, q::Real)
    ccall(dlsym(_jl_libRmath, :pnorm5), Float64, (Float64, Float64, Float64, Int32, Int32),
          q, d.mean, d.stddev, 0, 1)
end
function quantile(d::Normal, p::Real)
    ccall(dlsym(_jl_libRmath, :qnorm5), Float64, (Float64, Float64, Float64, Int32, Int32),
          p, d.mean, d.stddev, 1, 0)
end
function cquantile(d::Normal, p::Real)
    ccall(dlsym(_jl_libRmath, :qnorm5), Float64, (Float64, Float64, Float64, Int32, Int32),
          p, d.mean, d.stddev, 0, 0)
end

function cdf{T<:Real}(d::Distribution, x::AbstractArray{T})
    reshape([cdf(d, e) for e in x], size(x))
end

If this seems like a reasonable approach then I will code up as much as can be expressed with libRmath functions. If later it seems that the types inheriting from Distribution should be changed I don't think the code will need to be modified all that much.

As I said in the discussion on julia-dev, I don't think there is that much distinction between a discrete and a continuous distribution except in terms of pmf or pdf distribution and in the type returned from the rand method.

vspinu commented 12 years ago

Isn't ddf (decumulative distribution function) better than ccdf? It's a pretty standard name.

StefanKarpinski commented 12 years ago

@dmbates: that's exactly it. Super straightforward, no? I did add type parameters to the distribution types. This is really only so that you can have different integer versions and both Float32 and Float64, but if the underlying Rmath code only does Float64, then we may as well ditch all of that for now anyway.

StefanKarpinski commented 12 years ago

I wonder if it would be possible for distribution objects to be iterables. It would simplify sampling.

Not for iterating the distributions themselves since Normal(a,b,100) doesn't mean anything — the Normal distribution only has two parameters, which can optionally be reduced to one or zero. We could provide a Sample iterator, however and write this as

d = Normal(a,b)
for x in Sample(d,100)
    # do something
end

Would be nice to allow distribution parameters to be distributions themselves. Good for conditioning:

That's a cool idea and I suspect that Mathematica can do this sort of thing, but they spent millions of dollars on R&D to produce Mathematica 6 & 7, which completely blew my mind with what they could do for stats stuff. I guess in cases where we happen to know what kind of distribution is produced, we could maybe do that.

Then there's the issue that the resulting distribution may not be of the type indicated by the constructor call. That is, although Normal(Gamma(2,2), 3) is almost certainly a Normal distribution, that's not going to be true for other distributions, e.g., Exponential(Gamma(2,2),3).

StefanKarpinski commented 12 years ago

If anyone hasn't seen this yet, here's my initial cut on defining distributions:

https://github.com/JuliaLang/julia/blob/master/base/distributions.jl

StefanKarpinski commented 12 years ago

Isn't ddf (decumulative distribution function) better than ccdf? It's a pretty standard name.

I've never head this term before and it doesn't even pull up the CDF/CCDF wikipedia page when I google it. The top hit is some blog post on a site called hedgefundwriter.com. All of this leads me to believe it's not a terribly standard term.

dmbates commented 12 years ago

@vitoshka Sure, happy to change the name to ddf, especially at this point. I will stay with lcdf (log cdf) and lddf (log ddf) unless there are better suggestions. Any suggestions about the complement of the quantile (this is defined so that you can call cquantile(Normal(), 1e-100) to get some point way out in the right hand tail. There is also versions for the quantiles given the log-probability. For the time being I am using quantilel and cquantilel for those generics. I doubt whether they will be widely used but it is always good to have them for the hard-core numeric geeks.

General question: Should the rand method for a discrete distribution return an integer type? Technically a discrete distribution does not need to have integer support. You could, for example, consider a random sample from a binomial on the scale of the proportion of successes. I suppose we could define an abstract type for discrete distributions with integer support and have the rand methods for them return an integer type.

ViralBShah commented 12 years ago

Why not just logcdf, logddf? It is still short enough, and more readable.

-viral

On 23-May-2012, at 11:20 PM, dmbates wrote:

@vitoshka Sure, happy to change the name to ddf, especially at this point. I will stay with lcdf (log cdf) and lddf (log ddf) unless there are better suggestions. Any suggestions about the complement of the quantile (this is defined so that you can call cquantile(Normal(), 1e-100) to get some point way out in the right hand tail. There is also versions for the quantiles given the log-probability. For the time being I am using quantilel and cquantilel for those generics. I doubt whether they will be widely used but it is always good to have them for the hard-core numeric geeks.

General question: Should the rand method for a discrete distribution return an integer type? Technically a discrete distribution does not need to have integer support. You could, for example, consider a random sample from a binomial on the scale of the proportion of successes. I suppose we could define an abstract type for discrete distributions with integer support and have the rand methods for them return an integer type.


Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/859#issuecomment-5879953

StefanKarpinski commented 12 years ago

I'm against the name ddf. Let's just consistently prefix complementary versions of functions with a single "c". I would also prefer to spell out logcdf — it's not that common to use and will be much more readable and not much harder to type. Similarly I'd favor logccdf, cquantile. Not sure about quantilel and cquantilel. Do we really need those names? Are they going to give any different answer than just applying log and exp in various places? In some cases the values can be computed in a better, non-obvious fashion. In others it seems like names for their own sake when just writing what you want with functions would be just as easy and clearer.

StefanKarpinski commented 12 years ago

For the quantilel and cquantilel it may be clearer if we call them invlogcdf and invlogccdf since they're only dubiously related to quantiles anymore and the really what you want is the function that inverts what you computed in the other direction.

vspinu commented 12 years ago

We could provide a Sample iterator,

Not bad. Though I personally like to think of random variables (which Normal(a, b) presumably is) as random number generators. And it's good for implementation conditional random simulators -- child distribution can just ask the parent for a next value of the parameter.

That's a cool idea and I suspect that Mathematica can do this sort of thing, but they spent millions of dollars on R&D to produce Mathematica 6 & 7, which completely blew my mind with what they could do for stats stuff. I guess in cases where we happen to know what kind of distribution is produced, we could maybe do that.

In order to be useful Julia need not have "symbolic" pdf or cdf for things like conditioning Dist1(Dist2, ), compositions (Dist1 o Dist2) , and results of math operations (Dist1 + Dist2, Dist1 * Dist2, etc). There will be very few cases when this is possible analytically (even in Mathematica 8 ;-) ).

The crucial thing is to be able to draw from these composite distributions, as this is of the paramount practical importance. Pdf, cdf etc could be computed as empirical distributions or smoothed kernel densities.

vspinu commented 12 years ago

Stefan Karpinski reply@reply.github.com on Wed, 23 May 2012 10:48:21 -0700 wrote:

Isn't ddf (decumulative distribution function) better than ccdf? It's a pretty standard name. I've never head this term before and it doesn't even pull up the CDF/CCDF wikipedia page when I google it.

You haven't heard about it because it's not in probability textbooks. The reason is, it is just not need there. It is not extremely useful in statistics either.

The terms is extremely common in applied fields especially in finance, insurance an decision theory. It comes form the theory of non-additive probability and integration (because there it really place a crucial role). Wikipedia is written by first year bachelors, I would not go for that as an ultimate reference. Search on google scholar to get an idea of the usage.

It sounds natural and unambiguous. Compare: Complementary Cumulative Distribution Function (ccdf) Decumulative Distribution function (ddf)

As to consistency, "quantile" is also a bad name. I always have to think twice what it means. invcdf and invddf are a way more intuitive and graphic.

At the end you will have cdf, ddf, logcdf, logddf, invcdf, invddf loginvcdf, loginvddf. What can be more consistent?

StefanKarpinski commented 12 years ago

I've used CCDFs extensively. They're extremely common when checking if empirical distributions fit certain theoretical distributions. So, I'm not sold on the ddf name, but I'm with you on invcdf and logcdf. It should be invlogcdf not loginvcdf because you don't apply the log function to the inverse of the cdf — you apply the exp function. I think we should have quantile be an alias for invcdf and otherwise just use the inv prefix consistently.

nolta commented 12 years ago

Google Scholar hits:

Complementary Cumulative Distribution Function: 160,000 Decumulative Distribution Function: 570

Edit: sorry, forgot to put the search terms in quotes.

"Complementary Cumulative Distribution Function" : 4,770 "Decumulative Distribution Function" : 227

johnmyleswhite commented 12 years ago

Hahahahaha. So good.

I will chime in more later. Busy packing. Moving to Seattle tomorrow.

-- John

On May 23, 2012, at 3:36 PM, Mike Nolta wrote:

Google Scholar hits:

Complementary Cumulative Distribution Function: 160,000 Decumulative Distribution Function: 570


Reply to this email directly or view it on GitHub: https://github.com/JuliaLang/julia/issues/859#issuecomment-5882467

StefanKarpinski commented 12 years ago

Google Scholar hits:

Complementary Cumulative Distribution Function: 160,000 Decumulative Distribution Function: 570

That seems pretty decisive to me. Similarly, on plain old Google:

Complementary Cumulative Distribution Function: 79,800 Decumulative Distribution Function: 3,580

Not quite as extreme a difference, but it seems like ccdf is an overwhelmingly more common term.

vspinu commented 12 years ago

Mike Nolta reply@reply.github.com on Wed, 23 May 2012 12:36:29 -0700 wrote:

Google Scholar hits: Complementary Cumulative Distribution Function: 160,000 Decumulative Distribution Function: 570

Complementary is an English word, and decumulative is not. Try

Try "Complementary Cumulative Distribution Function" to get 4.700. Quite a difference, isn't it? Decumulative is a new name, people started using it in 1990s with the advent of non-additive integration.

In probability and statistics everything could be done equally well with CDF or 1-CDF; it doesn't make much difference. It's all about more intuitive and shorter name.

History is full of misnomers just because the first guy didn't think much about the name. At the end you guys have to choose to go with more meaningful new names or with older classical once.

I will add decumulative to wikipedia tomorrow, to reinforce the new standards :)

StefanKarpinski commented 12 years ago

Decumulative is a new name, people started using it in 1990s with the advent of non-additive integration.

This is precisely why this is a lousy choice — it's not an actual word. People don't know what it means and they don't use it. Googling decumulative returns no definitions or wikipedia pages. Googling "non-additive integration" returns some scholarly results, but I've never heard of it before and I've done a lot of measure and integration theory. I'm sorry, this is really not a mainstream thing.

vspinu commented 12 years ago

Repeat myself here, complementary cdf was used for a century. Decumulative for a decade and many people haven't even heard about it. Let's talk in another 10 years to see how things evolve. It's just a better name.

The latest standards in measure theory are http://www.amazon.com/Handbook-Measure-Theory-two-volumes/dp/0444502637 and http://books.google.nl/books?id=WfGJcgAACAAJ&dq=non+additive+integral&hl=en&sa=X&ei=2UO9T52FAams0QX66rE9&sqi=2&ved=0CDMQ6AEwAA

I don't know how these books use the term but that should be the standard, whatever it is.

Stefan Karpinski reply@reply.github.com on Wed, 23 May 2012 12:47:28 -0700 wrote:

That seems pretty decisive to me. Similarly, on plain old Google:

Complementary Cumulative Distribution Function: 79,800 Decumulative Distribution Function: 3,580

Not quite as extreme a difference, but it seems like ccdf is an overwhelmingly more common term.

StefanKarpinski commented 12 years ago

The Handbook of Measure Theory does not contain the term "decumulative" anywhere.

vspinu commented 12 years ago

The Handbook of Measure Theory does not contain the term "decumulative" anywhere.

How do you search? Amazon contains only 10 pages contents preview. I will try to find the book tomorrow, to have my piece of mind.

Not a big deal anyways. It's just a name. Both are fine.

StefanKarpinski commented 12 years ago

I used amazon and google book searches. I was under the impression that these both search the entire contents and show you snippets — you just can't view the entire book. It's also not in the book's index. Regardless of which phrase is better, ccdf will cause less confusion and mean something to more people without them having to look it up.

dmbates commented 12 years ago

Right now I am using the name ccdf for the generic but it can be easily changed if I ever get the macros working. I prefer to avoid error-prone cut-and-paste programming and would like to define methods like

function pdf(d::Logistic, x::Real)
    ccall(dlsym(_jl_libRmath, :dlogis), Float64, (Float64,Float64,Float64,Int32), x, d.location, d.scale, 0)
end

within a macro. However, I am having difficulty generating an expression for the first argument. Here is my most recent unsuccessful attempt


macro _jl_cont_dist_2par(typ, base, p1, p2)
    ddsym = dd = symbol(strcat("d", string(base)))
    tt = strcat("d::", string(typ))
    if (string(base) == "norm")
        ddsym = :dnorm4
    end
    quote
        function pdf($tt, x::Real)
            ccall(dlsym(_jl_libRmath, $string(ddsym)),
                  Float64, (Float64, Float64, Float64, Int32),
                  x, $p1, $p2, 0)
        end
    end
end

In actual use the macro will go on to define several other methods but the patterns will be similar.

Suggestions would be welcome.

StefanKarpinski commented 12 years ago

You can't just build a string — our expressions are real ASTs, not just strings that get parsed. You could build a string and then parse it to get an AST, but that's a bit crazy. Let me play around a bit and come up with something helpful...

StefanKarpinski commented 12 years ago

This works:

julia> T = Logistic
Logistic{T}
Methods for generic function Logistic
Logistic{T}(T,T)

julia> f = :(:dlogis)
"dlogis"

julia> p1 = :(:mean)
:mean

julia> p2 = :(:scale)
:scale

julia> @eval function pdf(d::($T), x::Real)
           ccall(dlsym(_jl_libRmath, $f),
                 Float64, (Float64, Float64, Float64, Int32),
                 x, d.($p1), d.($p2), 0)
       end

julia> pdf(Logistic(1.0,2.0), 1.5)
0.12306704136879916

The double-quoting business is because if you want the expression for creating a symbol in the final AST, not just the symbol itself. So you'd probably want to write the macro like this:

macro _jl_cont_dist_2par(T, f, p1, p2)
    f = expr(:quote,f)
    p1 = expr(:quote,p1)              
    p2 = expr(:quote,p2)              
    quote                             
        function pdf(d::($T), x::Real)
            ccall(dlsym(_jl_libRmath, $f),
                  Float64, (Float64, Float64, Float64, Int32),
                  x, d.($p1), d.($p2), 0)
        end
    end
end

and then use it like this:

@_jl_cont_dist_2par Logistic dlogis mean scale

Basically, you want to specify the function name and parameter names as symbols but then doubly quote them.

dmbates commented 12 years ago

For some distributions I have added methods for the mean, median and variance generics. The std generic from base/statistics.jl has a single method for a Distribution that takes the square root of the variance. I just saw that I should have used the var generic for variance and not introduced a new generic.

There are many more methods for mean, median, var, kurtosis, etc. that could be added using formulas from, say, wikipedia.org pages for the distributions. I ran out of energy.

StefanKarpinski commented 12 years ago

I think that using the same generic functions for var, mean, median, etc. is the right way to go. The generic function embodies "give me the mean of this thing" — if that thing is a distribution, this is what it should do; the the thing is a vector of values, then it should give you the sample mean.

ViralBShah commented 12 years ago

This issue has achieved its goal for kicking off the new RNG interface design and implementation. Is it worthwhile to leave it open for further discussion, or should we close it?

dmbates commented 12 years ago

I think the issue could be closed. It seems to me that further discussion, if needed, could take place under newly created issues.