JuliaStats / Distributions.jl

A Julia package for probability distributions and associated functions.

Direct access to the random sampling functions #18

Closed · nfoti closed this 11 years ago

nfoti commented 11 years ago

I think that for certain applications it would be useful to be able to generate random variates (other than uniform and Gaussian) without having to instantiate distribution objects. For instance, when writing MCMC samplers the parameters of the distributions change from one iteration to the next, and the overhead of creating a new distribution object each time adds up.

I ran a simple example to compare Julia and Python. Generating 1000 samples of a 500-dimensional Dirichlet random vector with rand(Dirichlet(gam), 500) (where gam = rand(Gamma(1), 500)) takes about 150 milliseconds. Doing the same thing in Python takes about 86 milliseconds.
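
A minimal sketch of that timing comparison on the Julia side, assuming the current Distributions.jl API (exact timings will vary by machine and version):

using Distributions

gam = rand(Gamma(1), 500)            # 500 positive concentration parameters
@time X = rand(Dirichlet(gam), 500)  # draws of the 500-dimensional Dirichlet, one per column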

I wrote a simple Julia function to generate a Dirichlet random vector; its speed is comparable to the Python code when generating 1000 samples of the 500-dimensional vectors. The function is:

function rdirich(alph)
  nt = length(alph)
  t = zeros(nt)
  for i in 1:nt
    # rgamma(shape, scale) from libRmath; the scale is fixed at 1.0
    t[i] = ccall(("rgamma", Rmath), Float64, (Float64, Float64), alph[i], 1.0)
  end
  return t / sum(t)  # normalize the gamma draws onto the simplex
end

where Rmath = :libRmath as in Distributions.jl
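
For reference, a hypothetical usage sketch of rdirich under the same setup as the benchmark above (it assumes libRmath is loadable and Rmath is defined as noted):

alph = rand(Gamma(1), 500)                # example concentration parameters
x = rdirich(alph)                         # one 500-dimensional Dirichlet draw
samples = [rdirich(alph) for _ in 1:500]  # repeated draws without constructing a Dirichlet object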

I think that the current interface is useful for certain tasks, but having functions like the one above, which can generate a sample (or more than one, as in R) without having to create the object, would be really useful when the distribution object changes a lot.

I am happy to implement this functionality if you think it belongs in the package. Otherwise, I could start a discussion on julia-dev to see where this functionality should go.

Thanks.

johnmyleswhite commented 11 years ago

I've been thinking that we should do this for some time. But Toivo has noted in the past that it should be possible to get a major speedup once Julia has immutable types.

For the moment, I'd be happy to have something like this.

For MCMC, I'd personally just keep a permanent distribution object and update its parameters over time. I would think that would fix some of the slowdown you're seeing.
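
A rough sketch of the pattern being suggested here, with the caveat that it assumes the distribution's parameter field can be safely mutated in place, which the actual Distributions.jl types may not guarantee (some cache quantities derived from the parameters):

using Distributions

# Illustrative only: reuse one Dirichlet object across MCMC iterations and
# overwrite its parameter vector in place instead of constructing a new
# object every iteration.
d = Dirichlet(ones(500))
for iter in 1:1000
    new_alpha = rand(Gamma(1), 500)   # stand-in for the sampler's parameter update
    copyto!(d.alpha, new_alpha)       # update the existing object's parameters in place
    x = rand(d)                       # draw using the updated parameters
end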

nfoti commented 11 years ago

I thought about using a permanent distribution object, but I tried the same experiment I described in my original message with a fixed distribution object (that is, calling dir = Dirichlet(gam); rand(dir)) and the run time was comparable to rand(Dirichlet(gam)).

Maybe immutable types will help with this at some point, but since you don't sound opposed I'll implement wrappers for the R sampling functions.

Nick

johnmyleswhite commented 11 years ago

I think that's wise. Can I propose using types as tags rather than introducing a new API with functions like rdirich(alph)? I'd prefer: rand(Dirichlet, alpha).
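
For illustration, a minimal sketch of what such a method could look like (a hypothetical definition, not necessarily what was ultimately implemented):

using Distributions

# Hypothetical "type as tag" method: dispatch on the Dirichlet type itself and
# take the parameters directly, so the caller never constructs a Dirichlet.
function Base.rand(::Type{Dirichlet}, alpha::AbstractVector{<:Real})
    t = [rand(Gamma(a)) for a in alpha]  # one Gamma(a_i, 1) draw per component
    return t ./ sum(t)                   # normalize onto the simplex
end

rand(Dirichlet, rand(Gamma(1), 500))     # usage: no Dirichlet object created by the caller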

nfoti commented 11 years ago

Yes, no problem, I was not a fan of rdirich either.

johnmyleswhite commented 11 years ago

Great! Thank you so much for looking into this.

nfoti commented 11 years ago

Sorry to bother you, but I'm blanking on the elegant way to implement a method for rand that takes the Dirichlet type as an argument.

Thanks.

Nick

nfoti commented 11 years ago

Never mind, I figured it out. I'll add implementations and tests as I have time and will let you know when it's done.

johnmyleswhite commented 11 years ago

What happened with this? It might be good to try this out soon to see what kind of performance improvements we can obtain.

nfoti commented 11 years ago

Hi,

Unfortunately, I won't have time to work on Julia until mid-June as I have to finish my thesis, interview for postdocs, defend my thesis, and present at some conferences.

I am looking forward to getting back into Julia once my life has settled down a bit though. If you or anyone else wants to take over implementing this idea, by all means go for it.

johnmyleswhite commented 11 years ago

Ok. I'll make a pass through this in mid-May.

Good luck with everything.

nfoti commented 11 years ago

Hi,

So I think I have some time to dedicate to Julia again. Do we still think that incorporating these functions is useful? If so I think I'll take a stab at it over the next few weeks.

Thanks.

johnmyleswhite commented 11 years ago

Yes, I think this would be useful.

johnmyleswhite commented 11 years ago

I would, though, encourage doing some benchmarking to ensure that direct access is more efficient than the immutable-based implementation we have now.

nfoti commented 11 years ago

Great. I'll let you know how it goes or if I have any questions.

nfoti commented 11 years ago

I added a bunch of rand functions that take a type as the first argument and the parameters of the distribution as the remaining arguments. I have not had a chance to test them yet, but the code can be found in the newrands branch of my clone of the repository (https://github.com/nfoti/Distributions.jl). Just search for "Foti" in the code for the changes I made.

I have done a little bit of benchmarking with the gamma distribution. I found that the version of the function that takes a Gamma object is comparable in speed to the new version. Creating the object must not be the bottleneck anymore. Personally, I like not having to make an object every time though.

Looking forward to your comments.

Thanks.

nfoti commented 11 years ago

Hi,

I've done a bit more benchmarking with the sampling function for a gamma distribution, and it does appear that with the immutable type the two calls rand(Gamma(x, 1.)) and rand(Gamma, x, 1.) take the same amount of time. I think it comes down to whether both interfaces should be supported.

Thanks.

Nick
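
A rough sketch of the kind of comparison being described, assuming the rand(Gamma, x, 1.) method from the experimental newrands branch is loaded (it is not part of the released package), with placeholder values:

using Distributions

x = 2.5
@time for _ in 1:10^6; rand(Gamma(x, 1.0)); end  # construct a Gamma object on every call
@time for _ in 1:10^6; rand(Gamma, x, 1.0); end  # proposed direct interface from the branch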

johnmyleswhite commented 11 years ago

Thanks for looking into this. I want to think a bit more about it, but I'm hesitant to add a second interface. I increasingly think there's something to be said for Python's philosophy that there should be only one obvious way to do things, but I also appreciate that many people might like an alternative interface.

johnmyleswhite commented 11 years ago

After some thought, I think we should have a single interface to all distributions. If we find that we need to make that interface more modular to improve performance, let's go for that. But for now, I'm going to close this issue.