Prototyping checkpoint

fredo-dedup commented 11 years ago

Syntax:

Are Ranges for specifying the number of steps and the burn-in a useful shortcut? If that doesn't feel natural to an average user, it might be a bit too much, and we could revert to the more standard `run(tasks, steps=., burnin=.)`.
Same question for using keyword args for steps and burnin: we could revert to standard positional arguments, since the function call has few arguments.
In the same vein, does using `*` to build MCMCTasks feel OK to you? Maybe the pointwise multiply `.*` would be more exact...

Use of Julia Tasks:

These are nice for storing an internal state and calling the sampler when needed. But the reset call looks like a hack and may cause issues in other contexts.
I am not sure, but I vaguely remember that Tasks cannot be used on multiple cores, which would be a major showstopper. Should we be using remote calls?

Looking forward:

We need to port the remaining samplers of GeometricMCMC.jl and SimpleMCMC.jl.
We should provide basic analysis tools on the Chains: effective sample size and a summary function (the R coda library would be the ultimate reference here).
The MCMCChain type should output a more user-friendly structure for the model parameters. A potential solution is to return DataFrames, though we have to consider whether the added dependency and the familiarity users would need with this kind of structure are worth it.
The autodiff should really be expanded to include censoring and truncation, basic constructs such as if and for loops, and a larger number of available distributions.

johnmyleswhite commented 11 years ago

I think we should be pretty conservative with the API. I'm working through your code tonight and will try to give detailed responses soon.

johnmyleswhite commented 11 years ago

I'd strongly prefer using `run(tasks, steps=., burnin=.)` as the API. The use of Range and * both seem a little obscure to me.

In general, I think the components of the system should be very explicit and their combination should occur using simple English names like run.

In regard to Tasks and parallelization, I would think we'd be all right if we had a function to run a single chain forward: we should be able to run the chains in parallel using @parallel and reduce the results using hcat. (I'm less sure about the needs of a population MCMC sampler.)
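To make the shape of that proposal concrete, here is a minimal sketch in Python rather than Julia. It is not code from this repository: `run_chain`, the toy standard-normal target, and the seed-based interface are all hypothetical, and a thread pool merely stands in for `@parallel` workers. It shows one function that advances a single chain, a keyword-argument `run(..., steps=., burnin=.)` entry point, and an hcat-style reduction that puts one chain per column.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def run_chain(seed, steps, burnin):
    """Run one random-walk Metropolis chain on a standard normal target (toy example)."""
    rng = random.Random(seed)  # each chain owns its generator, so chains are independent
    x = 0.0
    samples = []
    for i in range(steps):
        proposal = x + rng.gauss(0.0, 1.0)
        # Log acceptance ratio for the N(0, 1) target: (x^2 - proposal^2) / 2.
        log_ratio = 0.5 * (x * x - proposal * proposal)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        if i >= burnin:  # discard the first `burnin` iterations
            samples.append(x)
    return samples


def run(seeds, *, steps, burnin):
    """Run one chain per seed; return rows of draws with one column per chain."""
    with ThreadPoolExecutor() as pool:
        chains = list(pool.map(lambda s: run_chain(s, steps, burnin), seeds))
    # hcat-style reduction: row i holds draw i from every chain.
    return [list(row) for row in zip(*chains)]


draws = run([1, 2, 3, 4], steps=1000, burnin=200)
```

Because each chain only depends on its own seed, the chains can be distributed across workers in any order and combined afterwards, which is the property the @parallel plus hcat idea relies on.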

Emulating Coda seems like the right approach for post-processing tools.

fredo-dedup commented 11 years ago

Thanks for your feedback! I'll remove the `Range` thing and check how `@parallel` works.

Before moving on, I'll also wait for @scidom to give his opinion (all other comments welcome of course).

dmbates commented 11 years ago

For post-processing tools, Deepayan Sarkar and I created a fast method of determining HPD (Highest Posterior Density) intervals based on a sample from the posterior density, assuming the density is unimodal. It relies on the fact that the HPD interval is the shortest interval with a given probability content, so we just need to calculate the difference between the f'th quantile and the (1+f-alpha)'th quantile for f between 0 and alpha, and keep the f that gives the smallest difference.

Unfortunately, we didn't publicize the technique much.
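For readers who want to see the shortest-interval idea spelled out, here is a minimal Python sketch. It is an illustration written for this thread, not the implementation described above: the function name `hpd_interval` and the choice to use order statistics directly as empirical quantiles are assumptions. Each window of consecutive sorted draws corresponds to one value of f, and the narrowest window is returned.

```python
import math


def hpd_interval(sample, alpha=0.05):
    """Shortest interval covering about (1 - alpha) of the draws (assumes a unimodal density)."""
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must not be empty")
    # Number of draws every candidate interval has to contain.
    m = min(n, max(1, math.ceil((1.0 - alpha) * n)))
    best_lo, best_hi = xs[0], xs[m - 1]
    # Window i runs from order statistic i to order statistic i + m - 1,
    # i.e. roughly from the f-quantile to the (1 + f - alpha)-quantile with f = i / n.
    for i in range(1, n - m + 1):
        lo, hi = xs[i], xs[i + m - 1]
        if hi - lo < best_hi - best_lo:
            best_lo, best_hi = lo, hi
    return best_lo, best_hi


print(hpd_interval([0, 10, 11, 12, 13, 14, 30, 40, 50, 60], alpha=0.5))  # (10, 14)
```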

johnmyleswhite commented 11 years ago

Do you use something like Brent's method to select f? Or do you perform a grid search?

johnmyleswhite commented 11 years ago

Since we're looking for opinions, it would also be good to hear from @doobwa, @lindahua and others about their needs. I'd really like to make sure that we can build an MCMC infrastructure that is modular enough that it can be used by anyone interested in MCMC in Julia. I know that @lindahua has his own plans for a probabilistic programming interface, which we should try to maintain compatibility with. And @doobwa wrote some MCMC code that I'm currently revising for speed, which we should make sure ends up in this repo.

My sense (which may prove naive) is that we can define a set of data structures and interfaces that is shared across all sampling algorithms, all probabilistic programming dialects and all post-processing tools. I'd like to avoid the weirdnesses that come up when trying to do MCMC in R, where you have multiple competing standards (e.g. the existence of coda.samples and jags.samples in rjags).

dmbates commented 11 years ago

@johnmyleswhite I would usually use a grid search, using a partial sort to get the smallest alpha'th fraction of the values and the largest alpha'th fraction. Of course, working in R I went to great pains to vectorize the calculation and that may not be the best option in Julia.
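A rough Python sketch of the grid search with a partial sort follows. It is one reading of the description above, not the original R code: `hpd_interval_partial` is a hypothetical name, and `heapq.nsmallest` / `heapq.nlargest` stand in for whatever partial sort was actually used. The point is that only the smallest and largest tails of the sample are needed, since every candidate lower endpoint lies in the lower tail and every candidate upper endpoint lies in the upper tail.

```python
import heapq
import math


def hpd_interval_partial(sample, alpha=0.05):
    """Grid search for the shortest (1 - alpha) interval using only the two tails."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must not be empty")
    m = min(n, max(1, math.ceil((1.0 - alpha) * n)))  # draws per candidate interval
    k = n - m + 1                                     # number of candidates on the grid
    lows = heapq.nsmallest(k, sample)                 # ascending: the k smallest draws
    highs = heapq.nlargest(k, sample)[::-1]           # ascending: the k largest draws
    # Candidate i pairs the i-th smallest draw with the matching upper order statistic.
    widths = [hi - lo for lo, hi in zip(lows, highs)]
    i = widths.index(min(widths))
    return lows[i], highs[i]
```

When alpha is small, k is a small fraction of n, so most of the sample never has to be fully ordered.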

lindahua commented 11 years ago

My current focus is on variational inference and stochastic optimization (due to the need to process big data). Hence, I think I won't be working closely with MCMC in the near future (except for some specific things for Bayesian nonparametrics). I think you may go ahead and explore the API.

papamarkou commented 11 years ago

Great @fredo-dedup, I'll have access to my laptop from Thursday, when I'm returning from holidays. I will read your code and will contribute (to geometric MCMC, among other things).

ESS, asymptotic variance, Monte Carlo autocorrelation estimators and other relevant tools will be a great addition indeed.
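As a small illustration of what an ESS tool could compute, here is a Python sketch of a simple truncated-autocorrelation estimator, ESS = n / (1 + 2 * sum of positive-lag autocorrelations), cut off at the first non-positive autocorrelation. This is an assumption chosen for brevity, not necessarily the estimator coda uses or the one this package would adopt, and `effective_sample_size` is a hypothetical name.

```python
def effective_sample_size(chain):
    """Estimate ESS as n / (1 + 2 * sum(rho_k)), truncated at the first rho_k <= 0."""
    n = len(chain)
    if n < 2:
        raise ValueError("need at least two draws")
    mean = sum(chain) / n
    centered = [x - mean for x in chain]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        raise ValueError("chain is constant; autocorrelation is undefined")
    rho_sum = 0.0
    for lag in range(1, n):
        # Biased autocovariance estimate at this lag (divides by n, not n - lag).
        acov = sum(centered[t] * centered[t + lag] for t in range(n - lag)) / n
        rho = acov / var
        if rho <= 0.0:
            break  # stop summing once the estimated correlation is no longer positive
        rho_sum += rho
    return n / (1.0 + 2.0 * rho_sum)
```

The loop is quadratic in the chain length in the worst case, so a production version would likely compute the autocovariances with an FFT.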

@johnmyleswhite, the parallelisation macros sound very useful and could be integrated with the ClusterManager package.

papamarkou commented 11 years ago

Hi @fredo-dedup, I familiarized myself more with your code; produce() and pmap() have become clearer after looking at the control flow (more particularly tasks) and parallel computing sections of the Julia manual. I will contribute to the code within the next few days by i) adding some examples, ii) adding some samplers, and iii) looking at the popMCMC code I want to migrate to Julia (which seems to be the most challenging task, since it requires doing some work with the cluster managers that are still under development). More soon.

fredo-dedup commented 11 years ago

Sorry for the delay in responding @scidom. It's great that you could pick up the project so quickly considering the state of the code I left by the end of July.

Do not hesitate to file "issues" here if you have questions about specific parts of my code.

In the near term, I'll be trying to set up a benchmarking functionality and see if we can connect directly to Distributions.jl.

papamarkou commented 11 years ago

The serial MCMC code was in a very good state actually, @fredo-dedup. Thanks for doing all this initial work. I have one "issue"-question at the moment, which is how we could incorporate the adaptation of each sampler's step size; I will submit it as a separate issue. I agree that submitting questions to the repository is a good way of communicating.

I will add the RMHMC/MMALA routines here in the next few days.

I am really excited about the autodiff you ported, as I will need it to run some MCMC simulations related to my work, which is great.

papamarkou commented 10 years ago

This was our first prototyping checkpoint a while ago. Since we have addressed most of the generic points made here, I will close this issue. If there are more specific prototyping questions we have in mind, we can open more focused issues on them.

JuliaStats / Klara.jl

Prototyping checkpoint #1