Comments on "Bayesian robust correlation with Stan in R"

Hi, Nice blog post. Since the blog post didn't include contact info, I'll make an issue and I hopefully you find some of my comments useful.

Remove for loop to get 20x speedup (compute computationally costly common terms depending on nu, mu and sigma only once)
Generate x_rand in generated quantities for small extra speedup
Don't use uniform().
- It's not needed for rho, which has implicit uniform prior on constrained range.
- Don't use uniform to constrain parameters, so for sigma it's better to use, e.g. half-normal.
nu below 1 is unlikely to be sensible, as then the distribution has no finite moments (except 0th moment) so I would use constraint
Exponential(1/30) gives a lot of mass for very large values of nu. I prefer gamma(2,0.1); proposed and anlysed by Juárez and Steel (2010). See also discussion in https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations.
Traces are too crowded and it's not really possible to see the convergence from those. It would be nice if you would also print(cor.clean) to show n_eff's and Rhat's (they look good)
It would be usually better to not initialize all the chains with same inits, as it makes the adaptation in warm-up and convergence diagnostic less reliable. For this specific model it may work, but it would be good to mention that in general it's not recommended. It's better to use overdispersed inits for the chains.
I think you are running way too long warmup for this easy problem.
I think you are running way too long chains, it seems n_eff about 2000 would give you at least 2 digit accuracy for rho, which is probably sufficient. The relative efficiency seems to be around 0.5, so no more than 4000 post-warmup draws combined from different chains would be sufficient.
It would be better to run 4 chains, instead of 2 or 1. With the faster Stan code below, there should not be excuse for running 4 chains :)

Modified Stan code

data {
    int<lower=1> N;  // number of observations
    vector[2] x[N];  // input data: rows are observations, columns are the two variables
}

parameters {
    vector[2] mu;                 // locations of the marginal t distributions
    real<lower=0> sigma[2];       // scales of the marginal t distributions
    real<lower=1> nu;             // degrees of freedom of the marginal t distributions
    real<lower=-1, upper=1> rho;  // correlation coefficient
}

transformed parameters {
    // Covariance matrix
    cov_matrix[2] cov = [[      sigma[1] ^ 2       , sigma[1] * sigma[2] * rho],
                         [sigma[1] * sigma[2] * rho,       sigma[2] ^ 2       ]];
}

model {
  // Likelihood
  // Bivariate Student's t-distribution instead of normal for robustness
  x ~ multi_student_t(nu, mu, cov);

  // Uninformative priors on all parameters
  sigma ~ normal(0,100);
  mu ~ normal(0, 100);
  nu ~ gamma(2, 0.1);
}

generated quantities {
  vector[2] x_rand;             // random samples from the bivariate t distribution
  // Draw samples from the estimated bivariate t-distribution (for assessment of fit)
  x_rand = multi_student_t_rng(nu, mu, cov);
}

Hi Aki,

Thanks a lot for your comments! It's true that this is not the best format for discussion. I just intended this to be a kind of manual, but I think I'll turn it into a proper blog post soon.

Your suggestions are really interesting and I agree with all of them, let me make some comments below.

Remove for loop to get 20x speedup (compute computationally costly common terms depending on nu, mu and sigma only once)

I wasn't sure whether the multivariate t could be vectorised in Stan (as each sample is already a vector), so I just played safe and left it as in JAGS. (I could also have tried to see if it worked without the loop, but then I would have risked failure!)

Generate x_rand in generated quantities for small extra speedup

Ah, that makes sense. I didn't remember about the rng functions, so when I first tried to include x_rand it in generated quantities using the ~ operator it didn't work.

Don't use uniform(). It's not needed for rho, which has implicit uniform prior on constrained range. Don't use uniform to constrain parameters, so for sigma it's better to use, e.g. half-normal.

Ah, I didn't know Stan has uniform default priors!

Also, I was looking into alternatives to the uniform (half-Cauchy, half-normal), but I understood that while those behave better on lower values, the uniform is supposed to be less informative when there is little data? (http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf)

I presume that for this model it makes no difference, but I was just curious on why the uniform is so undesirable. I imagine that a very weak uniform prior can slow down convergence. Since prior selection is such a tricky thing, I stuck to the original model by Rasmus Bååth, but with even wider ranges on the uniforms.

nu below 1 is unlikely to be sensible, as then the distribution has no finite moments (except 0th moment) so I would use constraint

I read about nu < 1 in this post: http://doingbayesiandataanalysis.blogspot.co.uk/2015/12/prior-on-df-normality-parameter-in-t.html

It seems that there is no risk in enforcing nu ≥ 0 instead of nu ≥ 1, but since both take the same amount of coding in Stan, I see it's better practice to do <lower=1>.

Exponential(1/30) gives a lot of mass for very large values of nu. I prefer gamma(2,0.1); proposed and anlysed by Juárez and Steel (2010). See also discussion in https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations.

Thanks for the link, it's really useful! I helps also to understand why some priors are less desirable.

It would be usually better to not initialize all the chains with same inits, as it makes the adaptation in warm-up and convergence diagnostic less reliable. For this specific model it may work, but it would be good to mention that in general it's not recommended. It's better to use overdispersed inits for the chains.

Yes, I realised that it makes little sense to run multiple chains with the same inits. I think it would make no practical difference to just use default inits for this model, so I think I'll get rid of that part.

I think you are running way too long warmup for this easy problem. I think you are running way too long chains, it seems n_eff about 2000 would give you at least 2 digit accuracy for rho, which is probably sufficient. The relative efficiency seems to be around 0.5, so no more than 4000 post-warmup draws combined from different chains would be sufficient.

I know, I just love overkilling! Actually, I was trying to see if running for quite long would give me probability estimates other than 0 and 1 in the later part of the post, but rho is too extreme for that. The function I wrote uses iter=6000, warmup=1000 and chains=1 by default, but I'll follow your advice and use 4 chains and less iterations.

Thanks again for all your input, it's really useful knowledge!

Best, Adrian

I wasn't sure whether the multivariate t could be vectorised in Stan (as each sample is already a vector)

Note that the type had to changed from matrix to array of vectors to make this work.

(http://www.stat.columbia.edu/~gelman/research/published/taumain.pdf)

You can see in prior recommendations wiki that Andrew now partially disagrees with the past Andrew, It's on of the problem of with journals, that they don't allow changing the article afterwards.

I was just curious on why the uniform is so undesirable. I imagine that a very weak uniform prior can slow down convergence

Logical constraints are ok, like rho logically has to be between -1 and 1, but there is rarely information which give sharp upper limit for scale like sigma in this model and then decreasing tail as soft threshold takes int account the uncertainty that we don't know any sharp limit.
If prior theta ~ uniform(a, b) and a or b is in the range of the parameter, the uniform prior causes sharp edge in the log density, and HMC doesn't work with well with those.
Thick tails can cause slow mixing for MCMC,

Hi Aki,

Thanks for the explanations! I see the point against the uniform when there is no upper limit. But I was thinking about whether 100 is a large enough sigma for a flexible noninformative normal prior on mu. What if someone comes with data in extremely large/small values, would that be covered by your proposed prior?

Tangentially, I'd be very interested to know what would be your parameter choice for a Dirichlet prior that is uninformative and relatively easy to sample from. We have a model where the likelihood is multinomial, with the parameters of the multinomial given by the product of two Dirichlets. Currently we are setting all parameters in each Dirichlet to 0.5, which is supposed to be the Jeffrey's prior for this distribution, but we don't know how good that is compared to the default Stan prior or, say, setting all parameters to 1. The model currently has quite slow mixing and very small n_effs; I wonder if this is because Dirichlets are intrinsically hard to sample from.

Thank you very much for your help!

Hi Aki, I've posted an updated version of the post, including your comments and your suggested model. It's interesting to see that the distribution of x_rand is much less spread in your model than in the original one! https://baezortega.github.io/2018/05/28/robust-correlation/

But I was thinking about whether 100 is a large enough sigma for a flexible noninformative normal prior on mu. What if someone comes with data in extremely large/small values, would that be covered by your proposed prior?

If you use 100000, then someone can still come with data in extremely large/small values. Priors should be set based on at least weak information on the magnitude of the measurements. In this simple model, and with not really small $n$, uniform prior might work, too.

Tangentially, I'd be very interested to know what would be your parameter choice for a Dirichlet prior

I recommend to ask this in Stan discussion forum http://discourse.mc-stan.org

I've posted an updated version of the post

Great! Just one small thing...

Gelman-Rubin convergence statistic (R-hat)

Gelman-Rubin paper doesn't describe the current split,-version and Gelman recommends calling the version in Stan as split-R-hat, and it's described in BDA3 and Stan manual.

You could post a link to your post also in Stan discussion forum http://discourse.mc-stan.org

baezortega / bayes

Comments on "Bayesian robust correlation with Stan in R" #1