Issue with clustering example in chapter 3

gustavdelius commented 11 years ago

There is a subtlety in the clustering example in chapter 3 that should probably be pointed out to the reader, as it is quite instructive. There is a symmetry in the model. Exchanging cluster 0 and cluster 1 and at the same time exchanging p and 1-p leads to an equivalent configuration with the same probability. The prior clearly has that symmetry. The true posterior will have the same symmetry. The distribution obtained by MCMC is very far from symmetric. Markov chain has been trapped in a metastable state. That is good for us, because that distribution is easier to interpret. The true posterior distribution with its bimodal marginals would probably confuse the reader.

We can either point out the above to the reader, or we can change the example to a model that does not have the symmetry. For example one could introduce another positive hyperparameter and set the mean of the second cluster to be different from the mean of the first cluster by this parameter. An exponential prior could be chosen for that new parameter. I implemented that and the convergence of the MCMC is better.

CamDavidsonPilon commented 11 years ago

I'm glad you pointed this out. I have been worried about this example exactly for this non-identifiablity reason. Ironically, it is quite an awful example to demonstrate simple MCMC convergence.

I like your proposed solution, do you mean something like the following?

mean1 = mc.Normal( "mean1", 150, 0.0001 )
difference = mc.Exp( "meandifference", 100 )

@deterministic
def mean2( m = mean1, d = difference )
    return m + d

gustavdelius commented 11 years ago

Yes, that is roughly what I had in mind. What I actually implemented was

distance = mc.Exponential("distance", 0.05)
midpoint = mc.Normal( "midpoint", 150, 0.0001 )

@mc.deterministic
def centers( m = midpoint, d = distance ):
    c = np.empty(2, dtype=object)
    c[0] = m-d
    c[1] = m+d
    return c

which is essentially the same as your proposal. The crucial feature is that center_0 is always less than center_1 because distance is a positive random variable. Using the above I did not have to change anything else in your code.

CamDavidsonPilon commented 11 years ago

I was thinking about this, could the problem also be resolved by specifying more accurate priors? It was naive of me to specify both priors to be Normal(150, \sigma). Instead I could be more specific about the priors and set one to be Normal( 120, \sigma) and the other Normal( 200, \sigma). This, while technically not removing the identifiablity problem, does bias the inference away from it.

Technically, my original "hunch" was that there were two present, and by specifying identical priors, I was discarding that hunch.

I ran some code, and the chains do seem to settle nicer. What are your thoughts?

gustavdelius commented 11 years ago

It depends on whether we are talking about a practical matter or a matter of principle. In practice the MCMC nicely stays close to one of the quasistationary states on the timescale for which we run the simulation. In principle however, this is not close to the true posterior distribution and if the Markov chain was allowed to run for very long times, eventually by chance it would switch over to the other quasistationary state. Then, after an even longer time, it would switch back again. In the really long run it would have switched forth and back enough for the simulation to approximate the true posterior. If you start with an asymmetric prior then the true posterior will no longer be exactly symmetric, but it will still show some vestige of the symmetry and will not be well approximated by the quasistationary distribution of the Markov chain. I am not an expert in MCMC and so don't believe anything I say. I am currently only expressing my intuition about this.

CamDavidsonPilon commented 11 years ago

This is the same intuition I had. With the asymmetric prior, the resulting posterior is asymmetric (still bimodal but not equal in height.), so one mode will dominate the (short-run) MCMC traces. By establishing specific priors, we can emphasize which mode we wish to dominate.

CamDavidsonPilon / Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

Issue with clustering example in chapter 3 #34