Why you should always run several independent MCMCs for checking convergence for the DE population MCMC family

florianhartig commented 2 years ago

The DE sampler family (DE, DEzs, DREAM, DREAMzs) runs internally several chains. This is the principle of this sampler class, and they are therefore called population MCMCs.

If you plot the trace plot for a single MCMC of this sampler class, you will see several MCMC chains. Technically, the object we are operating on here is a an MCMC list, with the same structure as, e.g., 3 independent Metropolis MCMC runs. There, we can obviously also run Gelman Rubin diagnostics on these chains without producing an error.

The question is: is this a good idea, or do you run several independent population MCMCs to assess convergence?

I don't have the reference here right now, but if I recall correctly, Cajo ter Braak proved in one of the DE papers that the internal chains sampling independently once the sampler is ergodic, i.e. the Z matrix is fully representative of the target distribution. This would suggest that it is sufficient to run a single DE sampler and assess convergence.

The problem with this is this conclusions lies directly in the premises: the proof only states that you can assess convergence with the internal chains (independence) once the sampler has converged. But the very reason why we run convergence checks is that we want to check if the sampler has converged. So, what happens if the sampler didn't convergence yet.

My experience is that before convergence, there is notable correlation between internal chains in the DE samplers. Here an example calibrating the VSEM model, running two DEzs MCMCs with 3 internal chains each:

You can clearly see that internal chains are closer to each other than the 2 independent MCMCs. I paste the code below for reference.

For this reason, checking convergence diagnostics (e.g. Gelman Rubin) on the internal chains only will lead to overoptimistic results, and you should always run several independent MCMCs (usual recommendation: set nrChains = 3-5).

library(BayesianTools)

# Create input data for the model
PAR <- VSEMcreatePAR(1:1000)
plot(PAR, main = "PAR (driving the model)", xlab = "Day")

# load reference parameter definition (upper, lower prior)
refPars <- VSEMgetDefaults()
# this adds one additional parameter for the likelihood standard deviation (see below)
refPars[12,] <- c(2, 0.1, 4) 
rownames(refPars)[12] <- "error-sd"
head(refPars)

# create some simulated test data 
# generally recommended to start with simulated data before moving to real data
referenceData <- VSEM(refPars$best[1:11], PAR) # model predictions with reference parameters  
referenceData[,1] = 1000 * referenceData[,1] 
# this adds the error - needs to conform to the error definition in the likelihood
obs <- referenceData + rnorm(length(referenceData), sd = refPars$best[12])
oldpar <- par(mfrow = c(2,2))
for (i in 1:4) plotTimeSeries(observed = obs[,i], 
                              predicted = referenceData[,i], main = colnames(referenceData)[i])

# Best to program in a way that we can choose easily which parameters to calibrate
parSel = c(1:6, 7,8,9, 12)

# here is the likelihood 
likelihood <- function(par, sum = TRUE){
  # set parameters that are not calibrated on default values 
  x = refPars$best
  x[parSel] = par
  predicted <- VSEM(x[1:11], PAR) # replace here VSEM with your model 
  predicted[,1] = 1000 * predicted[,1] # this is just rescaling
  diff <- c(predicted[,1:4] - obs[,1:4]) # difference betweeno observed and predicted
  # univariate normal likelihood. Note that there is a parameter involved here that is fit
  llValues <- dnorm(diff, sd = x[12], log = TRUE)  
  if (sum == FALSE) return(llValues)
  else return(sum(llValues))
}

# optional, you can also directly provide lower, upper in the createBayesianSetup, see help
prior <- createUniformPrior(lower = refPars$lower[parSel], 
                            upper = refPars$upper[parSel], best = refPars$best[parSel])

bayesianSetup <- createBayesianSetup(likelihood, prior, names = rownames(refPars)[parSel])

# settings for the sampler, iterations should be increased for real applicatoin
settings <- list(iterations = 10000, nrChains = 2)

set.seed(123)
out <- runMCMC(bayesianSetup = bayesianSetup, sampler = "DEzs", settings = settings)

plot(out)

twest820 commented 2 years ago

you should always run several independent MCMCs (usual recommendation: set nrChains = 3-5).

Hi Florian, I think this mostly shifts the question to how to verify a particular choice of nrChains is providing an accurate Gelman-Rubin statistic. Which isn't something I've seen addressed in the MCMC papers I've read through, so might be more of a research question than a package question.

There's also yesterday's question about burnin.

if I recall correctly, Cajo ter Braak proved in one of the DE papers that the internal chains sampling independently once the sampler is ergodic

ter Braak 2005 has this bit for DE-MCMC. It's repeated in ter Braak and Vrugt 2008 for DEz:

Because the joint stationary pdf of the N chains factorizes to π(x1)×…×π(xN), the states x1 … xN of the individual chains are independent at any generation after DE-MC has become independent of its initial value. This feature of population MCMC samplers, first noticed by Mengersen and Robert (2003), is important for monitoring the convergence of a DE-MC run with the R-statistic of Gelman et al. (2004).

Checking through what are hopefully the most relevant papers, I'm not seeing this claim is explicitly made for DEzs, DREAM, or DREAMzs, though it doesn't seem unreasonable to read the texts as assuming this property is general to the DE-MCMC family of algorithms. Unfortunately, Mengersen and Robert 2003 is paywalled for me and I'm not spotting anything else particularly relevant with a bit of citation checking and keyword searching.

One way of framing this might be to ask, given a fixed number of iterations provided by available computing capacity and runtime, what is the optimal allocation of those iterations between between chain length and number of chains. I know there was discussion of fewer longer chains versus more shorter ones in the 1990s but am not finding more recent works which are relevant here.

florianhartig commented 2 years ago

OK, the chain length / number is also a valid discussion, but you can only answer it if you can asses convergence in the first place.

The text that you quit from ter Braak and Vrugt 2008 leaves room for interpretation - "is important for monitoring the convergence of a DE-MC run" - could mean anything between they think a single MCMC run can be monitored with Gelman-Rubin, or that it shouldn't.

There is also the question whether you want to use Gelman-Rubin to monitor burnin or convergence. This is often misunderstood, and most people conflate both approaches. Again:

If you know that the chain was stationary, or had some other means to determine it, it would be OK to use Gelman-Rubin on a single DE MCMC. Maybe you could eye-ball the chain to see when it doesn't show trends any more, but this is hard to do when running only a single MCMC. Anyway, you if you know the point, you can throw away the burn-in, and then use Gelman-Rubin on a single MCMC to assess whether the estimated distribution has converged to the true distribution.
However, you never really know if you have set in the burn-in correctly, or if the chain is still moving. And before the stationarity, internal MCMC chains are correlated. I have seen many examples where internal chains seemed to have converged, but external ones didn't. Your remarks that setting nrChains > 1 reduces convergence attests to that.

I also think that running several chains is rarely an issue in practice, as practically all users have access to multi-core systems, and the parallel run of several MCMCs therefore comes practically without any costs, aside from the inconvenience to start the same script 3 times.

florianhartig commented 2 years ago

Feel free to add more comments here, but I will close this issue now because I have added a bit of help and I think there is nothing more to be done on the coding side.

florianhartig / BayesianTools

Why you should always run several independent MCMCs for checking convergence for the DE population MCMC family #226