brinckmann / montepython_public

Public repository for the Monte Python Code

Bad convergence due to length of each run? #296

Closed LisaGoh closed 1 year ago

LisaGoh commented 1 year ago

Hi,

I'm currently doing MCMC runs on the Planck+BAO likelihood with my own modified version of CLASS. Despite running the chains for a very long time (on 32 cores), I am getting very bad convergence. I attach a trace plot for one of the parameters (they all look similar), plotted from 3 of the chains. The values are jumping around a lot.

Currently my setup is: each run lasts 48 hours (the hard limit per job set by my cluster), and each time I start a new run I use the most recent bestfit and covmat generated so far. I am wondering if these jumps are happening because only a few points are generated within the 48 hours (I admit my version of CLASS runs very slowly ><), so the chains can never reach convergence before the job is killed, and each new run then starts over and jumps wildly at the beginning. So my question is: does the time limit on each run make a difference, or is it fine as long as I keep restarting, in which case it's just an issue with my model/likelihood implementation?

Thank you!

[Attached screenshot, 2022-09-30: trace plot of one parameter across 3 chains]

brinckmann commented 1 year ago

Hi Lisa,

I realize I'm reacting to this late, so it may no longer help, but I'll provide two suggestions anyway for posterity, as they're common issues.

Make sure you're careful with continuity of covmats across multiple restarts of a job, as this could be the issue (I can't quite tell from what you're writing). Once you start a job, the update algorithm will usually compute a suitable covmat located at your_chains_directory/your_chains_directory.covmat . If you restart a run with -r in the same directory, the correct covmat is loaded automatically, overruling whatever is passed with -c (you should NOT compute a new one manually, as that will cause problems; if you want to use a new covmat, you should start over completely with it as input). If you restart with -r in a new directory, you must pass the covmat produced in the folder of the original run, i.e. -c your_chains_directory/your_chains_directory.covmat , or, if none was produced, the original covmat used when starting the run. If the covmat changes during a run it may not converge (indeed, anything before the last covmat update is not "Markovian" and should in principle be excluded from the analysis for it to be a proper MCMC run).
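For concreteness, the two restart scenarios might look something like this (a sketch only; the chain file name, the paths, and the use of -o for a new output directory are illustrative assumptions, not from your setup):

```sh
# Restart in the same directory: the covmat at chains/myrun/myrun.covmat
# is loaded automatically, overruling -c, so don't pass one and don't
# compute a new one manually.
python montepython/MontePython.py run -r chains/myrun/2022-09-30_100000__1.txt

# Restart in a new directory: explicitly pass the covmat produced by the
# original run (or, if none was produced yet, the covmat you started with).
python montepython/MontePython.py run -r chains/myrun/2022-09-30_100000__1.txt \
    -o chains/myrun_continued -c chains/myrun/myrun.covmat
```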

Another thing to point out about restarting runs: if you use the -r option, it will start out by copying your old chains, e.g. with -N 100000 from olddate_100000__1.txt to newdate_200000__2.txt , so best practice is to remove the old chain files once the copying is done (or at least before analyzing and/or restarting again). The analyze module doesn't like old files, the update algorithm may spuriously compute covmats because it thinks the convergence is worse than it is (although this isn't very common), and the old files will also affect the printed convergence at the end unless you're careful to analyze only the newest chains.
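As a sketch of that cleanup (file names illustrative, matching the example above):

```sh
# Once the restart has finished copying olddate_100000__*.txt into the new
# newdate_200000__*.txt files, delete the old chains so the analyze module
# and the update algorithm only see the current ones.
rm chains/myrun/olddate_100000__*.txt
```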

If you're not using the -r option it's a whole different story that I can try to explain separately, if needed.

If you're still having a problem and the above doesn't help, it'd be useful if you could provide more information about your workflow and setup to help iron out any problems.

Best, Thejs

LisaGoh commented 1 year ago

Hi Thejs,

Thank you very much for your tips! I am currently starting the runs with a covmat. If, say, my run finishes before a new covmat could be computed in the chain's directory, is it OK to restart with the -r option and pass the same covmat that I started the runs with?

Regards, Lisa

brinckmann commented 1 year ago

Hi Lisa,

Yes, this is the correct approach. The idea is to keep using whatever covmat was used last, so if the code didn't compute a new one, the old one should be passed when restarting with -r .
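In other words, something like this (illustrative names again):

```sh
# No chains/myrun/myrun.covmat was produced, so keep passing the covmat
# the run was originally started with.
python montepython/MontePython.py run -r chains/myrun/2022-09-30_100000__1.txt \
    -c covmat/original.covmat
```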

Note that there's a small chance that your starting covmat is "good, but not great": good enough that the code never computes a new one, but not so good that your runs converge in a timely manner. This is very subjective, but if the convergence seems to get "stuck" around R-1 ~ 0.02-0.04, this might be why. In that case I usually compute a new covmat from the full chains and start over from scratch in a new directory with the new covmat.
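A sketch of that recipe, assuming the info module is used to produce the covmat (names illustrative):

```sh
# Analyzing the full chains writes chains/myrun/myrun.covmat among its outputs.
python montepython/MontePython.py info chains/myrun/

# Then start over from scratch in a fresh directory with that covmat as input.
python montepython/MontePython.py run -p input/myrun.param -o chains/myrun_v2 \
    -c chains/myrun/myrun.covmat -N 100000
```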

Best, Thejs