brinckmann / montepython_public

Public repository for the Monte Python Code

Best-Practice for Distributed Computing #224

Closed kabeleh closed 3 years ago

kabeleh commented 3 years ago

Say I am running MontePython on several desktop computers. What is the best practice for running this? Can I just start the chains on the computers, store the chains locally on each machine, and then collect and analyse them on one computer once I think they are long enough? Or is it better to store the chains on a network folder, so that each chain can "profit" from the intermediary results of the other chains via --superupdate? What if some machines have only rather slow access to the shared folder? Run it locally on each machine, but set up rsync to copy over the chains of the other machines from time to time?
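A rough sketch of what I have in mind for the rsync variant (host names, paths and the sync interval below are made up, just to illustrate the idea):

```sh
# Each machine runs and writes its chains locally, and periodically pulls the
# chain files written by the other machines into the same local run folder.
while true; do
    for host in machine1 machine2; do
        rsync -av "${host}:/home/user/chains/run/*.txt" /home/user/chains/run/
    done
    sleep 600   # e.g. every 10 minutes
done
```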

What is the recommended approach here? Thank you very much in advance.

brinckmann commented 3 years ago

Interesting problem. I/O time can be significant on slow systems. You can in principle mitigate this by using larger numbers for data.write_step (specified in the param file) and for the run options --superupdate and --update, but it's hard to know a priori what would be most efficient. I'm not able to give you any solid recommendations and you'll have to use some trial and error if you choose to go in this direction. But there's no reason why it wouldn't work, as it would be just like running in parallel, provided point 1) below is fulfilled. If the options --superupdate and --update are enabled, MontePython shares all information via files, i.e. through the chains, through the covmat, and through jumping_factor.txt.
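For reference, a minimal sketch of how those knobs are typically passed (the param file name, output folder and numerical values below are only placeholders; data.write_step itself is set inside the .param file, not on the command line):

```sh
# Placeholder run command: larger --update / --superupdate values mean the
# shared files (covmat, jumping_factor.txt) are touched less often.
python montepython/MontePython.py run \
    -p input/example.param \
    -o chains/example \
    -N 100000 \
    --update 300 --superupdate 20
```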

I can give you some general guidelines for running across multiple systems and combining chains later:

1) Obviously you need to make sure you run with identical settings: same class version, same likelihood settings, same param file settings, same run settings.

2) You need to use the same covmat for all runs. This is easily managed with --update if the machines share a network folder. But if they do not share a directory, you need to input a covmat with the run option -c and disable --update (by passing a 0). Producing a suitable covmat for the actual run will most likely involve restarting a couple of times: you would first do a run with no covmat or a best-guess covmat, then, after you have accumulated a decent number of points, compute a new covmat with the info flag --want-covmat and pass it as input to the restarted run (sketched below). This was actually the standard workflow prior to the introduction of update in version 2.2, but it's a bit tedious, as you don't know when the right point is to stop and compute a covmat, and it may take a few iterations (which is what update does for you).

3) You should turn superupdate off (by passing a 0). This means you have to be a little more careful with what you use as your jumping factor, controlled by the run flag -f. A value of -f 2.4 is recommended for perfect Gaussian posteriors, while Planck18 prefers a value of -f 1.9. Note that an initial run with a small value, e.g. -f 0.5 (which is not suitable for a run you want to analyze), might help you get a starting covmat in step 2) above.
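A rough sketch of the manual covmat workflow from point 2); folder names, step counts and the exact name of the covmat file written by info are placeholders and may differ on your setup:

```sh
# (a) exploratory run, possibly with a small jumping factor to accumulate accepted points
python montepython/MontePython.py run -p input/example.param -o chains/explore -N 20000 -f 0.5

# (b) compute a covariance matrix from the accumulated points
python montepython/MontePython.py info chains/explore --want-covmat

# (c) production runs on every machine, all with the same input covmat,
#     update and superupdate disabled
python montepython/MontePython.py run -p input/example.param -o chains/production \
    -N 100000 -c chains/explore/explore.covmat --update 0 --superupdate 0 -f 1.9
```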

Points 1-3) ensure that your chains are Markovian, which is important for obtaining reliable results with the properties you're looking for when running MCMC chains. E.g. combining chains that used a different covmat might bias your results, while runs with different settings would simply be nonsense. Point 3) is less crucial: it only modulates the amplitude of the jumps, not the preferred degeneracy directions, so it's less likely to bias your results, but it's also less beneficial and not worth the uncertainty of not knowing whether it mattered.

Good luck!

Best, Thejs

kabeleh commented 3 years ago

Thank you very much for your answer! I went for using a shared folder. Since all machines then share the same log.param, I had to make sure the absolute paths are identical. I created /home/user/computation/ on each machine (where the user also has the exact same name), created the folders CLASS, MontePython, and Planck in this folder on each machine, and used the same CLASS source code, the same version of MontePython, and also the same plik.
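Concretely, the layout replicated on every machine looks roughly like this (just a sketch; how the shared chain folder is actually mounted is specific to each setup):

```sh
# Identical prefix on every machine, so the absolute paths stored in
# log.param resolve everywhere.
mkdir -p /home/user/computation/CLASS \
         /home/user/computation/MontePython \
         /home/user/computation/Planck
# the shared chain folder then lives below the same prefix on each machine
```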

Performance-wise I have no robust data. I have some older benchmarks for one machine, where I obtained 0.04 seconds / step as a best result. Now the time / step went up from 0.04 s to 0.06 s, but this cannot be compared directly, as the previous benchmark was done with different likelihoods. Furthermore, the machine I benchmarked is now the host, i.e. it also has to manage the network traffic from the remote chains.

In the remote consoles, I see less activity than in the console of the host. Can this be due to I/O throttling through the network? I.e. does a chain wait until it has written everything before proceeding with the next steps? Or does it just issue a write command every 5 steps, proceed with the next 5 steps, and keep everything in memory until it is written into the chain files? The network traffic from the remote machines does not seem particularly heavy, around 200 KiB/s on gigabit Ethernet connections, and the host is connected via a 10 Gbit link, so the load is still low. How does data.write_step affect the memory consumption? The memory consumption is 19% on the machine with the lowest consumption, running 1 chain on a dual core with hyper-threading, and 58% on the machine with the highest consumption, running 8 chains on 16 physical cores / 32 logical cores.

All in all I can say that it was surprisingly easy to set it up, as long as the same / similar enough operating systems are installed on the machines and one manages somehow to use the same absolute paths. Performance-wise I cannot say much, as this would require me to do some proper benchmarks first. So far I can only compare to previous runs. It seems the performance is more or less on par, but with the benefit that now several machines can work together on the same run. But this last statement has to be taken with a whole load of salt...

Thank you again, @brinckmann , for your detailed answer!

brinckmann commented 3 years ago

The chains will "pause" while they're writing and continue after it's done, so that's why I'd expect a relevant slow down if that part is slow.

I wouldn't expect it to take up significant memory if you increase the write step parameter; after all, it's just storing one more line of information. I might try setting it to 50 or 100.
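For reference (sketch with a placeholder file name): write_step is set in the .param file rather than on the command line, so the line to raise to 50 or 100 would look like this.

```sh
# check the current value in your param file
grep -n "write_step" input/example.param
# the line to edit there looks like:
#     data.write_step = 50
```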

I notice you run chains with 2 cores per chain. Of course, on some systems you can't do better than that, but I usually recommend around 6-8 cores per chain. Much of the time per step is spent in processes that are parallelized (e.g. class), and for Metropolis-Hastings longer chains usually trump more chains in terms of convergence efficiency. My preference is typically to run 6-8 chains, each on 6-8 cores. With fewer chains the R-1 statistic becomes increasingly unreliable (I wouldn't use fewer than 4 chains if it can be avoided), while with fewer cores per chain each step takes quite a while.

As an example, if I had only a 16-core and a 2-core system, I would probably only run on the 16-core system, with 4 chains on 4 cores each. If I had a 4-core system, an 8-core one, and a 16-core one, I'd probably still do 4 cores per chain. But if I had four 4-core systems, four 8-core systems, and a 16-core system, I'd probably run four chains on 4 cores each (the 4-core systems) and 6 chains on 8 cores each (the 8- and 16-core systems). Given you went through the effort of setting things up on the 2-core system, I guess you might as well run on there too, even if it's only one chain with two cores, but I'd expect it to be slow at accumulating points unless the cores are very fast.
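As a sketch of how such a split typically looks in practice (placeholder numbers and file names; this assumes the usual MPI launch, where each MPI process runs one chain and OMP_NUM_THREADS controls the threads used by class):

```sh
# 6 chains with 8 OpenMP threads each, i.e. 48 cores in total
export OMP_NUM_THREADS=8
mpirun -np 6 python montepython/MontePython.py run \
    -p input/example.param -o chains/example -N 100000
```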

Best, Thejs

kabeleh commented 3 years ago

Thanks again for your reply.

Concerning data.write_step: I know it is bad practice to change the log.param manually after a run has started. But in this case, could I stop the chains, change data.write_step in log.param, and just restart the chains?

Concerning the number of cores: I actually did some benchmarking on the 16-core machine, as it was the first one I had full access to and it was the first time I worked with MontePython, so I wanted to know how it performs. For instance, I performed 10'000 steps in total, limited the number of threads that could be used, varied the number of MontePython chains, and got the following results:

10000 steps in total:

| Total threads | CLASS threads per chain | MontePython chains (-np) | Steps per chain (-N) | Time [s] | Time [s] / step |
|---|---|---|---|---|---|
| 16 | 1 | 16 | 625 | 610 | 0.061 |
| 16 | 2 | 8 | 1251 | 594 | 0.059 |
| 16 | 4 | 4 | 2500 | 607 | 0.061 |
| 16 | 8 | 2 | 5000 | 2932 | 0.293 |
| 16 | 16 | 1 | 10000 | 806 | 0.081 |
| 32 | 1 | 32 | 312.5 | | |
| 32 | 2 | 16 | 1250 | 966 | 0.097 |
| 32 | 4 | 8 | 1252 | 402 | 0.040 |
| 32 | 8 | 4 | 2501 | 439 | 0.044 |
| 32 | 16 | 2 | 5000 | 3010 | 0.301 |
| 32 | 32 | 1 | 10000 | 1064 | 0.106 |
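(For reference, a scan like this could be scripted roughly as follows; this is only a sketch of the idea with placeholder names, not the exact commands used for the numbers above. The product of threads per chain and number of chains is kept at the total core count, and -N at 10000 divided by the number of chains.)

```sh
TOTAL=16
for CHAINS in 16 8 4 2 1; do
    export OMP_NUM_THREADS=$((TOTAL / CHAINS))
    mpirun -np "$CHAINS" python montepython/MontePython.py run \
        -p input/example.param -o "chains/bench_${CHAINS}" -N $((10000 / CHAINS))
done
```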

Interestingly, the performance with two chains is worse than with just one chain. Best-performing was the "4 cores per chain" run. But of course, I should have considered that "for Metropolis-Hastings longer chains usually trump more chains in terms of convergence efficiency." Thank you very much. I will go with 6-8 cores per chain from now on! :)

brinckmann commented 3 years ago

To clarify: as you can guess from the R-1 criterion, there are two primary strategies for beating down the variance (improving convergence): one is to have many shorter chains, the other is to have fewer, longer chains. The emcee algorithm specifically uses the former strategy, while Metropolis-Hastings takes advantage of the latter. I don't think a statistical study has been done to see whether e.g. 6x8 cores is better than 12x4 cores, so it's possible I'm wrong. If you find different results than conventional wisdom indicates, that could be interesting. It could also reasonably depend on your cosmological model and likelihoods.

I suspect changing data.write_step alone is not enough. You should look for "write_step" later in your log.param to see if it has been saved as part of the data array. But yes, I should think you could change it between jobs without problems. That said, I've never tried it, so all the usual caveats and warnings apply (maybe make a backup directory).
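A sketch of what that could look like in practice (folder and chain file names are placeholders, and I haven't tested this exact sequence):

```sh
# back up the run folder first, as suggested above
cp -r chains/example chains/example_backup
# check where write_step appears in log.param
grep -n "write_step" chains/example/log.param
# edit data.write_step in log.param (and in the .param file for consistency),
# then restart from the existing chain files with -r, e.g.
python montepython/MontePython.py run -o chains/example \
    -r chains/example/2021-05-04_100000__1.txt -N 100000
```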

Best, Thejs

kabeleh commented 3 years ago

Thank you for the clarification! I will keep this in mind for future runs. If I ever find the time, I'll do a proper benchmark.

Neither the word "step" nor the word "write" appears more than once in the log.param. It didn't complain about anything after I changed data.write_step.

Thank you for your support! It's fun to work with MontePython.