JohannesBuchner / PyMultiNest

Pythonic Bayesian inference and visualization for the MultiNest Nested Sampling Algorithm and PyCuba's cubature algorithms.
http://johannesbuchner.github.io/PyMultiNest/

MPI parallelization not working #242

Closed npaulson closed 7 months ago

npaulson commented 8 months ago

Hello!

I am running pymultinest with Python 3.9.18 and GCC 11.2.0 on CentOS Linux 7 (Core). I can successfully run pymultinest_demo_minimal.py serially, but when I try to run it with MPI:

mpiexec -n 4 python pymultinest_demo_minimal.py

I get the following output:

sampling time: 25.369629859924316s
sampling time: 25.36968970298767s
sampling time: 25.369601488113403s
sampling time: 25.36963987350464s

In other words, it's simply running inference four times in parallel instead of parallelizing the calculation.

I have dug a bit into the pymultinest source code and confirmed that pymultinest/run.py loads lib_mpi and that use_MPI is not False.

Furthermore, the mpi4py hello world example works:

mpiexec -n 5 python -m mpi4py.bench helloworld

Hello, World! I am process 0 of 5 on bdw-0220.
Hello, World! I am process 1 of 5 on bdw-0220.
Hello, World! I am process 2 of 5 on bdw-0220.
Hello, World! I am process 3 of 5 on bdw-0220.
Hello, World! I am process 4 of 5 on bdw-0220.

Any help would be greatly appreciated!

JohannesBuchner commented 8 months ago

> In other words, it's simply running inference four times in parallel instead of parallelizing the calculation.

why do you think that?

Any print that does not check the MPI rank will be run by all processes.
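
For example, guarding the timing print with a rank check makes it appear only once. A minimal sketch with mpi4py (the pymultinest.run call itself is elided here):

```python
from mpi4py import MPI
import time

rank = MPI.COMM_WORLD.Get_rank()

start = time.time()
# ... pymultinest.run(...) goes here; MultiNest distributes the likelihood
# evaluations across the MPI ranks internally ...
duration = time.time() - start

if rank == 0:
    # only the master rank reports, instead of one line per process
    print('sampling time: %ss' % duration)
```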

npaulson commented 8 months ago

Thank you for the assistance!

That makes sense; I am obviously not very fluent in MPI. With that in mind, here are my scaling results versus the number of MPI ranks (400 live points, 4-dimensional posterior):

| ranks | time (s) |
| -- | -- |
| 1 | 23.5 |
| 2 | 23.4 |
| 4 | 21.6 |
| 8 | 25.6 |
| 16 | 28.3 |
| 32 | 27.9 |

These results are not as impressive as I remember obtaining in the past. Any idea what is happening?

JohannesBuchner commented 8 months ago

Hard to tell from here. Your parallelisation may be fighting for limited resources, for example, memory.

JohannesBuchner commented 8 months ago

you saw https://johannesbuchner.github.io/UltraNest/performance.html#parallelisation ?

npaulson commented 8 months ago

I have not seen UltraNest. I will look into this and see if it solves my problems.

npaulson commented 7 months ago

Hi Johannes,

I had the chance to explore UltraNest on my computing cluster and do some informal scaling studies. I'm guessing it scales similarly to MultiNest. Perhaps you've done something similar already, but here is my study.

I developed a test function with N gaussian modes (diagonal covariance matrix, 0.4 unit half-width) in a D-dimensional unit hypercube. I used a uniform prior in each dimension, bounded by 0 and 1. The likelihood was a normal distribution whose standard deviation was set to 0.05 times the largest-magnitude value observed in the dataset. The dataset consisted of 10,000 points drawn from a Sobol sequence in the unit hypercube. The model was a single multivariate gaussian with a diagonal covariance matrix and a 0.4 unit half-width, and the inferred parameters were the location of this gaussian in the unit hypercube.
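
In code, the setup looked roughly like the sketch below (illustrative only: the mode centres, seed and exact constants are stand-ins rather than my exact script; it uses scipy's Sobol generator and UltraNest's ReactiveNestedSampler):

```python
import numpy as np
from scipy.stats import qmc
import ultranest

D = 3        # dimensionality of the unit hypercube
N = 3        # number of gaussian modes in the test function
width = 0.4  # half-width of each mode (diagonal covariance)

# "true" test function: sum of N gaussian modes at fixed (here: random) centres
rng = np.random.default_rng(1)
centres = rng.uniform(0.0, 1.0, size=(N, D))

def test_function(x):
    d2 = ((x[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * d2 / width ** 2).sum(axis=1)

# dataset: 10,000 points drawn from a Sobol sequence in the unit hypercube
X = qmc.Sobol(d=D, scramble=True).random(10000)
y = test_function(X)
sigma = 0.05 * np.abs(y).max()   # likelihood width: 5% of the largest data value

param_names = ['c%d' % i for i in range(D)]

def prior_transform(cube):
    # uniform prior on [0, 1] in every dimension
    return cube

def loglike(params):
    # model: a single gaussian mode whose centre is the free parameter vector
    d2 = ((X - params) ** 2).sum(axis=1)
    model = np.exp(-0.5 * d2 / width ** 2)
    return -0.5 * np.sum((y - model) ** 2 / sigma ** 2)

sampler = ultranest.ReactiveNestedSampler(param_names, loglike, prior_transform)
result = sampler.run(min_num_live_points=400)
```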

Here are my key observations:

Here is the raw data from the study:

| ndims | nnodes | nmodes | nranks | logz | time (s) | nlive | MPI ratio |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 3 | 1 | 3 | 0 | -157264 | 38.62 | 400 | 1 |
| 3 | 1 | 3 | 1 | -157264 | 37.76 | 400 | 1.022775 |
| 3 | 1 | 3 | 2 | -157264 | 30.83 | 400 | 1.252676 |
| 3 | 1 | 3 | 4 | -157264 | 27.82 | 400 | 1.38821 |
| 3 | 1 | 3 | 8 | -157265 | 26.8 | 400 | 1.441045 |
| 3 | 1 | 3 | 16 | -157264 | 27.21 | 400 | 1.419331 |
| 3 | 1 | 3 | 32 | -157264 | 27.74 | 400 | 1.392213 |
| 3 | 1 | 3 | 0 | -157265 | 80.68 | 800 | 1 |
| 3 | 1 | 3 | 1 | -157265 | 75.66 | 800 | 1.066349 |
| 3 | 1 | 3 | 2 | -157264 | 61.26 | 800 | 1.317009 |
| 3 | 1 | 3 | 4 | -157264 | 53.91 | 800 | 1.496568 |
| 3 | 1 | 3 | 8 | -157264 | 52.4 | 800 | 1.539695 |
| 3 | 1 | 3 | 16 | -157264 | 55 | 800 | 1.466909 |
| 3 | 1 | 3 | 32 | -157264 | 55.28 | 800 | 1.459479 |
| 5 | 1 | 5 | 0 | -167964 | 87.85 | 400 | 1 |
| 5 | 1 | 5 | 1 | -167964 | 79.92 | 400 | 1.099224 |
| 5 | 1 | 5 | 2 | -167964 | 63.2 | 400 | 1.390032 |
| 5 | 1 | 5 | 4 | -167963 | 56.19 | 400 | 1.563445 |
| 5 | 1 | 5 | 8 | -167964 | 50.76 | 400 | 1.730693 |
| 5 | 1 | 5 | 16 | -167963 | 47.53 | 400 | 1.848306 |
| 5 | 1 | 5 | 32 | -167963 | 44.56 | 400 | 1.971499 |
| 10 | 1 | 10 | 0 | -37482.6 | 442.01 | 400 | 1 |
| 10 | 1 | 10 | 1 | -37483 | 425.12 | 400 | 1.03973 |
| 10 | 1 | 10 | 2 | -37483.3 | 312.45 | 400 | 1.414658 |
| 10 | 1 | 10 | 4 | -37482.8 | 243.71 | 400 | 1.813672 |
| 10 | 1 | 10 | 8 | -37483.2 | 194.33 | 400 | 2.274533 |
| 10 | 1 | 10 | 16 | -37482.8 | 153.41 | 400 | 2.881233 |
| 10 | 1 | 10 | 32 | -37482.4 | 118.33 | 400 | 3.735401 |
| 10 | 2 | 10 | 64 | -37482.6 | 101.33 | 400 | 4.362084 |
| 10 | 4 | 10 | 128 | -37483 | 92.86 | 400 | 4.759961 |
| 20 | 1 | 1 | 0 | 21260.740 +- 0.284 | 4928 | 400 | 1 |
| 20 | 1 | 1 | 1 | 21260.381 +- 0.399 | 4924.11 | 400 | 1.00079 |
| 20 | 1 | 1 | 2 | 21260.075 +- 0.351 | 3612.79 | 400 | 1.364043 |
| 20 | 1 | 1 | 4 | 21260.215 +- 0.564 | 1802.1 | 400 | 2.734587 |
| 20 | 1 | 1 | 8 | 21260.582 +- 0.473 | 1090.93 | 400 | 4.517247 |
| 20 | 1 | 1 | 16 | 21260.223 +- 0.185 | 662.38 | 400 | 7.439838 |
| 20 | 1 | 1 | 32 | 21259.973 +- 0.303 | 454.22 | 400 | 10.84937 |

JohannesBuchner commented 7 months ago

Yeah, I guess this is not unexpected. If you are at low dim, MLFriends is efficient and there is not that much to parallelize, and you have the communication overhead of MPI. If you are at high dim, the efficiency is low and there is more to parallelise, so you are seeing good scaling. If your likelihood was slower, the scaling would probably be closer to linear.
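
One quick way to test that last point, as an illustrative sketch rather than anything from the study above: pad the likelihood with an artificial delay and repeat the rank scan; the speed-ups should then move much closer to the number of ranks.

```python
import time
import numpy as np

def slow_loglike(params):
    # artificial ~1 ms of extra work per call, standing in for an expensive model
    time.sleep(1e-3)
    return -0.5 * np.sum(((params - 0.5) / 0.1) ** 2)
```

With per-call costs at the millisecond level, the MPI communication overhead becomes a small fraction of each iteration, so the parallel efficiency improves.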

For your actual application, the point of that link was: make sure you set the environment variable OMP_NUM_THREADS=1, to avoid over-parallelisation within each MPI process by the various libraries you may be using.
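
For reference, one way to enforce that from inside the script (an illustrative sketch; exporting the variable in the job script before mpiexec works just as well):

```python
import os

# pin OpenMP/BLAS to one thread per MPI process; this must happen before
# numpy/scipy (or anything linking against them) is first imported
os.environ.setdefault('OMP_NUM_THREADS', '1')

import numpy as np  # imported only after the thread count is pinned
```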

npaulson commented 7 months ago

For my application, I perform repeated inference in 3-4 dimensions. I think I might need to turn to approximate inference (e.g. pyro) to get the speed I need.

Thanks for the MPI suggestion and attention to my application!

JohannesBuchner commented 7 months ago

I think in this case it would be better to 1) avoid MPI within nested sampling (this can be achieved by uninstalling mpi4py) and 2) distribute the repeated inference directly instead (e.g., with separate processes).
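
For point 2, a sketch of one way to distribute the fits (hypothetical setup: independent fits indexed by i, with the actual likelihood and prior elided, and assuming mpi4py is not being pulled in, per point 1):

```python
import os
from concurrent.futures import ProcessPoolExecutor
import pymultinest

ndim = 3

def prior(cube, ndim, nparams):
    # uniform prior on the unit cube: leave the values as they are
    pass

def loglike(cube, ndim, nparams):
    # placeholder; in practice this would evaluate the i-th dataset's likelihood
    return -0.5 * sum(((cube[j] - 0.5) / 0.1) ** 2 for j in range(ndim))

def run_one(i):
    # separate output basename per fit, so the MultiNest files do not collide
    pymultinest.run(loglike, prior, ndim,
                    outputfiles_basename='chains/fit_%03d_' % i,
                    resume=False, verbose=False)
    return i

if __name__ == '__main__':
    os.makedirs('chains', exist_ok=True)
    with ProcessPoolExecutor(max_workers=8) as pool:
        list(pool.map(run_one, range(32)))
```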

gabrielastro commented 5 months ago

> I think in this case it would be better to 1) avoid MPI within nested sampling (this can be achieved by uninstalling mpi4py) and 2) distribute the repeated inference directly instead (e.g., with separate processes).

Actually, there is an easier way to deactivate a package than uninstalling it:

import sys
sys.modules['mpi4py'] = None

After this (run it before importing pymultinest, or anything else that tries to import mpi4py), the Python session will behave as if the mpi4py module were not available, but no files are changed on your computer.

JohannesBuchner commented 5 months ago

Huh, neat trick.