CobayaSampler / cobaya

Code for Bayesian Analysis
http://cobaya.readthedocs.io/en/latest/

MPI MCMC hanging #308

Closed vivianmiranda closed 1 year ago

vivianmiranda commented 1 year ago

Hello Jesus and Antony

I hope you are all well and safe. I am sending this message because I am facing difficulties with the new Cobaya.

Below, I attached the YAML file of a simple Gaussian likelihood - no CAMB (nothing fancy - this is pure Cobaya). When I run with four MPI walkers, usually three die, though which ones is random (see picture below), i.e., the sampler hangs. The problem only happens when I have >= 4 MPI walkers (up to 3 is fine - see picture at the end).

Screenshot 2023-08-16 at 11 21 44 PM

YAML file: EXAMPLE_MCMC8.txt (I had to convert the YAML file to text so I can upload it here).

I run Cobaya with the following command:

mpirun -n 4 --mca btl tcp,self --bind-to core --map-by numa:pe=1 cobaya-run EXAMPLE_MCMC8.yaml -f

The conda env that I run Cobaya in is the following:

 `conda create --name cobaya python=3.8 --quiet --yes \
   && conda install -n cobaya --quiet --yes  \
   'conda-forge::libgcc-ng=12.3.0' \
   'conda-forge::libstdcxx-ng=12.3.0' \
   'conda-forge::libgfortran-ng=12.3.0' \
   'conda-forge::gxx_linux-64=12.3.0' \
   'conda-forge::gcc_linux-64=12.3.0' \
   'conda-forge::gfortran_linux-64=12.3.0' \
   'conda-forge::openmpi=4.1.5' \
   'conda-forge::sysroot_linux-64=2.17' \
   'conda-forge::git=2.40.0' \
   'conda-forge::git-lfs=3.3.0' \
   'conda-forge::fftw=3.3.10' \
   'conda-forge::cfitsio=4.0.0' \
   'conda-forge::hdf5=1.14.0' \
   'conda-forge::lapack=3.9.0' \
   'conda-forge::openblas=0.3.23' \
   'conda-forge::gsl=2.7' \
   'conda-forge::cmake=3.26.4' \
   'conda-forge::xz==5.2.6' \
   'conda-forge::armadillo=11.4.4' \
   'conda-forge::boost-cpp=1.81.0' \
   'conda-forge::expat=2.5.0' \
   'conda-forge::cython=0.29.35' \
   'conda-forge::scipy=1.10.1' \
   'conda-forge::pandas=1.5.3' \
   'conda-forge::numpy=1.23.5' \
   'conda-forge::matplotlib=3.7.1' \
   'conda-forge::mpi4py=3.1.4'`

Any idea what may be going on? I will continue my investigation.

PS: this picture shows that if I run the same example with only 3 MPI walkers - the chain converges in a few seconds (so MPI is working)

Screenshot 2023-08-16 at 11 23 09 PM

PS2: This is Cobaya v3.3.1

cmbant commented 1 year ago

I tested your file with 4 procs on Windows and it seems to run fine. Could there be some problem with your MPI/cluster/network/run config for larger numbers of chains?

cmbant commented 1 year ago

I'm not sure why in my example I get a reported acceptance rate of 1 whereas your example with 3 does not. Did that have oversample_thin set? (I guess you were trying to test some speed hierarchy via this odd yaml setup?)

vivianmiranda commented 1 year ago

Yes - I do have oversample_thin, and I also see acceptance rate = 1 (sorry for not reporting this - this is something I've noticed for a long time).

The YAML file is odd because it was adapted from a YAML file I used on Cosmolike LSST chains where I do have a speed hierarchy on parameters.

Thanks a lot for testing my YAML. My bug is bizarre - I will investigate if this is due to the cluster or somehow the conda env I chose.

This is the original YAML I adapted this example from (and where I noted my bizarre bug) - https://github.com/SBU-COSMOLIKE/cocoa_lsst_y1/blob/main/EXAMPLE_MCMC1.yaml

cmbant commented 1 year ago

Your picture of the outputs above (with 3 mpi) reports normal acceptance rates though, not 1?

vivianmiranda commented 1 year ago

Good catch - but here is a Cosmolike LSST-Y1 chain I ran last night with only 3 MPI processes, to verify that my problem is caused only by four or more MPI cores.

Acceptance rate = 1

Screenshot 2023-08-17 at 7 16 25 AM
vivianmiranda commented 1 year ago

To be fair - in many of my chains, the acceptance doesn't start at 1 (I attached the beginning of an LSST-Y1 3x2pt chain) - but it slowly drifts towards 1

Screenshot 2023-08-17 at 7 22 27 AM
lukashergt commented 1 year ago

Hi both,

I just stumbled over this and thought I'd mention that I have seen this type of hanging, too. For me it happens when I run on all cores, e.g. with

mpirun --use-hwthread-cpus --map-by ppr:8:socket:pe=4 cobaya-run example.yaml

It always happens after

[0 : mcmc] All chains are ready to check convergence and learn a new proposal covmat

where rank 0 becomes stuck, not doing the convergence computation. All the other ranks keep going for me (so that is a little different from @vivianmiranda's case above, where only rank 1 continues).

I, too, can avoid this by leaving at least one core unused, e.g.

mpirun --use-hwthread-cpus --map-by ppr:7:socket:pe=4 cobaya-run example.yaml

will work.

If I do some rankfile wizardry and assign the leftover cpus to rank 0 such that all cpus are being used, it still works (with rank 0 being obviously faster than the others).

cmbant commented 1 year ago

No idea really... In this example, how many chains end up being run in the two cases? (I've never used --use-hwthread-cpus myself.) If it's reproducible, some lower level investigation of what line/loop it's stuck on would be helpful if possible.

JesusTorrado commented 1 year ago

I will give it a try. In any case, my tests in the past show that (at least for my CPU) running MPI processes in hyper-threads (what --use-hwthread-cpus does) is slower than just running on physical cores and leaving the hyper-threads for openMP parallelization, even for samplers like polychord that benefit from a lot of parallelization. Possibly due to hyper-threads sharing CPU cache at some level, and having to dump and load more frequently because the cache per process gets halved.

In any case, @vivianmiranda, can you confirm whether @lukashergt's insight fixes this for you?

lukashergt commented 1 year ago

The threads in my example above are used for openMP; the --use-hwthread-cpus flag isn't actually needed. The following commands behave equivalently (failing and working, respectively):

mpirun --map-by ppr:8:socket:pe=2 cobaya-run example.yaml

vs

mpirun --map-by ppr:7:socket:pe=2 cobaya-run example.yaml

> In any case, @vivianmiranda, can you confirm whether @lukashergt's insight fixes this for you?

Well, it's not really a fix; it's giving up computing power to make things work... and it seems like @vivianmiranda had a similar experience:

> The problem only happens when I have >= 4 MPI walkers (up to 3 is fine - see picture at the end).

lukashergt commented 1 year ago

> If it's reproducible, some lower level investigation of what line/loop it's stuck on would be helpful if possible.

Yes, it is reproducible for me. A full example with output is below. I've sprinkled in print statements that let me chase it down to cobaya/samplers/mcmc/mcmc.py in the MCMC.check_convergence_and_learn_proposal method. It computes mean, cov, and acceptance_rate but then does not go beyond the following line:

            Ns, means, covs, acceptance_rates = mpi.array_gather(
                [self.n(), mean, cov, acceptance_rate])

 

Example

Here is an example yaml file that lets me reproduce it both on my laptop and my PC (both running Manjaro Linux with gnu compilers and openMPI):

likelihood:
  gaussian3d: 'lambda x1, x2, x3: stats.multivariate_normal.logpdf((x1, x2, x3), mean=(0, 0, 0), cov=1)'
params:
  x1:
    prior:
      min: -6
      max: +6
    proposal: 0.5
  x2:
    prior:
      min: -6
      max: +6
    proposal: 0.5
  x3:
    prior:
      min: -6
      max: +6
    proposal: 0.5
sampler:
  mcmc:
    output_every: 10s
    learn_every: 200d
    drag: false

 

Failing

This command (which maps across all available resources) fails:

mpirun --display-map --map-by ppr:4:socket:pe=2 cobaya-run -f gaussian3d_mcmc.yaml

Here is the output:

 ========================   JOB MAP   ========================

 Data for node: nefertari   Num slots: 8    Max slots: 0    Num procs: 4
    Process OMPI jobid: [20520,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]:[BB/BB/../../../../../..]
    Process OMPI jobid: [20520,1] App: 0 Process rank: 1 Bound: socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]:[../../BB/BB/../../../..]
    Process OMPI jobid: [20520,1] App: 0 Process rank: 2 Bound: socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]:[../../../../BB/BB/../..]
    Process OMPI jobid: [20520,1] App: 0 Process rank: 3 Bound: socket 0[core 6[hwt 0-1]], socket 0[core 7[hwt 0-1]]:[../../../../../../BB/BB]

 =============================================================
[0 : output] Output to be read-from/written-into folder '.', with prefix 'gaussian3d_mcmc'
[0 : output] Found existing info files with the requested output prefix: 'gaussian3d_mcmc'
[0 : output] Will delete previous products ('force' was requested).
[0 : gaussian3d] Initialized external likelihood.
[2 : gaussian3d] Initialized external likelihood.
[3 : gaussian3d] Initialized external likelihood.
[1 : gaussian3d] Initialized external likelihood.
[2 : mcmc] Getting initial point... (this may take a few seconds)
[3 : mcmc] Getting initial point... (this may take a few seconds)
[3 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[1 : mcmc] Getting initial point... (this may take a few seconds)
[2 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[1 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[0 : mcmc] Getting initial point... (this may take a few seconds)
[0 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[0 : model] Measuring speeds... (this may take a few seconds)
[0 : model] Setting measured speeds (per sec): {gaussian3d: 3660.0}
[0 : mcmc] Initial point: x1:-2.114555, x2:4.464631, x3:-4.804316
[1 : mcmc] Initial point: x1:-1.222867, x2:-0.8936096, x3:-3.68863
[3 : mcmc] Initial point: x1:-4.679505, x2:1.763063, x3:-1.343873
[2 : mcmc] Initial point: x1:-4.854909, x2:-4.265411, x3:-0.7470197
[0 : mcmc] Covariance matrix not present. We will start learning the covariance of the proposal earlier: R-1 = 30 (would be 2 if all params loaded).
[0 : mcmc] Sampling!
[1 : mcmc] Progress @ 2023-09-25 08:28:05 : 1 steps taken, and 0 accepted.
[0 : mcmc] Progress @ 2023-09-25 08:28:05 : 1 steps taken, and 0 accepted.
[3 : mcmc] Progress @ 2023-09-25 08:28:05 : 1 steps taken, and 0 accepted.
[2 : mcmc] Progress @ 2023-09-25 08:28:05 : 1 steps taken, and 0 accepted.
[3 : mcmc] Learn + convergence test @ 600 samples accepted.
[3 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[2 : mcmc] Learn + convergence test @ 600 samples accepted.
[2 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] Learn + convergence test @ 600 samples accepted.
[0 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[1 : mcmc] Learn + convergence test @ 600 samples accepted.
[1 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] All chains are ready to check convergence and learn a new proposal covmat
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, use_first=307
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, mean=[-0.25438876  0.0979461   0.01182902]
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, cov[0]=[ 0.89045408  0.10958307 -0.18304003]
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, acceptance_rate=0.5202702702702703
[1 : mcmc] Learn + convergence test @ 1200 samples accepted.
[1 : mcmc] Learn + convergence test @ 1800 samples accepted.
[1 : mcmc] Learn + convergence test @ 2400 samples accepted.

[...]

[1 : mcmc] Learn + convergence test @ 29400 samples accepted.
[1 : mcmc] Learn + convergence test @ 30000 samples accepted.
[1 : mcmc] Learn + convergence test @ 30600 samples accepted.
[1 : mcmc] *ERROR* Waiting for too long for all chains to be ready. Maybe one of them is stuck or died unexpectedly?
[1 : mcmc] Aborting MPI due to error

 

Working

This command (which leaves some resources idle) works:

mpirun --display-map --map-by ppr:3:socket:pe=2 cobaya-run -f gaussian3d_mcmc.yaml

Here is the output:

 ========================   JOB MAP   ========================

 Data for node: nefertari   Num slots: 8    Max slots: 0    Num procs: 3
    Process OMPI jobid: [17030,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]:[BB/BB/../../../../../..]
    Process OMPI jobid: [17030,1] App: 0 Process rank: 1 Bound: socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]:[../../BB/BB/../../../..]
    Process OMPI jobid: [17030,1] App: 0 Process rank: 2 Bound: socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]:[../../../../BB/BB/../..]

 =============================================================
[0 : output] Output to be read-from/written-into folder '.', with prefix 'gaussian3d_mcmc'
[0 : output] Found existing info files with the requested output prefix: 'gaussian3d_mcmc'
[0 : output] Will delete previous products ('force' was requested).
[0 : gaussian3d] Initialized external likelihood.
[2 : gaussian3d] Initialized external likelihood.
[1 : gaussian3d] Initialized external likelihood.
[1 : mcmc] Getting initial point... (this may take a few seconds)
[1 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[2 : mcmc] Getting initial point... (this may take a few seconds)
[2 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[0 : mcmc] Getting initial point... (this may take a few seconds)
[0 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[0 : model] Measuring speeds... (this may take a few seconds)
[0 : model] Setting measured speeds (per sec): {gaussian3d: 6330.0}
[1 : mcmc] Initial point: x1:-5.866553, x2:4.781567, x3:-2.609341
[0 : mcmc] Initial point: x1:3.832811, x2:-0.2083741, x3:-2.232426
[2 : mcmc] Initial point: x1:3.198547, x2:-1.781943, x3:2.063676
[0 : mcmc] Covariance matrix not present. We will start learning the covariance of the proposal earlier: R-1 = 30 (would be 2 if all params loaded).
[0 : mcmc] *WARNING* The initial points are widely dispersed compared to the proposal covariance, it may take a long time to burn in (max dist to start mean: [6.889085583492769, 12.509642598266206, 5.979413546667012])
[0 : mcmc] Sampling!
[1 : mcmc] Progress @ 2023-09-25 08:47:49 : 1 steps taken, and 0 accepted.
[0 : mcmc] Progress @ 2023-09-25 08:47:49 : 1 steps taken, and 0 accepted.
[2 : mcmc] Progress @ 2023-09-25 08:47:49 : 1 steps taken, and 0 accepted.
[0 : mcmc] Learn + convergence test @ 600 samples accepted.
[0 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[1 : mcmc] Learn + convergence test @ 600 samples accepted.
[1 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[2 : mcmc] Learn + convergence test @ 600 samples accepted.
[2 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] All chains are ready to check convergence and learn a new proposal covmat
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, use_first=322
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, mean=[-0.11147499  0.08741611  0.17697065]
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, cov[0]=[ 0.85461559 -0.05014254 -0.18165039]
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, acceptance_rate=0.5278688524590164
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: end of `more_than_one_process` block
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `is_main_process` block
[0 : mcmc]  - Acceptance rate: 0.517 = avg([0.5278688524590164, 0.5134228187919463, 0.5076142131979695])
[0 : mcmc]  - Convergence of means: R-1 = 0.046282 after 1856 accepted steps = sum([644, 612, 600])
[0 : mcmc]  - Updated covariance matrix of proposal pdf.
[0 : mcmc] ----- done with `self.check_convergence_and_learn_proposal()`
[0 : mcmc] Learn + convergence test @ 1200 samples accepted.
[0 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[1 : mcmc] Learn + convergence test @ 1200 samples accepted.
[1 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[2 : mcmc] Learn + convergence test @ 1200 samples accepted.
[2 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] All chains are ready to check convergence and learn a new proposal covmat
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, use_first=641
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, mean=[0.01103656 0.04987475 0.07299755]
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, cov[0]=[ 0.98725365 -0.00992855  0.0395221 ]
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, acceptance_rate=0.3117704280155642
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: end of `more_than_one_process` block
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `is_main_process` block
[0 : mcmc]  - Acceptance rate: 0.312 = avg([0.3117704280155642, 0.32336255801959773, 0.3001500750375188])
[0 : mcmc]  - Convergence of means: R-1 = 0.009031 after 3736 accepted steps = sum([1282, 1254, 1200])
[0 : mcmc]  - Updated covariance matrix of proposal pdf.
[0 : mcmc] ----- done with `self.check_convergence_and_learn_proposal()`
[2 : mcmc] Learn + convergence test @ 1800 samples accepted.
[2 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] Learn + convergence test @ 1800 samples accepted.
[0 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[1 : mcmc] Learn + convergence test @ 1800 samples accepted.
[1 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] All chains are ready to check convergence and learn a new proposal covmat
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, use_first=934
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, mean=[0.05464783 0.00620509 0.03061425]
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, cov[0]=[ 1.01902591 -0.05120135  0.0180134 ]
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, acceptance_rate=0.29343386742067235
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: end of `more_than_one_process` block
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `is_main_process` block
[0 : mcmc]  - Acceptance rate: 0.298 = avg([0.29343386742067235, 0.29277813923227064, 0.3071381794368042])
[0 : mcmc]  - Convergence of means: R-1 = 0.015853 after 5543 accepted steps = sum([1868, 1800, 1875])
[0 : mcmc]  - Updated covariance matrix of proposal pdf.
[0 : mcmc] ----- done with `self.check_convergence_and_learn_proposal()`
[2 : mcmc] Learn + convergence test @ 2400 samples accepted.
[2 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] Learn + convergence test @ 2400 samples accepted.
[0 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[1 : mcmc] Learn + convergence test @ 2400 samples accepted.
[1 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] All chains are ready to check convergence and learn a new proposal covmat
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, use_first=1245
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, mean=[ 0.02273944 -0.03316428 -0.01299619]
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, cov[0]=[ 1.04698319  0.01961972 -0.03058848]
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, acceptance_rate=0.2978723404255319
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: end of `more_than_one_process` block
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `is_main_process` block
[0 : mcmc]  - Acceptance rate: 0.298 = avg([0.2978723404255319, 0.2894356005788712, 0.30709426627793973])
[0 : mcmc]  - Convergence of means: R-1 = 0.006581 after 7418 accepted steps = sum([2491, 2400, 2527])
[0 : mcmc]  - Updated covariance matrix of proposal pdf.
[0 : mcmc] ----- done with `self.check_convergence_and_learn_proposal()`
[2 : mcmc] Learn + convergence test @ 3000 samples accepted.
[2 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] Learn + convergence test @ 3000 samples accepted.
[0 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[1 : mcmc] Learn + convergence test @ 3000 samples accepted.
[1 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] All chains are ready to check convergence and learn a new proposal covmat
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, use_first=1550
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, mean=[-0.02159444 -0.02334458 -0.00277383]
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, cov[0]=[ 0.97367776  0.04242339 -0.0146344 ]
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `more_than_one_process` block, acceptance_rate=0.3031488362996284
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: end of `more_than_one_process` block
[0 : mcmc] ----- in `check_convergence_and_learn_proposal`: `is_main_process` block
[0 : mcmc]  - Acceptance rate: 0.302 = avg([0.3031488362996284, 0.3049400284610693, 0.2988527724665392])
[0 : mcmc]  - Convergence of means: R-1 = 0.002088 after 9225 accepted steps = sum([3100, 3000, 3125])
[0 : mcmc]  - Convergence of bounds: R-1 = 0.029496 after 9225 accepted steps = sum([3100, 3000, 3125])
[0 : mcmc] The run has converged!
[0 : mcmc] ----- done with `self.check_convergence_and_learn_proposal()`
[0 : mcmc] Sampling complete after 9225 accepted steps.
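As an aside, the "Convergence of means: R-1" lines in the log above are a Gelman-Rubin-style diagnostic on the per-chain means. A rough numpy sketch of the idea only (the function name and weighting details are illustrative, not Cobaya's actual code):

```python
import numpy as np

def r_minus_one_of_means(means, covs, ns):
    """Gelman-Rubin-style R-1 sketch: largest eigenvalue of the
    between-chain covariance of the chain means, measured in units of
    the (weighted) mean within-chain covariance."""
    means = np.asarray(means, dtype=float)  # shape (n_chains, n_params)
    covs = np.asarray(covs, dtype=float)    # shape (n_chains, n_params, n_params)
    ns = np.asarray(ns, dtype=float)        # accepted steps per chain
    mean_of_means = np.average(means, weights=ns, axis=0)
    diffs = means - mean_of_means
    # Weighted between-chain covariance of the means.
    between = np.einsum("c,ci,cj->ij", ns / ns.sum(), diffs, diffs)
    within = np.average(covs, weights=ns, axis=0)
    # Whiten the between-chain scatter by the within-chain covariance.
    L_inv = np.linalg.inv(np.linalg.cholesky(within))
    return np.linalg.eigvalsh(L_inv @ between @ L_inv.T).max()
```

When the chains agree, the between-chain scatter of the means vanishes relative to the within-chain covariance and R-1 goes to 0, which is why the run above stops once R-1 drops below the threshold.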
cmbant commented 1 year ago

Thanks! I guess you are using mpi_info to output the debug lines, so they are only written for rank 0? Looks as though for some reason check_convergence_and_learn_proposal is not being called on all the other ranks, so rank 0 waits forever for the data. In your example, is rank 1 the only one that continues?

I guess this is most likely an issue with the asynchronous Isend in state.set or iprobe in synch, though I have no idea why. You can set mpi.log to a logger instance to get some more MPI logging.

Does it work if you don't explicitly bind things?
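For pinning down the exact line a stuck rank is sitting on, one low-effort option (an illustrative suggestion, not a Cobaya feature) is Python's built-in faulthandler, which can dump every thread's traceback on a signal or after a timeout:

```python
import faulthandler
import signal
import sys

# Dump all thread tracebacks to stderr when the process receives SIGUSR1,
# e.g. run `kill -USR1 <pid>` against the rank that appears to hang.
# (Unix only; signal registration is not available on Windows.)
faulthandler.register(signal.SIGUSR1, file=sys.stderr)

# Alternatively, dump automatically if the process is still running
# after 600 seconds (repeat=False fires only once).
faulthandler.dump_traceback_later(600, repeat=False, file=sys.stderr)
```

With that in place one can signal the hanging rank and read off the frame it is blocked in; py-spy (`py-spy dump --pid <pid>`) gives the same information without modifying the code.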

lukashergt commented 1 year ago

> I guess you are using mpi_info to output the debug lines, so only written for rank 0?

Correct, re-done using self.log.info, see examples below.

> Looks as though for some reason check_convergence_and_learn_proposal is not being called on all the other ranks so rank 0 waits forever for the data.

Note how in the examples below the other ranks do compute Ns, means, covs, acceptance_rates. It is only rank 0 that ultimately does not. So check_convergence_and_learn_proposal is called on the other ranks, but rank 0 waits nonetheless.

> In your example, is rank 1 the only one that continues?

One or more of the other ranks continue; it is only rank 0 that stops. In my minimal gaussian example it seems to always be rank 1 that continues, but I have had other ranks continue before.

> Does it work if you don't explicitly bind things?

No, also happens when unbound, see failing example below.

Failing example

(py3115env)$ mpirun --display-map --map-by ppr:4:socket cobaya-run -f gaussian3d_mcmc.yaml
 Data for JOB [23079,1] offset 0 Total slots allocated 8

 ========================   JOB MAP   ========================

 Data for node: nefertari   Num slots: 8    Max slots: 0    Num procs: 4
    Process OMPI jobid: [23079,1] App: 0 Process rank: 0 Bound: UNBOUND
    Process OMPI jobid: [23079,1] App: 0 Process rank: 1 Bound: UNBOUND
    Process OMPI jobid: [23079,1] App: 0 Process rank: 2 Bound: UNBOUND
    Process OMPI jobid: [23079,1] App: 0 Process rank: 3 Bound: UNBOUND

 =============================================================
[0 : output] Output to be read-from/written-into folder '.', with prefix 'gaussian3d_mcmc'
[0 : output] Found existing info files with the requested output prefix: 'gaussian3d_mcmc'
[0 : output] Will delete previous products ('force' was requested).
[0 : gaussian3d] Initialized external likelihood.
[1 : gaussian3d] Initialized external likelihood.
[3 : gaussian3d] Initialized external likelihood.
[2 : gaussian3d] Initialized external likelihood.
[1 : mcmc] Getting initial point... (this may take a few seconds)
[1 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[3 : mcmc] Getting initial point... (this may take a few seconds)
[3 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[2 : mcmc] Getting initial point... (this may take a few seconds)
[2 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[0 : mcmc] Getting initial point... (this may take a few seconds)
[0 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[0 : model] Measuring speeds... (this may take a few seconds)
[0 : model] Setting measured speeds (per sec): {gaussian3d: 4880.0}
[1 : mcmc] Initial point: x1:-3.979083, x2:-0.2152203, x3:-0.9680011
[0 : mcmc] Initial point: x1:4.15595, x2:-3.426034, x3:-1.24349
[2 : mcmc] Initial point: x1:0.5625395, x2:-3.158911, x3:5.099773
[3 : mcmc] Initial point: x1:-2.697611, x2:0.7309915, x3:5.063762
[0 : mcmc] Covariance matrix not present. We will start learning the covariance of the proposal earlier: R-1 = 30 (would be 2 if all params loaded).
[0 : mcmc] Sampling!
[1 : mcmc] Progress @ 2023-09-27 13:21:00 : 1 steps taken, and 0 accepted.
[0 : mcmc] Progress @ 2023-09-27 13:21:00 : 1 steps taken, and 0 accepted.
[3 : mcmc] Progress @ 2023-09-27 13:21:00 : 1 steps taken, and 0 accepted.
[2 : mcmc] Progress @ 2023-09-27 13:21:00 : 1 steps taken, and 0 accepted.
[2 : mcmc] Learn + convergence test @ 600 samples accepted.
[2 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[3 : mcmc] Learn + convergence test @ 600 samples accepted.
[3 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] Learn + convergence test @ 600 samples accepted.
[0 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[1 : mcmc] Learn + convergence test @ 600 samples accepted.
[1 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] All chains are ready to check convergence and learn a new proposal covmat
[2 : mcmc] ----- in check_convergence_and_learn_proposal: mean = [ 0.11959765 -0.06699488 -0.12207036]
[0 : mcmc] ----- in check_convergence_and_learn_proposal: mean = [ 0.15766087 -0.22653836  0.23059055]
[3 : mcmc] ----- in check_convergence_and_learn_proposal: mean = [ 0.12888131 -0.23066238  0.0547595 ]
[0 : mcmc] ----- in check_convergence_and_learn_proposal: cov[0] = [ 0.94817878 -0.01719196 -0.00731992]
[2 : mcmc] ----- in check_convergence_and_learn_proposal: cov[0] = [ 1.05108926 -0.05385018 -0.00650687]
[0 : mcmc] ----- in check_convergence_and_learn_proposal: acceptance_rate = 0.5418894830659536
[2 : mcmc] ----- in check_convergence_and_learn_proposal: acceptance_rate = 0.5295081967213114
[3 : mcmc] ----- in check_convergence_and_learn_proposal: cov[0] = [ 0.94285667 -0.14810166  0.07280326]
[3 : mcmc] ----- in check_convergence_and_learn_proposal: acceptance_rate = 0.5261382799325464
[3 : mcmc] ----- in check_convergence_and_learn_proposal: Ns = [623]
[2 : mcmc] ----- in check_convergence_and_learn_proposal: Ns = [646]
[3 : mcmc] ----- in check_convergence_and_learn_proposal: means = [[ 0.12888131 -0.23066238  0.0547595 ]]
[2 : mcmc] ----- in check_convergence_and_learn_proposal: means = [[ 0.11959765 -0.06699488 -0.12207036]]
[1 : mcmc] Learn + convergence test @ 1200 samples accepted.
[1 : mcmc] Learn + convergence test @ 1800 samples accepted.

[...]

[1 : mcmc] Learn + convergence test @ 30600 samples accepted.
[1 : mcmc] *ERROR* Waiting for too long for all chains to be ready. Maybe one of them is stuck or died unexpectedly?
[1 : mcmc] Aborting MPI due to error

Working example

(py3115env) [lukas@nefertari gaussian]$ mpirun --display-map --map-by ppr:3:socket:pe=2 cobaya-run -f gaussian3d_mcmc.yaml
 Data for JOB [24574,1] offset 0 Total slots allocated 8

 ========================   JOB MAP   ========================

 Data for node: nefertari   Num slots: 8    Max slots: 0    Num procs: 3
    Process OMPI jobid: [24574,1] App: 0 Process rank: 0 Bound: socket 0[core 0[hwt 0-1]], socket 0[core 1[hwt 0-1]]:[BB/BB/../../../../../..]
    Process OMPI jobid: [24574,1] App: 0 Process rank: 1 Bound: socket 0[core 2[hwt 0-1]], socket 0[core 3[hwt 0-1]]:[../../BB/BB/../../../..]
    Process OMPI jobid: [24574,1] App: 0 Process rank: 2 Bound: socket 0[core 4[hwt 0-1]], socket 0[core 5[hwt 0-1]]:[../../../../BB/BB/../..]

 =============================================================
[0 : output] Output to be read-from/written-into folder '.', with prefix 'gaussian3d_mcmc'
[0 : output] Found existing info files with the requested output prefix: 'gaussian3d_mcmc'
[0 : output] Will delete previous products ('force' was requested).
[0 : gaussian3d] Initialized external likelihood.
[1 : gaussian3d] Initialized external likelihood.
[2 : gaussian3d] Initialized external likelihood.
[1 : mcmc] Getting initial point... (this may take a few seconds)
[1 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[2 : mcmc] Getting initial point... (this may take a few seconds)
[2 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[0 : mcmc] Getting initial point... (this may take a few seconds)
[0 : prior] Reference values or pdfs for some parameters were not provided. Sampling from the prior instead for those parameters.
[0 : model] Measuring speeds... (this may take a few seconds)
[0 : model] Setting measured speeds (per sec): {gaussian3d: 5520.0}
[2 : mcmc] Initial point: x1:-3.897728, x2:-4.78303, x3:-4.526369
[1 : mcmc] Initial point: x1:-3.735137, x2:-2.215487, x3:1.189426
[0 : mcmc] Initial point: x1:3.635579, x2:-3.021058, x3:3.895426
[0 : mcmc] Covariance matrix not present. We will start learning the covariance of the proposal earlier: R-1 = 30 (would be 2 if all params loaded).
[0 : mcmc] Sampling!
[2 : mcmc] Progress @ 2023-09-27 13:27:18 : 1 steps taken, and 0 accepted.
[0 : mcmc] Progress @ 2023-09-27 13:27:18 : 1 steps taken, and 0 accepted.
[1 : mcmc] Progress @ 2023-09-27 13:27:18 : 1 steps taken, and 0 accepted.
[1 : mcmc] Learn + convergence test @ 600 samples accepted.
[1 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[2 : mcmc] Learn + convergence test @ 600 samples accepted.
[2 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] Learn + convergence test @ 600 samples accepted.
[0 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] All chains are ready to check convergence and learn a new proposal covmat
[0 : mcmc] ----- in check_convergence_and_learn_proposal: mean = [-0.02972704 -0.11121315 -0.05255849]
[2 : mcmc] ----- in check_convergence_and_learn_proposal: mean = [-0.15458871  0.14742725  0.06774566]
[1 : mcmc] ----- in check_convergence_and_learn_proposal: mean = [0.00941894 0.07194892 0.1138493 ]
[0 : mcmc] ----- in check_convergence_and_learn_proposal: cov[0] = [ 1.01846678e+00 -5.59099288e-04  1.48497610e-01]
[0 : mcmc] ----- in check_convergence_and_learn_proposal: acceptance_rate = 0.5102040816326531
[2 : mcmc] ----- in check_convergence_and_learn_proposal: cov[0] = [ 9.56064357e-01 -1.17224796e-01  6.95635452e-04]
[2 : mcmc] ----- in check_convergence_and_learn_proposal: acceptance_rate = 0.5472370766488414
[1 : mcmc] ----- in check_convergence_and_learn_proposal: cov[0] = [ 1.08986011 -0.01241736  0.05298533]
[1 : mcmc] ----- in check_convergence_and_learn_proposal: acceptance_rate = 0.5107438016528926
[2 : mcmc] ----- in check_convergence_and_learn_proposal: Ns = [613]
[2 : mcmc] ----- in check_convergence_and_learn_proposal: means = [[-0.15458871  0.14742725  0.06774566]]
[1 : mcmc] ----- in check_convergence_and_learn_proposal: Ns = [618]
[1 : mcmc] ----- in check_convergence_and_learn_proposal: means = [[0.00941894 0.07194892 0.1138493 ]]
[0 : mcmc] ----- in check_convergence_and_learn_proposal: Ns = [600 618 613]
[0 : mcmc] ----- in check_convergence_and_learn_proposal: means = [[-0.02972704 -0.11121315 -0.05255849]
 [ 0.00941894  0.07194892  0.1138493 ]
 [-0.15458871  0.14742725  0.06774566]]
[0 : mcmc]  - Acceptance rate: 0.523 = avg([0.5102040816326531, 0.5107438016528926, 0.5472370766488414])
[0 : mcmc]  - Convergence of means: R-1 = 0.025534 after 1831 accepted steps = sum([600, 618, 613])
[0 : mcmc]  - Updated covariance matrix of proposal pdf.
[2 : mcmc] Learn + convergence test @ 1200 samples accepted.
[2 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[0 : mcmc] Learn + convergence test @ 1200 samples accepted.
[0 : mcmc] Ready to check convergence and learn a new proposal covmat (waiting for the rest...)
[1 : mcmc] Learn + convergence test @ 1200 samples accepted.

[...]

[0 : mcmc] The run has converged!
[0 : mcmc] Sampling complete after 7353 accepted steps.
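The "Convergence of means: R-1" line in the log above comes from a Gelman-Rubin-style comparison of the per-chain means against the within-chain covariance. As a rough illustration only (not Cobaya's exact implementation, which differs in weighting and detail), such a statistic could be computed as below; `r_minus_one_of_means`, the sample-count weighting, and the toy chains are all illustrative choices:

```python
import numpy as np

def r_minus_one_of_means(means, cov, Ns):
    """Crude Gelman-Rubin-style statistic for the chain means.

    means: (n_chains, n_params) per-chain parameter means
    cov:   (n_params, n_params) pooled within-chain covariance
    Ns:    per-chain accepted sample counts
    Sketch only: Cobaya's actual convergence check differs in detail.
    """
    means = np.asarray(means, dtype=float)
    Ns = np.asarray(Ns, dtype=float)
    # Sample-count-weighted grand mean over chains
    grand = np.average(means, axis=0, weights=Ns)
    # Between-chain covariance of the means
    diffs = means - grand
    B = (diffs.T * Ns) @ diffs / Ns.sum()
    # Whiten B by the within-chain covariance; the largest eigenvalue
    # of cov^{-1} B serves as a scalar "R-1 of means"
    L = np.linalg.cholesky(cov)
    M = np.linalg.solve(L, np.linalg.solve(L, B).T)
    return float(np.max(np.linalg.eigvalsh(M)))

# Toy usage with three well-mixed chains: the statistic should be small
rng = np.random.default_rng(0)
chains = [rng.standard_normal((600, 3)) for _ in range(3)]
means = [c.mean(axis=0) for c in chains]
cov = np.mean([np.cov(c.T) for c in chains], axis=0)
print(r_minus_one_of_means(means, cov, [len(c) for c in chains]))
```

With well-mixed chains the between-chain term is tiny relative to the within-chain covariance; chains stuck in different regions inflate it and keep R-1 large.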
cmbant commented 1 year ago

Thanks. So my interpretation of the log is that rank 1 is the last one to be ready, but then rank 1 never gets past all_ready and just continues computing without ever calling check_convergence_and_learn_proposal (where the other ranks wait).

So the question is why rank 1 never thinks all states are READY, even though READY has been sent from all ranks. Setting `mpi.log` to a logger may help get a handle on that.
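For anyone wanting to try that, a minimal pattern would be to configure a verbose logger before running. The logger name `cobaya.mpi` and the module-level `log` attribute are assumptions based on the module mentioned above, not a documented API:

```python
import logging
import sys

# Verbose logger to trace the MPI state exchange; the name "cobaya.mpi"
# is an assumption based on the module mentioned in the comment above.
handler = logging.StreamHandler(sys.stderr)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(name)s %(levelname)s: %(message)s"))
mpi_log = logging.getLogger("cobaya.mpi")
mpi_log.setLevel(logging.DEBUG)
mpi_log.addHandler(handler)

# If cobaya.mpi.log is a plain module attribute rather than a logger
# picked up by name, assign it directly instead (hypothetical):
#   import cobaya.mpi
#   cobaya.mpi.log = mpi_log
```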

It's presumably something to do with async handling and how progress threads are assigned CPU. I can now see something similar on Windows with enough chains.
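One generic way to make this kind of hang diagnosable, whatever the underlying MPI progress behaviour, is to replace a one-shot blocking check with an actively re-polling wait under a timeout, which is essentially what Cobaya's "Waiting for too long for all chains to be ready" error at the top of this thread does. A minimal, MPI-free sketch of the pattern (all names here are illustrative, not Cobaya's API):

```python
import threading
import time

class WaitTimeout(RuntimeError):
    pass

def wait_until(predicate, timeout_s, poll_s=0.05):
    """Poll `predicate` until it is true, or raise after `timeout_s`.

    Actively re-polling (rather than checking once and blocking) both
    gives an async transport a chance to progress pending messages and
    turns a silent deadlock into a diagnosable error, like the
    "Waiting for too long for all chains to be ready" message above.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)
    raise WaitTimeout("Waiting for too long for all chains to be ready.")

# Toy usage: a "READY" flag that becomes true shortly after we start waiting
state = {"ready": False}
threading.Timer(0.1, lambda: state.update(ready=True)).start()
assert wait_until(lambda: state["ready"], timeout_s=2.0)
```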

cmbant commented 1 year ago

This is odd, but can you try https://github.com/CobayaSampler/cobaya/pull/317 which seems to fix a similar issue on my system?

lukashergt commented 1 year ago

> Thanks. So my interpretation of the log is that rank 1 is the last one to be ready, but then rank 1 never gets past all_ready and just continues computing without ever calling check_convergence_and_learn_proposal (where the other ranks wait).

Nice catch. I ran it a few more times and managed to get other ranks to hang. Indeed, the last rank to become ready never logs the "All chains are ready" message.

lukashergt commented 1 year ago

> This is odd, but can you try #317 which seems to fix a similar issue on my system?

Wow, nice, this seems to fix things, can't get it to hang anymore :)

cmbant commented 1 year ago

Great, thanks for the detailed reports.



vivianmiranda commented 1 year ago

Wow - thanks for all the work!

In my case I was able to fix it by switching to the vader BTL (`--mca btl vader,tcp,self`) when the chains run on a single node - but I will definitely check out the new fix.