JohannesBuchner / autoemcee

Run MCMC automatically to convergence

Issue about mpi4py with MCMC autoemcee library #2

Open montechris1 opened 11 months ago

montechris1 commented 11 months ago

Description

I am using an MCMC (Markov chain Monte Carlo) code to test a model and check whether it is validated.

I launch this MCMC with MPI for Python using 64 processors.

The MCMC code is called autoemcee: https://johannesbuchner.github.io/autoemcee/index.html

Once convergence is reached, I get the following error, which seems to come from mpi4py:

[Screenshot, 2023-08-14 10:01:22]

What I Did

Here is the command that I executed:

$ time mpirun -np 64 python3.9 BD_MCMC_autoemcee_SENT_TO_DAVID_maybe_issue_of_units_18_JUIN_2023.py

I don't know what happened exactly.

Even if I decrease max_ncalls to 100000000 like this:

sampler.run(max_ncalls=100000000, rhat_max=1.00001)

I get a similar error. It seems to be related to the number of processes handled by mpi4py (line 363 of autoemcee.py):

[Screenshot, 2023-08-12 15:45:42]

but I can't explain why it fails.

It looks as if the code either cannot gather all 64 MCMC chains or does not wait until all processes have stopped, which is strange.
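One way to reason about a failing gather of 64 chains is to estimate how many bytes arrive at the root process. mpi4py's lower-case methods pickle the object they send, and MPI versions before 4.0 express message counts as 32-bit signed integers, so a very large gathered payload is one plausible (unconfirmed) cause. The sketch below is only a back-of-the-envelope check; the chain dimensions are hypothetical and not taken from this issue.

```python
# Largest count representable by a 32-bit signed integer.
INT32_MAX = 2**31 - 1


def gathered_bytes(n_ranks, n_walkers, n_steps, n_dim, itemsize=8):
    """Approximate bytes reaching the root when every rank sends one
    float64 chain array of shape (n_walkers, n_steps, n_dim)."""
    return n_ranks * n_walkers * n_steps * n_dim * itemsize


def exceeds_int32(n_bytes):
    """True if a byte count cannot be stored in a 32-bit signed integer."""
    return n_bytes > INT32_MAX


# Hypothetical numbers for illustration only (not from the issue).
total = gathered_bytes(n_ranks=64, n_walkers=100, n_steps=100_000, n_dim=10)
print(total, exceeds_int32(total))
```

If the estimate for the real run is well below the limit, the integer-overflow explanation becomes less likely.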

One may wonder whether there is a BigInteger/long conversion issue, or whether the rhat threshold is too tight (rhat_max=1.00001, i.e. within 1e-5 of 1, in my example).
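For context on the rhat question: rhat_max is a threshold on a Gelman-Rubin style convergence statistic, which approaches 1 as independent chains agree. The following is a minimal sketch of the classic form of that statistic for a single scalar parameter. It is an assumption that this matches what autoemcee computes; the library may use a different (for example rank-normalized) variant, and `gelman_rubin` is a hypothetical helper name.

```python
import random
import statistics


def gelman_rubin(chains):
    """Classic Gelman-Rubin R-hat for a list of equal-length scalar chains."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    # Mean within-chain variance.
    within = statistics.fmean([statistics.variance(c) for c in chains])
    # Between-chain variance, scaled by the chain length.
    between = n * statistics.variance(means)
    var_hat = (n - 1) / n * within + between / n
    return (var_hat / within) ** 0.5


# Illustration with synthetic, independent draws (not real MCMC output).
random.seed(0)
agreeing = [[random.gauss(0, 1) for _ in range(2000)] for _ in range(4)]
shifted = [[random.gauss(mu, 1) for _ in range(2000)] for mu in (0, 5)]
print(gelman_rubin(agreeing), gelman_rubin(shifted))
```

Commonly quoted thresholds are around 1.01 to 1.1, so a requirement of 1.00001 is very strict and mainly makes the run much longer; by itself it would not be expected to produce an MPI error.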

Maybe someone could see what's wrong, I would be grateful to have any clues or workaround.

JohannesBuchner commented 11 months ago

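Googling the error gives these:

- https://stackoverflow.com/questions/20023742/how-do-i-remove-the-memory-limit-on-openmpi-processes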

Can you try upgrading MPI to >4 or using the upper case methods?

montechris1 commented 11 months ago

OK, thanks. I am going to upgrade MPI to > 4 and do another run (it should take several hours).

I'll keep you updated.

Best regards, Chris

On Mon, 14 Aug 2023 at 13:18, Johannes Buchner @.***> wrote:

Googling the error gives these:

- https://stackoverflow.com/questions/20023742/how-do-i-remove-the-memory-limit-on-openmpi-processes

Can you try upgrading MPI to >4 or using the upper case methods?


montechris1 commented 11 months ago

Hi Johannes,

I am trying to run the code with Intel MPI:

$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
Copyright 2003-2023, Intel Corporation.

Do you think this version supports MPI > 4.0?

By the way, judging from the links you sent, do you think this is an MPI problem or an mpi4py problem? I don't see how the two are connected, since I installed both with Intel Conda.

Does upgrading Intel MPI automatically upgrade Intel's mpi4py as well when using the Intel conda package manager?
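One way to check which MPI the installed mpi4py is actually using is to ask mpi4py itself. The sketch below relies on `MPI.Get_version()`, which returns the (major, minor) version of the MPI standard that the library implements, and `MPI.Get_library_version()`, which returns the vendor's description string. Note that a product version such as 2021.10 is not the same thing as the MPI standard version. The helper names `implements_mpi4` and `report` are hypothetical.

```python
def implements_mpi4(version):
    """True if a (major, minor) MPI standard version is at least 4.0."""
    return tuple(version) >= (4, 0)


def report():
    """Print the MPI standard and library that mpi4py is linked against."""
    try:
        from mpi4py import MPI
    except ImportError:
        print("mpi4py is not installed in this environment")
        return None
    version = MPI.Get_version()
    print("MPI standard version:", version)
    print("MPI library:", MPI.Get_library_version().strip())
    return implements_mpi4(version)


if __name__ == "__main__":
    report()
```

Running this with the same python3.9 interpreter used for the MCMC shows whether a newly installed MPI is really the one mpi4py picks up.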

Best regards

montechris1 commented 11 months ago

I have just finished the run with Intel MPI > 4 and the error persists.

So I am going to try the "upper case methods".

Instead of the following, around line 363 of autoemcee.py:

    if self.use_mpi:
        recv_chains = self.comm.gather(chains, root=0)
        chains = np.concatenate(self.comm.bcast(recv_chains, root=0))

should I write:

    if self.use_mpi:
        recv_chains = self.comm.Gather(chains, root=0)
        chains = np.concatenate(self.comm.Bcast(recv_chains, root=0))

?

That is, should I simply capitalize comm.gather and comm.bcast?

Anyway, I am going to try this. The run is long; I'll keep you updated.
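Capitalizing the method names alone is probably not enough. In mpi4py the upper-case methods (`Gather`, `Allgather`, `Bcast`) work on buffer-like objects such as NumPy arrays, need a preallocated receive buffer, and fill it in place instead of returning a value. Below is a minimal sketch, assuming every rank holds a float64 chain array of identical shape; `gather_chains` is a hypothetical helper and not part of autoemcee.

```python
import numpy as np


def gather_chains(chains, comm=None):
    """Return the chains of all ranks, concatenated along axis 0, on every rank.

    Assumes each rank passes an array with the same shape (at least 1-D).
    With comm=None or a single process, the input is returned unchanged.
    """
    chains = np.ascontiguousarray(chains, dtype=np.float64)
    if comm is None or comm.Get_size() == 1:
        return chains
    size = comm.Get_size()
    # One slot per rank; Allgather fills this buffer in place on all ranks.
    recv = np.empty((size,) + chains.shape, dtype=np.float64)
    comm.Allgather(chains, recv)
    # Merging the first two axes is equivalent to np.concatenate over ranks.
    return recv.reshape((size * chains.shape[0],) + chains.shape[1:])


# Usage under mpirun (assumes mpi4py is installed):
#   from mpi4py import MPI
#   chains = gather_chains(chains, MPI.COMM_WORLD)
```

`Allgather` delivers the result to every rank, so the separate broadcast step is not needed here. If the per-rank shapes can differ, a variable-size call such as `Allgatherv` would be required instead.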

Regards

montechris1 commented 11 months ago

It seems that Intel MPI does not have all the functionality of Open MPI > 4.0.

Here is the Intel MPI version that I have used up to now:

$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
Copyright 2003-2023, Intel Corporation.

That is why I am now doing a run with Open MPI > 4 instead of Intel MPI.

Regards