montechris1 opened this issue 1 year ago
Googling the error gives these:

- https://stackoverflow.com/questions/20023742/how-do-i-remove-the-memory-limit-on-openmpi-processes
- mpi4py/mpi4py#23 (https://github.com/mpi4py/mpi4py/issues/23)

Can you try upgrading MPI to >4 or using the upper case methods?
OK, thanks. I am going to upgrade MPI to >4 and do another run (this should run for several hours).
I'll keep you updated.
Best regards, Chris
Hi Johannes,
I am trying to run the code with Intel MPI:

$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
Copyright 2003-2023, Intel Corporation.

Do you think this is a version that supports MPI > 4.0?
By the way, from the links you sent, do you think it is an MPI problem or an mpi4py issue? I didn't grasp the link between the two, since I installed both with Intel's conda packages.
Does an upgrade of Intel MPI automatically imply an upgrade of mpi4py with the Intel conda package manager?
Best regards
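For reference, mpi4py itself can report which MPI library it is linked against and which MPI standard version that library implements, which helps separate an MPI problem from an mpi4py problem. A minimal check (plain mpi4py API, nothing Intel-specific):

from mpi4py import MPI

# Vendor string of the MPI library mpi4py was built against:
print(MPI.Get_library_version())
# MPI standard version implemented, as a (major, minor) tuple:
print(MPI.Get_version())

If Get_version() reports (3, 1) or lower, the linked library does not implement the MPI-4 large-count features.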
I have just finished the run with Intel MPI > 4 and the error persists.
So I am going to try the "upper case methods":
Instead of having, around line 363 of autoemcee.py:

if self.use_mpi:
    recv_chains = self.comm.gather(chains, root=0)
    chains = np.concatenate(self.comm.bcast(recv_chains, root=0))
should I put:

if self.use_mpi:
    recv_chains = self.comm.Gather(chains, root=0)
    chains = np.concatenate(self.comm.Bcast(recv_chains, root=0))

i.e. capitalize comm.gather and comm.bcast?
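Note that a plain capitalization of the calls is unlikely to work as-is: the upper-case collectives are buffer-based, fill a preallocated array in place, and return None rather than the received data. A minimal sketch contrasting both styles (an illustration, not autoemcee's actual code; the array shape is made up):

import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
size = comm.Get_size()

# Each rank holds one chain as a contiguous float64 array (hypothetical shape).
chains = np.random.rand(1000, 3)

# Lower case: pickle-based and convenient, but the serialized message size is
# counted with a C int internally, which is where very large gathers overflow.
recv_chains = comm.gather(chains, root=0)      # list of arrays on rank 0, None elsewhere
recv_chains = comm.bcast(recv_chains, root=0)  # now every rank holds the list
all_chains = np.concatenate(recv_chains)

# Upper case: buffer-based; requires a preallocated receive buffer of matching
# dtype and shape, and fills it in place instead of returning a value.
recvbuf = np.empty((size,) + chains.shape, dtype=chains.dtype)
comm.Gather(chains, recvbuf, root=0)  # recvbuf is only filled on rank 0
comm.Bcast(recvbuf, root=0)           # broadcast the filled buffer to every rank
all_chains2 = recvbuf.reshape(-1, chains.shape[1])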
Anyway, I am going to try this; the run is long, so I'll keep you updated.
Regards
It seems that Intel MPI does not have all the functionality of Open MPI > 4.0.
Here is the Intel MPI version that I have used up to now:

$ mpirun --version
Intel(R) MPI Library for Linux* OS, Version 2021.10 Build 20230619 (id: c2e19c2f3e)
Copyright 2003-2023, Intel Corporation.

That's why I am now doing a run with Open MPI > 4 instead of Intel MPI.
Regards
Description
I am using an MCMC (Markov chain Monte Carlo) code to test a model and see whether it is validated.
I launch this MCMC with MPI for Python on 64 processes.
The MCMC code is called autoemcee: https://johannesbuchner.github.io/autoemcee/index.html
Once convergence is reached, I get the following error, which seems to come from mpi4py:
What I Did
Here is the command that I executed:

$ time mpirun -np 64 python3.9 BD_MCMC_autoemcee_SENT_TO_DAVID_maybe_issue_of_units_18_JUIN_2023.py
I don't know what happened exactly.
Even if I decrease max_ncalls to 100000000 like this:

sampler.run(max_ncalls=100000000, rhat_max=1.00001)
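For context, here is a minimal sketch of how such a run is typically set up, assuming autoemcee's documented ReactiveAffineInvariantSampler API; the parameter names and toy Gaussian likelihood below are invented stand-ins, not the actual script:

import numpy as np
from autoemcee import ReactiveAffineInvariantSampler

param_names = ['a', 'b']

def loglike(params):
    # Toy Gaussian log-likelihood, standing in for the real model.
    return -0.5 * np.sum(params**2)

def transform(cube):
    # Map the unit cube to the parameter space (here: [-10, 10]).
    return cube * 20 - 10

sampler = ReactiveAffineInvariantSampler(param_names, loglike, transform=transform)
# Run until the Gelman-Rubin statistic is within 1e-5 of 1, or the call budget is exhausted:
result = sampler.run(max_ncalls=100000000, rhat_max=1.00001)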
I get a similar error. It seems to be related to the number of processes handled by mpi4py (line 363 of autoemcee.py):
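if self.use_mpi:
    recv_chains = self.comm.gather(chains, root=0)
    chains = np.concatenate(self.comm.bcast(recv_chains, root=0))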
but I can't explain why it fails.
It looks as if the code cannot gather all 64 MCMC chains, or cannot wait until every process has finished; it is weird.
One can wonder whether there is a big-integer/long conversion issue, or whether the rhat tolerance is set too small (1e-5 in my example).
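If the integer-overflow hypothesis is right, a quick back-of-the-envelope check shows whether the gathered payload crosses the 2**31 - 1 bytes that a C int can count; the chain dimensions below are placeholders, not the real run's values:

import numpy as np

nprocs = 64
nwalkers, nsteps, ndim = 100, 100000, 5  # placeholder chain dimensions
bytes_per_rank = nwalkers * nsteps * ndim * np.dtype(np.float64).itemsize
total_bytes = nprocs * bytes_per_rank
print(f"gathered payload: {total_bytes / 2**30:.1f} GiB; "
      f"overflows a C int: {total_bytes > 2**31 - 1}")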
Maybe someone can see what's wrong; I would be grateful for any clues or a workaround.