adrn / mpipool

A Python MPI Pool
MIT License

mpi_pool crashes with uninformative error message #5

Open rodluger opened 9 years ago

rodluger commented 9 years ago

I originally posted this on the emcee github but Dan suggested I try it here. I'm on Red Hat 6.6 with a newly installed Anaconda. I am running rather large parallel MCMC chains with ~100 dimensions, 500 walkers, and 100,000 steps, and I run into this error once or twice per run. Any ideas?

File "/usr/lusers/rodluger/anaconda/lib/python2.7/site-packages/emcee/mpi_pool.py", line 95, in wait
  File "run.py", line 92, in Run
    pool.wait()                                                                     
  File "/usr/lusers/rodluger/anaconda/lib/python2.7/site-packages/emcee/mpi_pool.py", line 95, in wait
        task = self.comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
    task = self.comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
  File "Comm.pyx", line 816, in mpi4py.MPI.Comm.recv (src/mpi4py.MPI.c:66815)
  File "Comm.pyx", line 816, in mpi4py.MPI.Comm.recv (src/mpi4py.MPI.c:66815)
task = self.comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
  File "Comm.pyx", line 816, in mpi4py.MPI.Comm.recv (src/mpi4py.MPI.c:66815)
    task = self.comm.recv(source=0, tag=MPI.ANY_TAG, status=status)
  File "Comm.pyx", line 816, in mpi4py.MPI.Comm.recv (src/mpi4py.MPI.c:66815)
  File "pickled.pxi", line 236, in mpi4py.MPI.PyMPI_recv (src/mpi4py.MPI.c:27858)
  File "pickled.pxi", line 236, in mpi4py.MPI.PyMPI_recv (src/mpi4py.MPI.c:27858)
  File "pickled.pxi", line 236, in mpi4py.MPI.PyMPI_recv (src/mpi4py.MPI.c:27858)
  File "pickled.pxi", line 236, in mpi4py.MPI.PyMPI_recv (src/mpi4py.MPI.c:27858)
mpi4py.MPI.Exceptionmpi4py.MPI.Exceptionmpi4py.MPI.Exception: : Other MPI error, error stack:
MPI_Probe(113).....................: MPI_Probe(src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0x7fff02e2cb50) failed
MPIDI_CH3I_Progress(432)...........: 
MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an unexpected message. 0 unexpected messages queued.
: Other MPI error, error stack:
MPI_Probe(113).....................: MPI_Probe(src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0x7fff2c96a520) failed
MPIDI_CH3I_Progress(432)...........: 
MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an unexpected message. 0 unexpected messages queued.
Other MPI error, error stack:
MPI_Probe(113).....................: MPI_Probe(src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0x7fff45daf9d0) failed
MPIDI_CH3I_Progress(432)...........: 
MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an unexpected message. 0 unexpected messages queued.
mpi4py.MPI.Exception: Other MPI error, error stack:
MPI_Probe(113).....................: MPI_Probe(src=0, tag=MPI_ANY_TAG, MPI_COMM_WORLD, status=0x7fff292b4220) failed
MPIDI_CH3I_Progress(432)...........: 
MPIDI_CH3_PktHandler_EagerSend(606): Failed to allocate memory for an unexpected message. 0 unexpected messages queued.
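
For context, run.py presumably drives the pool with emcee's standard MPIPool idiom, along the lines of the minimal sketch below (this is not the actual script, which isn't shown here; lnprob and the run parameters are placeholders):

import sys
import numpy as np
import emcee
from emcee.utils import MPIPool

def lnprob(theta):
    # placeholder log-probability; the real one is problem-specific
    return -0.5 * np.sum(theta ** 2)

pool = MPIPool()
if not pool.is_master():
    # worker ranks block here receiving tasks via comm.recv() --
    # this is the pool.wait() frame at the top of the traceback
    pool.wait()
    sys.exit(0)

ndim, nwalkers, nsteps = 100, 500, 100000
p0 = np.random.randn(nwalkers, ndim)
sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, pool=pool)
sampler.run_mcmc(p0, nsteps)

pool.close()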
adrn commented 8 years ago

@rodluger Sorry I didn't get a notification for this -- I suck!

Still having this issue? I've never seen this before but it'd be good to hear if you have any updates.

rodluger commented 8 years ago

Hi Adrian,

No worries! I put this on the back burner a while ago -- no updates on my end. I asked the folks in charge of the cluster here and they'd never seen the issue, either. My workaround was to save the chain progress every hour or so and just restart it from the last savepoint whenever that happened, so it's not a big deal if this doesn't get resolved.
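
The restart loop was roughly the sketch below (the checkpoint filename and hourly cadence are just placeholders, and p0, sampler, and nsteps are as in the driver sketch above):

import os
import time
import numpy as np

checkpoint = "chain_state.npy"

# resume from the last savepoint if one exists, otherwise start fresh
pos = np.load(checkpoint) if os.path.exists(checkpoint) else p0

last_save = time.time()
for pos, lnp, rstate in sampler.sample(pos, iterations=nsteps):
    if time.time() - last_save > 3600:   # save roughly every hour
        np.save(checkpoint, pos)         # latest walker positions
        last_save = time.time()

np.save(checkpoint, pos)                 # final savepoint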

Thanks! Rodrigo
