adrn / schwimmbad

A common interface to processing pools.
MIT License

use send rather than ssend to avoid lockup #52

dstndstn closed 2 months ago

dstndstn commented 6 months ago

Hi,

I'm using Ubuntu 20.04, with the OS openmpi package, mpi4py 3.1.5, and schwimmbad 0.3.2. This is on the "symmetry" cluster at Perimeter Institute.

The behavior I'm seeing is that when creating an MPIPool(), I see each worker getting one task, it finishes the task and sends the result back, and the boss receives the result, but the workers never proceed to the next task.

Via some sophisticated printf debugging, I found that the workers were never returning from the self.comm.ssend() call. My wise colleague suggested changing that to self.comm.send(), and then it works perfectly!

I don't think you need any of the synchronization implied by ssend, so this should be fine?
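To make the proposed change concrete, here is a minimal sketch of a worker loop of the kind described above. It is not schwimmbad's actual code: `worker_loop` and `LoopbackComm` are hypothetical names used only for illustration, and the `None` sentinel is an assumption. The only mpi4py calls referenced are `comm.recv`, `comm.send`, and `comm.ssend`. The stand-in communicator lets the loop run in a single process without MPI.

```python
# Hypothetical sketch of a pool worker loop; not schwimmbad's actual code.
# `comm` is any object with mpi4py-style send/recv, e.g. MPI.COMM_WORLD.

def worker_loop(comm, func, master=0):
    """Receive tasks from `master` until None arrives; return the task count."""
    handled = 0
    while True:
        task = comm.recv(source=master)
        if task is None:  # assumed sentinel meaning "no more work"
            return handled
        result = func(task)
        # Standard-mode send: allowed to return once the message is buffered.
        # comm.ssend(result, dest=master) would instead not complete until
        # the master has started the matching receive.
        comm.send(result, dest=master)
        handled += 1


class LoopbackComm:
    """In-process stand-in for a communicator (an assumption, for demos only)."""

    def __init__(self, tasks):
        self.inbox = list(tasks) + [None]  # queued tasks, then the sentinel
        self.outbox = []                   # results "sent" back to the master

    def recv(self, source=0):
        return self.inbox.pop(0)

    def send(self, obj, dest=0):
        self.outbox.append(obj)


demo = LoopbackComm([1, 2, 3])
count = worker_loop(demo, lambda x: x * x)
print(count, demo.outbox)
```

Under a real MPI launch, `MPI.COMM_WORLD` from mpi4py would take the place of `LoopbackComm` on each worker rank. Whether `ssend` is needed depends on whether anything relies on the worker knowing that the master has begun receiving the result; the sketch assumes nothing does.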

My system details:

$ mpiexec --version
mpiexec (OpenRTE) 4.0.3
$ ls -l $(which mpiexec)
lrwxrwxrwx 1 root root 25 Aug 15 2023 /usr/bin/mpiexec -> /etc/alternatives/mpiexec
$ ls -l /etc/alternatives/mpiexec
lrwxrwxrwx 1 root root 24 Aug 15 2023 /etc/alternatives/mpiexec -> /usr/bin/mpiexec.openmpi

adrn commented 5 months ago

Hey! I'm just getting back to work from parental leave, but I'll take a look at this within the next few weeks. Thanks for this!

adrn commented 2 months ago

Whoops, where did those months go? Thanks for the patience -- I haven't seen the issue you described, but I also don't know why this was using ssend to begin with (it probably traces back to ye olde MPIPool implementation in emcee, where some of this all started...). So I'm fine with changing it to the more standard send! Thanks for catching.