Parallel-in-Time / PFASST

PFASST++ is a C++ implementation of the "parallel full approximation scheme in space and time" (PFASST) algorithm
http://www.parallelintime.org/PFASST/

PFASST hangs in random deadlocks #210

Closed. torbjoernk closed this issue 9 years ago.

torbjoernk commented 9 years ago

Sometimes, PFASST hangs in random deadlocks.

For example with

mpiexec -np 8 examples/advection_diffusion/mpi_pfasst --spatial_dofs 4096 --dt 0.01 --tend 2.56 --num_iter 30 --abs_res_tol 1e-10 -c

it usually hangs around time step 86 and with

mpiexec -np 4 examples/advection_diffusion/mpi_pfasst --spatial_dofs 4096 --dt 0.01 --tend 25.6 --num_iter 30 --abs_res_tol 1e-10 -c

it hangs around time step 416.

memmett commented 9 years ago

@torbjoernk Which MPI implementation are you using? Let me guess: OpenMPI?

pancetta commented 9 years ago

Seems to appear on JUQUEEN as well, no OpenMPI here I think.


memmett commented 9 years ago

@pancetta Do you have an llsubmit script lying around that I can use to test this?

pancetta commented 9 years ago

@torbjoernk could you pack one of these setups and send it to @memmett?


memmett commented 9 years ago

@torbjoernk while you are at it, could you also fix this bug and let @danielru know when you are done? I'm sure @pancetta will be pleased. Yours truly, @memmett.

pancetta commented 9 years ago

+1


danielru commented 9 years ago

I am following this... No idea if this helps in any way, but I remember seeing similar behavior in my Fortran Parareal quite some time ago. If I remember correctly, it had something to do with the MPI request handles for non-blocking sends and receives: I stored them in arrays but used only the first N entries, and MPI did not like that. Unfortunately I can't remember the details.

This was in Fortran, though, so I have no idea whether it is of any help here.
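
For illustration, a generic MPI sketch (not PFASST or Parareal code; all names are made up) of the request-array pitfall described above: MPI_Waitall inspects every entry it is given, so a partially filled array must either be padded with MPI_REQUEST_NULL or be waited on only up to the number of requests actually posted.

#include <mpi.h>
#include <vector>

void exchange(const std::vector<double>& sendbuf, std::vector<double>& recvbuf,
              int left, int right, MPI_Comm comm)
{
  // Capacity chosen up front; unused slots default to MPI_REQUEST_NULL,
  // which MPI_Waitall treats as already completed.
  std::vector<MPI_Request> requests(8, MPI_REQUEST_NULL);

  int n = 0;  // number of requests actually posted
  MPI_Irecv(recvbuf.data(), static_cast<int>(recvbuf.size()), MPI_DOUBLE,
            left, 0, comm, &requests[n++]);
  MPI_Isend(sendbuf.data(), static_cast<int>(sendbuf.size()), MPI_DOUBLE,
            right, 0, comm, &requests[n++]);

  // Waiting on the first n entries (or on the fully padded array) is fine;
  // waiting on uninitialized trailing entries is undefined behavior and can
  // show up as hangs or leaked handles.
  MPI_Waitall(n, requests.data(), MPI_STATUSES_IGNORE);
}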

torbjoernk commented 9 years ago

So I've done a little debugging with a custom installation of MPICH that has a few additional debug symbols enabled. Both of the above setups do terminate correctly, but at the end I'm getting messages like the following:

In direct memory block for handle type REQUEST, 8 handles are still allocated
In indirect memory block 0 for handle type REQUEST, 330 handles are still allocated
In direct memory block for handle type COMM, 1 handles are still allocated
In direct memory block for handle type REQUEST, 8 handles are still allocated
In indirect memory block 0 for handle type REQUEST, 25 handles are still allocated
In direct memory block for handle type COMM, 1 handles are still allocated
In direct memory block for handle type REQUEST, 1 handles are still allocated
In direct memory block for handle type COMM, 2 handles are still allocated
In direct memory block for handle type REQUEST, 8 handles are still allocated
In indirect memory block 0 for handle type REQUEST, 25 handles are still allocated
In direct memory block for handle type COMM, 1 handles are still allocated
In direct memory block for handle type REQUEST, 8 handles are still allocated
In indirect memory block 0 for handle type REQUEST, 55 handles are still allocated
In direct memory block for handle type COMM, 1 handles are still allocated
In direct memory block for handle type REQUEST, 8 handles are still allocated
In indirect memory block 0 for handle type REQUEST, 25 handles are still allocated
In direct memory block for handle type COMM, 1 handles are still allocated
In direct memory block for handle type REQUEST, 8 handles are still allocated
In indirect memory block 0 for handle type REQUEST, 25 handles are still allocated
In direct memory block for handle type COMM, 1 handles are still allocated

There are a couple of non-blocking sends and/or receives which are never completed. I'm now trying to track these down.
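
A hypothetical helper (not part of PFASST; the names are assumed) that counts how many posted non-blocking requests never completed. Requests that MPI_Test still reports as incomplete at shutdown are exactly the handles MPICH complains about above.

#include <mpi.h>
#include <vector>

int count_incomplete(std::vector<MPI_Request>& posted)
{
  int open = 0;
  for (MPI_Request& req : posted) {
    if (req == MPI_REQUEST_NULL) { continue; }  // already completed and freed
    int done = 0;
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);   // frees the handle if it finished in the meantime
    if (!done) { ++open; }
  }
  return open;
}

// Log the result on each rank just before MPI_Finalize() to see which
// requests were left dangling.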

torbjoernk commented 9 years ago

Some more debugging data. I logged every status communication call (send and receive):

$> grep -c 'sending converged status' mpi-rank-000*.log
mpi-rank-0000.log:91
mpi-rank-0001.log:101
mpi-rank-0002.log:107
mpi-rank-0003.log:169
mpi-rank-0004.log:174
mpi-rank-0005.log:220
mpi-rank-0006.log:289
mpi-rank-0007.log:0

vs

$> grep -c 'recieved converged status' mpi-rank-000*.log
mpi-rank-0000.log:0
mpi-rank-0001.log:91
mpi-rank-0002.log:101
mpi-rank-0003.log:97
mpi-rank-0004.log:169
mpi-rank-0005.log:174
mpi-rank-0006.log:220
mpi-rank-0007.log:187

I'm trying to clear my head over the weekend so I can rethink the whole communication scheme on Monday. Feel free to play around in my debugging branch feature/debug (torbjoernk/PFASST@cd4b7504a770436b549c56920ae595870cbf7ac7).

That branch also includes a simple integration of mpiP to print MPI statistics for debugging purposes.

danielru commented 9 years ago

Just out of curiosity, did you check whether it works with blocking SEND and RECV?

memmett commented 9 years ago

Status communication is always blocking...
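
To make the discussion concrete, a minimal sketch of a blocking status handshake between neighboring time ranks (generic MPI, not the actual PFASST communicator API; the flag semantics are assumed): each rank receives its predecessor's convergence flag with MPI_Recv and forwards its own with MPI_Send, so every send should have exactly one matching receive.

#include <mpi.h>

int exchange_converged(int my_converged, int rank, int size, MPI_Comm comm)
{
  const int tag = 1;
  int prev_converged = 1;  // rank 0 has no predecessor

  if (rank > 0) {
    MPI_Recv(&prev_converged, 1, MPI_INT, rank - 1, tag, comm, MPI_STATUS_IGNORE);
  }
  int status = (my_converged && prev_converged) ? 1 : 0;
  if (rank < size - 1) {
    MPI_Send(&status, 1, MPI_INT, rank + 1, tag, comm);
  }
  return status;
}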


torbjoernk commented 9 years ago

And because the status communication is blocking, I'm really puzzled by these mismatches. Either I'm way too stupid for this or something is going really wrong.

pancetta commented 9 years ago

But the communication of volume (spatial) data on all but the coarsest level is non-blocking, right?


memmett commented 9 years ago

I can't recreate the hangs locally, but I can see mismatched status send/recv. I'll look into it today...

torbjoernk commented 9 years ago

I think I found and fixed the problem; at least the given setup no longer hangs on my machine. The number of open requests grew too large. The fix is to make sure that pending non-blocking requests are cancelled and freed before new non-blocking ones are initiated. I'll prepare a PR for this.
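
A hedged sketch of the idea behind the fix (illustrative only, not the actual PR): before posting a new non-blocking send into a request slot, complete, or cancel and free, whatever request is still pending there, so the number of open requests stays bounded.

#include <mpi.h>

void post_send(double* buf, int count, int dest, int tag,
               MPI_Comm comm, MPI_Request& slot)
{
  if (slot != MPI_REQUEST_NULL) {
    int done = 0;
    MPI_Test(&slot, &done, MPI_STATUS_IGNORE);  // completes and frees it if the receiver already matched
    if (!done) {
      MPI_Cancel(&slot);                   // abandon the stale request
      MPI_Wait(&slot, MPI_STATUS_IGNORE);  // completing a cancelled request releases its handle
    }
  }
  MPI_Isend(buf, count, MPI_DOUBLE, dest, tag, comm, &slot);
}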

memmett commented 9 years ago

Fixed by #212.