@torbjoernk Which MPI implementation are you using? Let me guess: OpenMPI?
Seems to appear on JUQUEEN as well, no OpenMPI here I think.
@pancetta Do you have an llsubmit script lying around that I can use to test this?
@torbjoernk could you pack one of these setups and send it to @memmett ?
@torbjoernk while you are at it, could you also fix this bug and let @danielru know when you are done? i'm sure @pancetta will be pleased. yours truly, @memmett.
+1
I am following this... No idea if this helps in any way, but I remember seeing similar behaviour in my Fortran Parareal quite some time ago. If I remember correctly, it had something to do with the arrays of MPI requests for the non-blocking sends and receives: I stored the requests in arrays but only ever used the first N entries, and MPI did not like that. Unfortunately I can't remember the details.
This was in Fortran, however, so no idea whether it is helpful here at all.
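For what it's worth, that pitfall looks roughly like this in C++/MPI terms (a sketch only; the function, the vector and `n_active` are made up for illustration, not code from either project):

```cpp
#include <mpi.h>
#include <vector>

// Sketch of the pitfall: the request array is sized for the maximum
// number of requests, but only the first n_active entries were ever
// filled by MPI_Isend/MPI_Irecv.
void wait_for_neighbours(std::vector<MPI_Request>& requests, int n_active)
{
  // Wrong: waits on every slot, including uninitialized tail entries,
  // which is undefined behaviour and can hang or corrupt state:
  //   MPI_Waitall((int)requests.size(), requests.data(), MPI_STATUSES_IGNORE);

  // Correct: wait only on the requests that were actually posted
  // (or initialize unused slots to MPI_REQUEST_NULL up front).
  MPI_Waitall(n_active, requests.data(), MPI_STATUSES_IGNORE);
}
```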
So I've done a little debugging, using a custom installation of MPICH with a few additional debug symbols. Both of the above setups do terminate correctly, but at the end I'm getting messages like the following:
In direct memory block for handle type REQUEST, 8 handles are still allocated
In indirect memory block 0 for handle type REQUEST, 330 handles are still allocated
In direct memory block for handle type COMM, 1 handles are still allocated
In direct memory block for handle type REQUEST, 8 handles are still allocated
In indirect memory block 0 for handle type REQUEST, 25 handles are still allocated
In direct memory block for handle type COMM, 1 handles are still allocated
In direct memory block for handle type REQUEST, 1 handles are still allocated
In direct memory block for handle type COMM, 2 handles are still allocated
In direct memory block for handle type REQUEST, 8 handles are still allocated
In indirect memory block 0 for handle type REQUEST, 25 handles are still allocated
In direct memory block for handle type COMM, 1 handles are still allocated
In direct memory block for handle type REQUEST, 8 handles are still allocated
In indirect memory block 0 for handle type REQUEST, 55 handles are still allocated
In direct memory block for handle type COMM, 1 handles are still allocated
In direct memory block for handle type REQUEST, 8 handles are still allocated
In indirect memory block 0 for handle type REQUEST, 25 handles are still allocated
In direct memory block for handle type COMM, 1 handles are still allocated
In direct memory block for handle type REQUEST, 8 handles are still allocated
In indirect memory block 0 for handle type REQUEST, 25 handles are still allocated
In direct memory block for handle type COMM, 1 handles are still allocated
There are a couple of non-blocking sends and/or receives which are never completed. I'm now trying to track these down.
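For reference, the kind of situation that triggers those "handles are still allocated" messages can be reproduced with a toy program like the one below (illustrative only, not PFASST code): a non-blocking receive that is never matched, tested, waited on, or cancelled before MPI_Finalize.

```cpp
#include <mpi.h>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int buffer = 0;
  MPI_Request request;

  // Posted on every rank, but no rank ever sends a matching message,
  // so the request never completes.
  MPI_Irecv(&buffer, 1, MPI_INT, MPI_ANY_SOURCE, 99, MPI_COMM_WORLD, &request);

  // ... no MPI_Wait / MPI_Test / MPI_Cancel on 'request' ...

  // A debug build of MPICH reports the leaked REQUEST handle here.
  MPI_Finalize();
  return 0;
}
```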
A little more debugging data: I logged every status communication call (send and receive):
$> grep -c 'sending converged status' mpi-rank-000*.log
mpi-rank-0000.log:91
mpi-rank-0001.log:101
mpi-rank-0002.log:107
mpi-rank-0003.log:169
mpi-rank-0004.log:174
mpi-rank-0005.log:220
mpi-rank-0006.log:289
mpi-rank-0007.log:0
vs
$> grep -c 'recieved converged status' mpi-rank-000*.log
mpi-rank-0000.log:0
mpi-rank-0001.log:91
mpi-rank-0002.log:101
mpi-rank-0003.log:97
mpi-rank-0004.log:169
mpi-rank-0005.log:174
mpi-rank-0006.log:220
mpi-rank-0007.log:187
I'll try to clear my head over the weekend and rethink the whole communication scheme on Monday. Feel free to play around in my debugging branch feature/debug (torbjoernk/PFASST@cd4b7504a770436b549c56920ae595870cbf7ac7).
That branch also has a new, easy integration of mpiP for printing MPI statistics for debugging purposes.
Just out of curiosity, did you check whether it works with blocking SEND and RECV?
Status communication is always blocking...
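For context, the blocking pattern looks roughly like the sketch below: each rank receives the converged flag from its predecessor and passes its own on to its successor. The function name, the tag and the int-encoded flag are assumptions for illustration, not PFASST's actual interface.

```cpp
#include <mpi.h>

// Illustrative blocking converged-status exchange along the time ranks.
bool exchange_converged_status(bool my_converged, int rank, int size,
                               MPI_Comm comm, int tag)
{
  int prev_converged = 1;  // rank 0 has no predecessor

  if (rank > 0) {
    // blocking receive of the predecessor's converged flag
    MPI_Recv(&prev_converged, 1, MPI_INT, rank - 1, tag, comm,
             MPI_STATUS_IGNORE);
  }

  int flag = (my_converged && prev_converged) ? 1 : 0;

  if (rank < size - 1) {
    // blocking send of the combined flag to the successor
    MPI_Send(&flag, 1, MPI_INT, rank + 1, tag, comm);
  }

  return flag != 0;
}
```

With a pattern like this, one would expect every logged send on rank i to be matched by a logged receive on rank i+1, which is what makes the mismatched counts above so puzzling.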
And because the status communication is blocking, I'm really puzzled about these mismatches. Either I'm way too stupid for this or something is going really wrong.
But the communication of volume (spatial) data on all but the coarsest level is non-blocking, right?
I can't recreate the hangs locally, but I can see mismatched status send/recv. I'll look into it today...
I think I found and fixed the problem; at least the given setup does not hang any more on my machine. The number of open requests grew too large. The fix is to cancel and free still-pending non-blocking requests before initiating new non-blocking ones. I'll prepare a PR for this.
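The idea behind the fix, sketched in isolation (the one-request-per-slot layout and the function are assumptions for illustration, not the actual patch):

```cpp
#include <mpi.h>

// Before reusing a request slot (and its buffer) for a new non-blocking
// send, make sure any previous request on that slot is completed or
// cancelled, so its handle is released instead of piling up.
// Callers are expected to initialize 'request' to MPI_REQUEST_NULL.
void post_nonblocking_send(int* buffer, int dest, int tag,
                           MPI_Comm comm, MPI_Request& request)
{
  if (request != MPI_REQUEST_NULL) {
    int completed = 0;
    MPI_Test(&request, &completed, MPI_STATUS_IGNORE);
    if (!completed) {
      // the message will never be matched any more: withdraw it
      MPI_Cancel(&request);
      // completes (possibly as cancelled) and frees the handle
      MPI_Wait(&request, MPI_STATUS_IGNORE);
    }
  }

  // request is now MPI_REQUEST_NULL, so buffer and handle can be reused
  MPI_Isend(buffer, 1, MPI_INT, dest, tag, comm, &request);
}
```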
Fixed by #212.
Sometimes, PFASST hangs in random deadlocks.
For example with
it usually hangs around time step 86 and with
it hangs around time step 416.