geodynamics / aspect

A parallel, extensible finite element code to simulate convection in both 2D and 3D models.
https://aspect.geodynamics.org/

Possible MPI Deadlock in GMG preconditioner #4984

Closed gassmoeller closed 1 year ago

gassmoeller commented 2 years ago

As discussed yesterday with @tjhei, we (@jdannberg, @RanpengLi and myself) are investigating a problem that looks like an MPI deadlock inside the GMG preconditioner. We are still working on breaking it down to a simple model (we currently need 64-128 processes and more than 10 hours of runtime to reproduce it). I attach a log file and the stack traces of two different processes below.

Observations so far:

  1. The problem seems to be reproducible (running the same model stops in the same time step).
  2. The problem seems to occur consistently for all models with similar parameters. We have not checked significantly different models so far.
  3. Restarting a stopped model allows us to continue running further than the previous stop point.
  4. Switching from the GMG preconditioner to the AMG preconditioner resolves the issue (models that always stop with GMG run without deadlock when using AMG).
  5. We can successfully run the same model on different hardware, compiler, and MPI versions, while using the same ASPECT and deal.II. Environment that crashes: Intel 19.1.0.166 / openmpi 4.1.1. Environment that works: GCC 9.4.0 / openmpi 4.0.3.

Analyzing the stacktrace shows the following:

  1. Both processes are stuck within AffineConstraints::distribute, when creating a new distributed vector and calling Partitioner::set_ghost_indices, which calls ConsensusAlgorithms::NBX::run (just a simplified summary of the stacktrace).
  2. However, the two processes are stuck in different places that should not be simultaneously reachable:
    • One is stuck in the MPI_Barrier in the destructor of the ScopedLock created here. This suggests to me that this process is done with the function and is in the process of returning to the calling function.
    • The other is stuck in MPI_Test and the only MPI_Test I found in the algorithm is inside all_remotely_originated_receives_are_completed() here.
    • However, the MPI_Test in question checks whether all processes have completed the MPI_IBarrier that is placed further up in the function. So the first process must have passed this test already, while the second one is stuck. In other words, it looks like the return value of the MPI_Test makes some processes believe everyone has passed the MPI_IBarrier, while others are never notified that the MPI_IBarrier has been passed by every process (and therefore wait endlessly for completion). I am not sure how this can happen (is it possible to get here if one process throws an exception inside the function?).
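
For reference, here is a rough sketch of the nonblocking-consensus (NBX) pattern that ConsensusAlgorithms::NBX implements, written as my own simplified illustration rather than deal.II's actual code (the function shape and the message handling are hypothetical). It shows why the MPI_Test on the IBarrier request can only succeed once every rank has entered MPI_Ibarrier, so a rank that never gets there leaves all the others spinning:

```cpp
#include <mpi.h>
#include <vector>

// Rough sketch of the NBX ("nonblocking consensus") pattern used by
// ConsensusAlgorithms::NBX -- a simplified illustration, not deal.II's code.
// Each rank knows only whom it sends to; receivers are discovered by probing.
void nbx_sketch(MPI_Comm                              comm,
                const std::vector<int> &              targets,
                const std::vector<std::vector<char>> &send_buffers)
{
  // Start synchronous nonblocking sends to all known targets.
  std::vector<MPI_Request> send_requests(targets.size());
  for (unsigned int i = 0; i < targets.size(); ++i)
    MPI_Issend(send_buffers[i].data(),
               static_cast<int>(send_buffers[i].size()),
               MPI_CHAR, targets[i], /*tag*/ 0, comm, &send_requests[i]);

  MPI_Request barrier_request;
  bool        barrier_entered = false;
  int         everyone_done   = 0;

  while (everyone_done == 0)
    {
      // Answer any message that happens to be pending, whoever sent it.
      int        message_pending = 0;
      MPI_Status status;
      MPI_Iprobe(MPI_ANY_SOURCE, 0, comm, &message_pending, &status);
      if (message_pending != 0)
        {
          int count = 0;
          MPI_Get_count(&status, MPI_CHAR, &count);
          std::vector<char> recv_buffer(count);
          MPI_Recv(recv_buffer.data(), count, MPI_CHAR, status.MPI_SOURCE,
                   0, comm, MPI_STATUS_IGNORE);
          // ... unpack recv_buffer and process the request ...
        }

      if (!barrier_entered)
        {
          // MPI_Issend completes only once the matching receive has been
          // posted, so when all of our own sends are done we know our
          // messages have arrived and we can "signal_finish" via the IBarrier.
          int sends_done = 0;
          MPI_Testall(static_cast<int>(send_requests.size()),
                      send_requests.data(), &sends_done, MPI_STATUSES_IGNORE);
          if (sends_done != 0)
            {
              MPI_Ibarrier(comm, &barrier_request);
              barrier_entered = true;
            }
        }
      else
        {
          // all_remotely_originated_receives_are_completed(): this MPI_Test
          // only reports completion once *every* rank has entered the
          // IBarrier. A rank that never gets there (for example because it
          // threw an exception earlier) leaves all the others spinning here.
          MPI_Test(&barrier_request, &everyone_done, MPI_STATUS_IGNORE);
        }
    }
}
```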

Things we are testing at the moment:

Other ideas to test are appreciated. I will update this issue when we find out more.

Attachments: test.48589940.log.txt, 126267.txt, 126279.txt

gassmoeller commented 2 years ago

Reopening. We do not know yet if #4986 fixes this issue. GitHub automation was a bit overzealous.

tjhei commented 2 years ago

I have no suggestion at this point (I agree with all of your points). Maybe we are having a problem with an exception.

Maybe @peterrum has an idea...

gassmoeller commented 2 years ago

Additional information today:

tjhei commented 2 years ago
  • We likely need similar guards for that as here.

Agreed. That is what I would do as well. Let me know if you need help.
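
To make it concrete, here is the sort of guard I have in mind, sketched as my own interpretation rather than the actual deal.II change: if the code executed while holding a Utilities::MPI::CollectiveMutex throws on one rank, the other ranks never see the matching communication and wait forever, so the exception should be caught inside the critical section and turned into a hard abort:

```cpp
#include <deal.II/base/mpi.h>

// Sketch of the kind of guard discussed above (my interpretation, not the
// actual deal.II code): protect the communication inside the critical
// section so that one rank cannot throw and leave the others waiting for
// MPI calls that will never be matched.
void guarded_critical_section(const MPI_Comm comm)
{
  static dealii::Utilities::MPI::CollectiveMutex      mutex;
  dealii::Utilities::MPI::CollectiveMutex::ScopedLock lock(mutex, comm);

  try
    {
      // ... the communication that the mutex is supposed to protect ...
    }
  catch (...)
    {
      // One rank throwing here while the others proceed normally is exactly
      // the kind of mismatch that can produce the observed deadlock, so
      // abort the whole job loudly instead of letting the ranks drift apart.
      MPI_Abort(comm, 1);
    }
}
```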

peterrum commented 2 years ago

This does not yet tell us why the exception is thrown, it only removes the deadlock.

Any idea where the exception is thrown? I guess that if the collective mutex causes the deadlock, this means that the content guarded by the mutex is what causes the problem!?

gassmoeller commented 2 years ago

Any idea where the exception is thrown?

My current best guess is somewhere between here and here (lines 1606-1628). Some processes clearly reach the MPI_IBarrier inside signal_finish, while others do not (or at least some processes are later waiting for someone to reach the IBarrier).

Because of all the characteristics above, I suspect a resource leak (the deadlock is reproducible at a fixed timestep, but after restarting we can progress past the crashed timestep and only get stuck later).

I will let you know once I have some more information, we are currently testing dealii/dealii#14356 to see if it improves our error message.

gassmoeller commented 2 years ago

Using dealii/dealii#14364 we made some progress in tracking down the problem. Still working on digging deeper, but here is what we know now:

The one exception we have tracked down (there seems to be another one that still happens) is a boost exception that is raised from inside Utilities::unpack in this line.

The exception we saw is:

----------------------------------------------------
Exception on rank 20 on processing:
no read access: iostream error
Aborting!
----------------------------------------------------

Which is raised from inside Boost. @RanpengLi and I went in with gdb and found that apparently an empty std::vector<char> (length zero) is about to be unpacked into a std::vector<std::pair<unsigned int, unsigned int>>. It looked like both cbegin and cend, the input arguments to unpack, are null pointers. I added a fix to Utilities::unpack to simply return a default-constructed object if cbegin == cend (which should be safe even with null pointers). This seems to have fixed most exceptions, but at least one process is still throwing an exception and we are currently tracking that down.
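
For reference, the guard looks roughly like the following, written here as a free-standing wrapper rather than the actual change to Utilities::unpack (the name unpack_or_default is made up for this sketch):

```cpp
#include <deal.II/base/utilities.h>

#include <vector>

// Sketch of the empty-buffer guard described above (not the actual patch):
// a thin wrapper around Utilities::unpack that returns a default-constructed
// object when the range is empty, instead of handing boost a null range.
template <typename T>
T unpack_or_default(const std::vector<char>::const_iterator &cbegin,
                    const std::vector<char>::const_iterator &cend,
                    const bool allow_compression = true)
{
  if (cbegin == cend)
    return T(); // empty (possibly null) range: nothing to deserialize

  return dealii::Utilities::unpack<T>(cbegin, cend, allow_compression);
}
```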

I attach the backtrace in case someone has an idea. exception_backtrace.txt

Currently open questions:

tjhei commented 2 years ago

I went in with gdb and found that apparently an empty std::vector<char> (length zero) is about to be unpacked into a std::vector<std::pair<unsigned int, unsigned int>>. It looks like boost may not like the initialization in this line

Yes, de-referencing an empty range is undefined behavior. I will play with this a little bit and report back.
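
Just to spell out the undefined behavior for a zero-length buffer (a trivial illustration, unrelated to the actual deal.II code paths):

```cpp
#include <vector>

int main()
{
  const std::vector<char> buffer;            // zero-length receive buffer

  // data() may legally return a null pointer for an empty vector, and
  // expressions like *buffer.begin() or &buffer[0] are undefined behavior.
  const char *begin = buffer.data();
  const char *end   = begin + buffer.size(); // begin == end, possibly both null

  // Any deserialization code therefore has to special-case the empty range
  // before constructing a stream or archive source over [begin, end).
  return (begin == end) ? 0 : 1;
}
```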

tjhei commented 2 years ago

I tried compressing an empty std::vector<std::pair>, but it compresses to 53 bytes, not to 0 bytes. So, to me, receiving an empty buffer sounds like a bug.
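
As a sanity check along the same lines, one can verify with deal.II's Utilities::pack/unpack that even an empty container serializes to a nonzero number of bytes (a minimal sketch; the exact byte count depends on compression settings and the boost version):

```cpp
#include <deal.II/base/utilities.h>

#include <iostream>
#include <utility>
#include <vector>

int main()
{
  using DataType = std::vector<std::pair<unsigned int, unsigned int>>;

  // Pack an empty vector of pairs, the type that failed to unpack above.
  const DataType          empty_data;
  const std::vector<char> buffer = dealii::Utilities::pack(empty_data);

  // Even an empty container serializes to a nonzero number of bytes
  // (archive header plus the stored element count), so a genuinely
  // zero-length receive buffer cannot be a legitimately packed object.
  std::cout << "packed size in bytes: " << buffer.size() << std::endl;

  // Round-tripping works as long as the buffer is a real archive.
  const DataType unpacked = dealii::Utilities::unpack<DataType>(buffer);
  std::cout << "unpacked element count: " << unpacked.size() << std::endl;
}
```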

bangerth commented 1 year ago

@gassmoeller I'm not sure this is fixed. Do we need to keep this open here? Is there anything we can/need to do in ASPECT?

gassmoeller commented 1 year ago

I do not know if it is fixed, but @RanpengLi reported that after a system update our cluster doesn't show the problem anymore, so it might have been a configuration problem (or an interaction of the library versions we used). We can close the issue for now.