Reopening. We do not know yet if #4986 fixes this issue. Github automation was a bit overzealous.
I have no suggestion at this point (I agree with all of your points). Maybe we are having a problem with an exception.
Maybe @peterrum has an idea...
Additional information today:
- We likely need similar guards for that as we have here.
Agreed. That is what I would do as well. Let me know if you need help.
This does not yet tell us why the exception is thrown; it only removes the deadlock.
Any idea where the exception is thrown? I guess if the collective mutex causes the deadlock, this means that the content guarded by the mutex causes a problem!?
> Any idea where the exception is thrown?
My current best guess is somewhere between here and here (lines 1606-1628). Some processes clearly reach the MPI_IBarrier inside signal_finish, while others do not (or at least some processes are later waiting for someone to reach the IBarrier).
Because of all the characteristics above I suspect a resource leak (reproducible at a fixed timestep, but when restarting we can progress past the crashed timestep and get stuck later).
I will let you know once I have some more information; we are currently testing dealii/dealii#14356 to see if it improves our error message.
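To make that failure mode concrete, here is a minimal, self-contained sketch in plain MPI (not deal.II/ASPECT code; the rank number and error text are only borrowed from the report above for illustration): one rank throws before reaching the non-blocking barrier, and all other ranks then poll MPI_Test forever.

```cpp
#include <mpi.h>

#include <iostream>
#include <stdexcept>

int main(int argc, char **argv)
{
  MPI_Init(&argc, &argv);

  int rank = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  try
    {
      // Hypothetical local work; pretend rank 20 hits an unpack error
      // (the rank and message mirror the report above; you need at least
      // 21 ranks to trigger the hang).
      if (rank == 20)
        throw std::runtime_error("no read access: iostream error");

      // "signal_finish": enter the non-blocking barrier ...
      MPI_Request barrier_request;
      MPI_Ibarrier(MPI_COMM_WORLD, &barrier_request);

      // ... then poll until every rank has entered it. Rank 20 never does,
      // so this loop spins forever on all other ranks.
      int all_ranks_finished = 0;
      while (all_ranks_finished == 0)
        MPI_Test(&barrier_request, &all_ranks_finished, MPI_STATUS_IGNORE);
    }
  catch (const std::exception &exc)
    {
      // The throwing rank never enters the barrier. Whether it aborts, exits,
      // or hangs in its own error handling, the ranks polling above keep
      // waiting for a completion signal that never comes.
      std::cerr << "Exception on rank " << rank << ": " << exc.what() << '\n';
      return 1; // leaving without MPI_Finalize is itself erroneous, but that
                // is the point: the failure path never re-joins the collective
    }

  MPI_Finalize();
  return 0;
}
```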
Using dealii/dealii#14364 we made some progress in tracking down the problem. Still working on digging deeper, but here is what we know now:
The one exception we have tracked down (there seems to be another one that still happens) is a boost exception that is raised from inside Utilities::unpack in this line.
The exception we saw is:
----------------------------------------------------
Exception on rank 20 on processing:
no read access: iostream error
Aborting!
----------------------------------------------------
This is raised from inside Boost. @RanpengLi and I went in with gdb and found that apparently an empty std::vector<char> (length zero) is about to be unpacked into a std::vector<std::pair<unsigned int, unsigned int>>. It looked like both cbegin and cend, the input arguments to unpack, are null pointers. I added a fix to Utilities::unpack to just return a default-constructed object if cbegin == cend (which should be safe even with null pointers). This seems to have fixed most exceptions, but at least one process is still throwing an exception and we are currently tracking that down.
I attach the backtrace in case someone has an idea. exception_backtrace.txt
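For reference, a minimal sketch of the kind of guard described above, written directly against boost (a simplified stand-in, not the actual Utilities::unpack implementation; it omits any decompression step the real function may perform):

```cpp
#include <boost/archive/binary_iarchive.hpp>
#include <boost/iostreams/device/array.hpp>
#include <boost/iostreams/stream.hpp>
#include <boost/serialization/utility.hpp>
#include <boost/serialization/vector.hpp>

#include <utility>
#include <vector>

// Hypothetical stand-in for Utilities::unpack: deserialize a T from the raw
// character range [cbegin, cend) via a boost binary archive.
template <typename T>
T unpack_range(const char *cbegin, const char *cend)
{
  // The guard: an empty range (including two null pointers) carries no data,
  // so return a default-constructed object instead of handing boost an empty
  // stream, which throws "no read access: iostream error".
  if (cbegin == cend)
    return T();

  boost::iostreams::stream<boost::iostreams::basic_array_source<char>> in(
    cbegin, cend);
  boost::archive::binary_iarchive archive(in);

  T object;
  archive >> object;
  return object;
}

int main()
{
  // The case from the backtrace: a zero-length buffer unpacked into a vector
  // of index pairs now yields an empty vector instead of an exception.
  using pair_vector = std::vector<std::pair<unsigned int, unsigned int>>;
  const pair_vector v = unpack_range<pair_vector>(nullptr, nullptr);
  return v.empty() ? 0 : 1;
}
```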
Currently open questions:
- cbegin and/or cend are null. Should we do something about that?
- What should unpack do if there is no data to unpack? Return a default-constructed empty object? Or crash?

> I went in with gdb and found that apparently an empty std::vector<char> (length zero) is about to be unpacked into a std::vector<std::pair<unsigned int, unsigned int>>.

It looks like boost may not like the initialization in this line.
Yes, de-referencing an empty range is undefined behavior. I will play with this a little bit and report back.
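As a small, generic C++ illustration of that point (not the actual deal.II code path):

```cpp
#include <vector>

int main()
{
  std::vector<char> buffer; // an empty receive buffer, as seen in gdb

  // Dereferencing into an empty vector is undefined behavior:
  // char &c = buffer[0];        // UB
  // char &c = *buffer.begin();  // UB

  // Asking for the (empty) range itself is fine, but both pointers may be
  // null, which matches the cbegin/cend observation above.
  const char *begin = buffer.data();
  const char *end   = begin + buffer.size();

  return (begin == end) ? 0 : 1;
}
```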
I tried compressing an empty std::vector<std::pair>, but it compresses to 53 bytes, not to 0 bytes. So, to me it sounds like receiving an empty buffer points to a bug.
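A quick way to reproduce that measurement, assuming deal.II's Utilities::pack is used for the serialization (the exact byte count can differ with library version and compression settings):

```cpp
#include <deal.II/base/utilities.h>

#include <iostream>
#include <utility>
#include <vector>

int main()
{
  // Serialize an empty vector of index pairs. Even an empty container
  // produces a non-empty buffer (archive/compression headers), so a
  // zero-length receive buffer cannot be a legitimately packed empty
  // message and points to a bug further upstream.
  const std::vector<std::pair<unsigned int, unsigned int>> empty;
  const std::vector<char> buffer = dealii::Utilities::pack(empty);

  std::cout << "packed size: " << buffer.size() << " bytes" << std::endl;
  return 0;
}
```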
@gassmoeller I'm not sure this is fixed. Do we need to keep this open here? Is there anything we can/need to do in ASPECT?
I do not know if it is fixed, but @RanpengLi reported that after a system update our cluster doesn't show the problem anymore, so it might have been a configuration problem (or an interaction of the library versions we used). We can close the issue for now.
As discussed yesterday with @tjhei, we (@jdannberg, @RanpengLi and myself) are investigating a problem that looks like an MPI deadlock inside the GMG preconditioner. We are still working to break it down to a simple model (we currently need 64-128 processes and more than 10 hours of runtime to reproduce). I attach a log.txt and two stack traces of two different processes below.
Observations so far:
Analyzing the stacktrace shows the following: the deadlocked processes are inside Partitioner::set_ghost_indices, which calls ConsensusAlgorithms::NBX::run (just a simplified summary of the stacktrace), and they appear to be waiting in all_remotely_originated_receives_are_completed() here.
Things we test at the moment:
Other ideas to test are appreciated. I will update this issue when we find out more.
test.48589940.log.txt 126267.txt 126279.txt