Closed: chrisrichardson closed this issue 3 years ago
Thank you for the bug report.
This is confirmed using g++ 9 and OpenMPI, so it is not compiler or MPI library specific.
The segfault is triggered by the fetch_points/fetch_all call in your example, but it actually stems from the creation of the internal message structure within the library, which is triggered by the barrier() function, so it likely points to a more deeply rooted bug.
This will be looked into as a matter of urgency and this report updated once a fix is found.
A quick point for the moment: there is a mismatch in the chrono_sampler used. When using the specialisms approach to declaring MUI objects, it is important to match types; the interface and point types are 3d (3-dimensional, double data type) while the chrono_sampler is 1d. This isn't the underlying problem, though.
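For illustration, a minimal sketch of matched 3d declarations, assuming the MUI 1.x header and typedef names (mui.h, uniface3d, sampler_exact3d, chrono_sampler_exact3d); the URI, data name, location and time stamp are placeholders, not taken from the example:

```cpp
#include "mui.h"

int main()
{
    // All specialisms match: 3-dimensional, double data type.
    mui::uniface3d               interface( "mpi://domain2/ifs" );
    mui::sampler_exact3d<double> spatial_sampler;
    mui::chrono_sampler_exact3d  chrono_sampler;   // 3d, not chrono_sampler_exact1d

    mui::point3d location( 0.0, 0.0, 0.0 );

    // In a coupled run the peer domain would push() and commit() at t = 0
    // before this fetch returns.
    double value = interface.fetch( "data", location, 0.0, spatial_sampler,
                                    chrono_sampler );
    (void)value;
    return 0;
}
```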
Thanks @SLongshaw. I will update my chrono_sampler dimensionality. Good luck finding the bug.
Hi @chrisrichardson, I think this should now be resolved in both branches.
To summarise, you managed to uncover a situation that doesn't normally arise when using MUI: the sending rank was able to exit before the receiving rank had completed its fetch, and because MUI relies on a non-blocking send / blocking receive design, MPI buffers etc. were lost as the sending rank exited. Coupled designs normally mean this tends not to happen (admittedly not specifically by design).
In reality, though, this represented a bug in the general design of MUI: there should always be a corresponding MPI wait (or equivalent) call (which there was), but there was nothing that actually blocked on it to ensure completion.
The solution implemented is a new blocking wait on MPI_Isend completion, but with a built-in timer to avoid deadlock.
The problem here is that the original design was the safest and most general approach but inherently had this potential problem. The new design is more "correct" in MPI terms but reduces the generality of the library. For most scenarios this shouldn't make any difference, but for the few where it does, the addition of a timeout and user warning messages is the (inelegant) solution.
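A generic sketch of the timed completion test described above, written as plain MPI rather than MUI's actual internals; the timeout value, function name and warning text are illustrative assumptions:

```cpp
#include <mpi.h>
#include <cstdio>

// Wait for a non-blocking send to complete, but give up after `timeout`
// seconds and warn instead of deadlocking.
void wait_with_timeout( MPI_Request& request, double timeout )
{
    const double start = MPI_Wtime();
    int complete = 0;
    while ( !complete ) {
        MPI_Test( &request, &complete, MPI_STATUS_IGNORE );
        if ( !complete && MPI_Wtime() - start > timeout ) {
            std::fprintf( stderr,
                "MUI warning: send not completed after %.1f s, continuing\n",
                timeout );
            return;  // leave the request pending rather than deadlock
        }
    }
}
```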
This change will remain permanent as long as further test cases don't highlight any unforeseen major issues with the new solution. If they do, there is a work-around using the original approach: ensure that the sending side finishes with a call to the barrier() function at a time stamp where it hasn't yet issued a fetch() command and, on the receiving side, send a final empty commit() at that time to unblock both sides, in effect adding a synchronisation barrier (see the sketch below).
Please try this change out and feel free to re-open this issue if it does not resolve the problem for you; for now, though, I will close it.
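A hedged sketch of that work-around; the domain names, time stamps, data name and values are placeholders, and the API spelling assumes MUI 1.x (uniface3d, push/commit/fetch/barrier, chrono_sampler_exact3d):

```cpp
// Illustrative only: two coupled ranks, launched as an MPMD job, e.g.
//   mpirun -np 1 ./sync mpi://domain1/ifs : -np 1 ./sync mpi://domain2/ifs
#include "mui.h"
#include <string>

int main( int argc, char** argv )
{
    const std::string uri = argv[1];
    mui::uniface3d               interface( uri );
    mui::sampler_exact3d<double> spatial_sampler;
    mui::chrono_sampler_exact3d  chrono_sampler;
    mui::point3d                 location( 0.0, 0.0, 0.0 );
    const double final_time = 1.0;

    if ( uri.find( "domain1" ) != std::string::npos ) {
        // Sending side: push the last real data, then block at a later
        // time stamp until the receiver has finished fetching.
        interface.push( "data", location, 0.618 );
        interface.commit( final_time );
        interface.barrier( final_time + 1.0 );  // released by the empty commit below
    } else {
        // Receiving side: fetch, then issue an empty commit at the barrier
        // time stamp to release the sender before either rank exits.
        double value = interface.fetch( "data", location, final_time,
                                        spatial_sampler, chrono_sampler );
        (void)value;
        interface.commit( final_time + 1.0 );   // empty commit, unblocks both sides
    }
    return 0;
}
```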
The fix for this problem eventually caused a runtime problem in a specific case where the new blocking wait added significant overhead. The non-blocking wait has therefore been re-instated; however, a final blocking test loop has been added to the destructor of the MPI comm class. This has been tested against the issue raised here and still offers a fix, while removing the need for a blocking MPI test loop after every send.
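A generic sketch of that idea, again plain MPI rather than MUI's actual comm class; the class and member names and the timeout value are assumptions:

```cpp
#include <mpi.h>
#include <vector>

// Hypothetical comm wrapper: sends stay non-blocking during the run, and the
// destructor loops over any outstanding requests before the object (and the
// buffers it owns) disappear.
class comm_mpi_sketch {
public:
    // Non-blocking send: the request is recorded so the destructor can
    // confirm completion before the buffers go away.
    void isend( const void* buf, int count, int dest, int tag, MPI_Comm comm )
    {
        MPI_Request req;
        MPI_Isend( buf, count, MPI_BYTE, dest, tag, comm, &req );
        requests_.push_back( req );
    }

    ~comm_mpi_sketch()
    {
        const double start   = MPI_Wtime();
        const double timeout = 10.0;             // illustrative value
        bool pending = !requests_.empty();
        while ( pending && MPI_Wtime() - start < timeout ) {
            pending = false;
            for ( auto& req : requests_ ) {
                int complete = 0;
                MPI_Test( &req, &complete, MPI_STATUS_IGNORE );
                if ( !complete ) pending = true;
            }
        }
    }

private:
    std::vector<MPI_Request> requests_;          // one entry per MPI_Isend
};
```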
I have been trying to adapt example 8 (fetchall.cpp) to my purposes, but found the following strange behaviour. When running the code below with
mpirun -np 1 ./fetchall mpi://domain1/ifs 0.618 : -np 1 ./fetchall mpi://domain2/ifs 1.414
it will crash with a segfault. However, it works OK for smaller amounts of data (e.g. n = 1000).
Code (minimum example based on fetchall.cpp) compiled with g++ 10 and MPICH, using the same Makefile as the fetchall example.