Closed AlexanderKurtz closed 6 years ago
> give up after a certain number of attempts ... 10 seconds
What if a computation is so unbalanced that one process goes directly into a wait while the other process does 20 seconds of calculation? This can still be regular, correctly written code that obviously makes progress. To really catch deadlocks, it seems to me that a deadlock detection algorithm needs to be in place.
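To illustrate the distinction drawn here, the following is a minimal sketch of one possible deadlock check, assuming a simplified model in which every process waits on at most one other process. The function `has_deadlock` and the `waits_for` array are hypothetical names for this example and not part of LAIK. A process waiting on a peer that is still computing is not reported; only a cycle of mutual waits is.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model: waits_for[i] is the rank that process i is blocked on,
 * or -1 if process i is still computing. Values must lie in [-1, n). */
bool has_deadlock(const int *waits_for, size_t n)
{
    for (size_t start = 0; start < n; start++) {
        int cur = (int)start;
        /* Follow the chain of waits. A cycle through `start` has at most
         * n members, so n + 1 hops are enough to come back to it. */
        for (size_t hops = 0; hops <= n && cur >= 0; hops++) {
            cur = waits_for[cur];
            if (cur == (int)start)
                return true;   /* start transitively waits on itself */
        }
    }
    return false;              /* every chain ends at a computing process */
}
```

In this model, a slow peer (entry `-1`) never triggers a report no matter how long it computes, which is the property a pure timeout cannot provide.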
The maximum number of attempts is obviously configurable (`send_attempts` and `receive_attempts` in the configuration file), as is the time between attempts (`send_delay` and `receive_delay`)! The ~10 seconds are just the default!
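For readers unfamiliar with this kind of bounded wait, here is a minimal sketch of a receive that gives up after a fixed number of attempts. It assumes a plain file descriptor and `poll()`; the function `receive_with_retries` and its parameters are hypothetical and only mirror the roles of `receive_attempts` and `receive_delay`, not the backend's actual code.

```c
#include <errno.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: wait for data on fd for at most `attempts` polls of
 * `delay_ms` milliseconds each. Returns the number of bytes read, 0 on
 * end-of-stream, -1 on error, or -2 if every attempt timed out. */
ssize_t receive_with_retries(int fd, void *buf, size_t len,
                             int attempts, int delay_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };

    for (int i = 0; i < attempts; i++) {
        int ready = poll(&pfd, 1, delay_ms);
        if (ready < 0) {
            if (errno == EINTR)
                continue;      /* interrupted: counts as a used attempt */
            return -1;
        }
        if (ready == 0)
            continue;          /* this attempt timed out, try again */
        return read(fd, buf, len);
    }
    return -2;                 /* gave up after attempts * delay_ms */
}
```

With 100 attempts and a delay of 100 ms, the total wait is bounded by roughly 10 seconds, which matches the default mentioned above.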
Sure, but LAIK is a library. We should always have a default setting that allows correct applications to run to completion.
This PR brings the final performance improvements for the TCP backend after extensive testing on the HimMUC. Overall, I am quite pleased with the performance now; it is roughly equivalent to MPI. I originally thought this was already the case, but it turns out that testing on my laptop (with essentially no network delays) and testing on a real cluster (with very real network delays) are two very different things.
The performance improvements are mainly these four points:
- `backend.c` now does its send operations in parallel to its receive operations, which avoids the effective serialization of the previous approach
- `minimpi.c` now does the sends for the `MPI_Comm_split()` implementation in parallel to the receives, which avoids the same problem here
- `backend.c` no longer splits its sends/receives into 1 MiB chunks, but sends all the data in one go. This may consume more memory, but it reduces the number of messages, which means better performance.

There are also several stability improvements and bugfixes here. Most importantly, the backend will no longer block indefinitely while waiting for a message, but will give up after a certain number of attempts (by default 100 attempts at 0.1 second intervals). This means that both bugs and node errors should no longer lead to "hangs" but turn into proper errors after ~10 seconds!
That's it, if you have any questions, just ask!