envelope-project / laik

Final performance improvements for the TCP backend #147

Closed AlexanderKurtz closed 6 years ago

AlexanderKurtz commented 6 years ago

This PR brings the final performance improvements for the TCP backend after extensive testing on the HimMUC. Overall, I am quite pleased with the performance; it is now roughly equivalent to MPI. I originally thought this was already the case, but it turns out that testing on my laptop (with essentially no network delays) and testing on a real cluster (with very real network delays) are two very different things.

The performance improvements are mainly these four points:

  1. backend.c now does its send operations in parallel to its receive operations, which avoids the effective serialization of the previous approach
  2. similarly, minimpi.c now does the sends for the MPI_Comm_split() implementation in parallel to the receives, which avoids the same problem here
  3. backend.c no longer splits its sends/receives into 1MiB chunks, but sends all the data in one go. This may consume more memory, but it reduces the number of messages, which means better performance.
  4. Both the "client" and "server" parts of the messenger implementation are now fully threaded, using 4 threads each by default. This means that one slow sender/receiver no longer blocks other messages from being delivered.
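
To illustrate points 1 and 2: the sketch below is not the code from this PR, just one way to overlap sending and receiving on a single connection. It swaps in a non-blocking socket multiplexed with poll() rather than any particular threading scheme, and the helper name `exchange` is made up for this example. The idea is that neither direction has to finish before the other may make progress, so two peers that both have a large message to send cannot stall each other.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper (not part of LAIK): send out_len bytes from `out` and
 * receive in_len bytes into `in` over one connected stream socket, making
 * progress on whichever direction is ready. Returns 0 on success, -1 on
 * error or if the peer closes the connection early. A production version
 * would also need to deal with SIGPIPE. */
int exchange(int fd, const char *out, size_t out_len, char *in, size_t in_len)
{
    assert(fd >= 0);

    /* Switch to non-blocking mode so a full send buffer cannot stall us. */
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    size_t sent = 0, got = 0;
    int rc = 0;

    while (sent < out_len || got < in_len) {
        struct pollfd p = { .fd = fd, .events = 0 };
        if (sent < out_len) p.events |= POLLOUT;
        if (got < in_len)   p.events |= POLLIN;

        if (poll(&p, 1, -1) < 0) {
            if (errno == EINTR) continue;
            rc = -1;
            break;
        }

        if (p.revents & POLLOUT) {
            ssize_t n = send(fd, out + sent, out_len - sent, 0);
            if (n > 0) {
                sent += (size_t)n;
            } else if (n < 0 && errno != EAGAIN && errno != EWOULDBLOCK
                             && errno != EINTR) {
                rc = -1;
                break;
            }
        }

        if (p.revents & POLLIN) {
            ssize_t n = recv(fd, in + got, in_len - got, 0);
            if (n > 0) {
                got += (size_t)n;
            } else if (n == 0) {
                rc = -1;            /* peer closed before sending everything */
                break;
            } else if (errno != EAGAIN && errno != EWOULDBLOCK
                       && errno != EINTR) {
                rc = -1;
                break;
            }
        }

        /* Error or hang-up with nothing left to read or write. */
        if (!(p.revents & (POLLIN | POLLOUT)) &&
            (p.revents & (POLLERR | POLLHUP | POLLNVAL))) {
            rc = -1;
            break;
        }
    }

    fcntl(fd, F_SETFL, flags);      /* restore the original file status flags */
    return rc;
}
```

With a strictly "send everything, then receive" order, two peers exchanging messages larger than the socket buffers would each wait for the other to start reading; interleaving the two directions avoids that.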

There are also several stability improvements and bugfixes here. Most importantly, the backend no longer blocks indefinitely while waiting for a message, but gives up after a certain number of attempts (by default 100 attempts at 0.1 second intervals). This means that neither bugs nor node errors should lead to "hangs" any more; they turn into proper errors after ~10 seconds!
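
As a rough illustration of the "give up after N attempts" behaviour (a minimal sketch, not the actual backend code; the helper name `wait_readable` and its parameters are made up for this example), a bounded wait on a file descriptor could look like this:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical helper (not part of LAIK): wait until `fd` becomes readable,
 * trying at most `attempts` times and waiting up to `delay_ms` milliseconds
 * per attempt. Returns 0 once data is available and -1 if all attempts are
 * used up or the descriptor reports an error. */
int wait_readable(int fd, int attempts, int delay_ms)
{
    assert(attempts >= 0 && delay_ms >= 0);

    struct pollfd p = { .fd = fd, .events = POLLIN };

    for (int i = 0; i < attempts; i++) {
        int r = poll(&p, 1, delay_ms);
        if (r > 0 && (p.revents & POLLIN))
            return 0;               /* a message (or at least data) arrived */
        if (r > 0)
            return -1;              /* POLLERR/POLLHUP/POLLNVAL without data */
        if (r < 0 && errno != EINTR)
            return -1;              /* poll() itself failed */
        /* r == 0 (timeout) or EINTR: counts as one failed attempt */
    }
    return -1;                      /* gave up: report an error, don't hang */
}
```

Calling `wait_readable(fd, 100, 100)` corresponds to the defaults described above: 100 attempts of up to 0.1 seconds each, so roughly 10 seconds in total before the caller gets an error.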

That's it, if you have any questions, just ask!

weidendo commented 6 years ago

> give up after a certain number of attempts ... 10 seconds

What if a computation is so unbalanced that one process goes directly into a wait while the other process does 20 seconds of calculation? This can still be regular, correctly written code that is obviously making progress. To really catch deadlocks, it seems to me that a deadlock detection algorithm needs to be in place.

AlexanderKurtz commented 6 years ago

The maximum number of attempts is obviously configurable (send_attempts and receive_attempts in the configuration file), as is the time between attempts (send_delay and receive_delay)! The ~10 seconds are just the default!

weidendo commented 6 years ago

Sure, but LAIK is a library. We should always have a default setting that allows correct applications to run to completion.