POETSII / Orchestrator

The Orchestrator is the configuration and run-time management system for POETS platforms.

Mothership freezes with some large problem sizes, deadlocking the system #247

Open heliosfa opened 3 years ago

heliosfa commented 3 years ago

Background

On 1.0.0-alpha, the Mothership has a propensity for freezing with some larger problems on the 8-box system.

This manifests with the Reactive application when there are 50 nodes and 1001 timesteps (50050 devices) and the application is constrained to two devices per thread, giving 25025 threads total. The application seems to "lock up" at random points during execution and the Orchestrator hangs on exit. No instrumentation is produced.

Everything 50 nodes and larger fails with the same symptoms. Constraining the 50-node problem to three devices per thread works consistently.

Interestingly, 40 nodes (40040 devices) constrained to one device per thread (40040 threads) works, while 80 nodes (80080 devices) constrained to two devices per thread (40040 threads) does not.
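For reference, the device and thread counts quoted above can be reproduced with a little arithmetic. This is only a sketch: `device_count` and `thread_count` are hypothetical helper names, and rounding up to whole threads is an assumption about how the placement constraint behaves, not something taken from the Orchestrator source.

```cpp
#include <cassert>
#include <cstdint>

// Timesteps per node in the Reactive runs described in this issue.
constexpr std::uint32_t kTimesteps = 1001;

// Total devices for a given node count (nodes x timesteps).
constexpr std::uint32_t device_count(std::uint32_t nodes) {
    return nodes * kTimesteps;
}

// Threads needed when at most devices_per_thread devices share a thread.
// Rounding up is an assumption; the real placer may distribute differently.
inline std::uint32_t thread_count(std::uint32_t nodes,
                                  std::uint32_t devices_per_thread) {
    assert(devices_per_thread > 0);
    const std::uint32_t devices = device_count(nodes);
    return (devices + devices_per_thread - 1) / devices_per_thread;
}
```

Under these assumptions, `thread_count(40, 1)` and `thread_count(80, 2)` both give 40040, which is why the differing behaviour of those two cases points away from thread count alone as the trigger.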

Digging

Initially, I was under the impression that this was an issue within the Softswitch and spent a lot of time trying to debug it. As time went on, it looked more and more like a Mothership issue, so @mvousden took up the torch.

@mvousden found that the Backend input broker thread was hanging while waiting for a mutex lock (the Debug input broker thread was also NOT hanging when it should have done). The mutex was not being released by the Backend output broker thread because that thread was stuck in a blocking send.

Further digging revealed that the Mothership was in the process of sending the barrier release packets when it became blocked. This stopped the Mothership from servicing received packets, causing the network to back up. As a result, the Backend output broker could never send the packet it was trying to send.

In other words, a perfect storm for a network deadlock.

TL;DR: Mothership became blocked sending barrier release packets, application floods network with actual traffic (inc. the Mothership), everything gets stuck. We did not go to the moon today.

The way forward

@mvousden has identified a couple of ways forward:

  1. Increase the Softswitch barrier release delay (the time the Softswitch waits after receiving a barrier release packet before moving into the main loop) to 50,000 loop iterations. This number was plucked out of the air.
  2. Change the Send in the Backend output broker to a TrySend and sugar coat it in some back-off logic.

Both of these get the problem application running. The first completes in ~1s, the second in ~0.7s when used independently.
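To illustrate the second option, here is a minimal sketch of a non-blocking send wrapped in back-off logic. It is not the code from the Mothership: `send_with_backoff`, the `try_send` callable, the delay parameters, and the choice to hold the mutex only around each attempt are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// Retry a non-blocking send, holding the shared mutex only for each attempt.
// try_send is a hypothetical callable that returns true once the packet is
// accepted. Returns the attempt number on success, or 0 if attempts ran out.
inline unsigned send_with_backoff(std::mutex& backend_mutex,
                                  const std::function<bool()>& try_send,
                                  unsigned max_attempts,
                                  std::chrono::microseconds initial_delay,
                                  std::chrono::microseconds max_delay) {
    assert(max_attempts > 0);
    std::chrono::microseconds delay = initial_delay;
    for (unsigned attempt = 1; attempt <= max_attempts; ++attempt) {
        {
            std::lock_guard<std::mutex> lock(backend_mutex);
            if (try_send()) {
                return attempt;
            }
        }  // Mutex released here, so an input broker can drain the network.
        if (attempt == max_attempts) {
            break;  // No point sleeping after the final failed attempt.
        }
        std::this_thread::sleep_for(delay);
        delay = std::min(delay * 2, max_delay);  // Exponential, capped.
    }
    return 0;
}
```

The point of the sketch is that a failed attempt returns control to the caller's loop instead of blocking while the mutex is held, which is the condition that allowed the deadlock described above.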

I have some further suggestions to improve this state of things:

More to come.

mvousden commented 3 years ago

Note that #248 addresses the second point in @heliosfa's explanation, and not the first. From testing, we found that the second suggestion alone was sufficient for the case that was causing the trouble, but we should be aware of this issue for the future (which is why it remains open).

heliosfa commented 3 years ago

Multicast the barrier release packets.

I have been digging into this bit a little more to see what we can do. As expected, Hostlink does not expose the local multicast functionality. Hostlink does expose programmable routers, but the documentation on using them is incomplete.

I had a chat with @jordmorr about how he is doing it for Imputation and he helpfully walked me through things. Jordan has done this in raw tinsel at https://github.com/POETSII/jordmorr-tinsel/blob/multtests/apps/imputation/run.cpp

Interesting lines:

This is all done after you have loaded the binaries but before everything is started (with StartAll in Jordan's case).

The issue with using programmable routers is that the multicast record always replaces the least-significant 16 bits of the packet received by the threads with a value specified in the routing record. This will stop us having a single general multicast destination that is usable for other purposes, unless we tweak the packet format to ensure that the device index is in the lowest 16 bits and that we always set it to FF.
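The effect of that overwrite can be modelled with a few bit operations. This is a sketch under stated assumptions: the 32-bit word, the split into an upper field and a 16-bit device index, and the names `make_header`, `apply_multicast_record` and `device_index` are hypothetical and do not reflect the actual Orchestrator packet format or the Tinsel API.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kLow16Mask = 0xFFFFu;

// Build a hypothetical header word: other fields in the upper 16 bits,
// device index in the lowest 16 bits (the layout suggested above).
inline std::uint32_t make_header(std::uint32_t upper_fields,
                                 std::uint16_t dev_index) {
    assert(upper_fields <= kLow16Mask);
    return (upper_fields << 16) | dev_index;
}

// Model of the behaviour described above: the routing record's value
// replaces the least-significant 16 bits; the upper bits pass through.
constexpr std::uint32_t apply_multicast_record(std::uint32_t word,
                                               std::uint16_t record_value) {
    return (word & ~kLow16Mask) | record_value;
}

// Extract the device index as the receiving thread would see it.
constexpr std::uint16_t device_index(std::uint32_t word) {
    return static_cast<std::uint16_t>(word & kLow16Mask);
}
```

With this layout, whatever the sender placed in the low 16 bits is lost in transit, so only the upper fields survive a multicast. That is why the device index is the natural field to sacrifice to a fixed broadcast value.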