POETSII / Orchestrator

The Orchestrator is the configuration and run-time management system for POETS platforms.

Softswitch barrier eats non-barrier packets #157

Closed · mvousden closed this 3 years ago

mvousden commented 4 years ago

While the softswitch is waiting at the softswitch barrier, any non-barrier-breaking packet it receives is silently dropped.

This is somewhat desirable: it stops rogue devices from other applications starting our devices out of sequence. However, when an application is spread sparsely over the hardware (POETS/Tinsel), barrier-breaking packets are not consumed at the same time globally. This results in some devices "starting" earlier than others. If a device sends a packet to another device before the recipient has started, that packet is dropped, effectively changing the behaviour of the application. For locally-synchronous applications, this is a disaster.

By way of example, consider:

I ran this example on Heaney, which fails on the first round at device 3.

I make the following change to softswitch_common.cpp:

```diff
@@ -86,17 +92,23 @@
             // softswitch_alive(send_buf);
             // and once it's been received, process it as a startup packet
             softswitch_onReceive(ThreadContext, recv_buf);
             ThreadContext->ctlEnd = 0;
         }
+
+        // <!> If we receive a non-barrier message while we're waiting for a
+        // barrier message, we burn the envelope without opening it. This UART
+        // is the smoke coming out of the chimney.
+        else tinselUartTryPut(170); // "aa"
+
         tinselFree(recv_buf);
```

I observe a single "aa" UART output from the thread hosting the device, after the last device has reported back, demonstrating that a packet has been dropped.

I change my patch to this:

```diff
@@ -42,10 +42,16 @@
     {
         device_init(&ThreadContext->devInsts[device], ThreadContext);
     }
 }

+// <!> You know it's a hack when the "pragma GCC" comes out.
+#pragma GCC push_options
+#pragma GCC optimize ("O0")
+// 150 works consistently as an end-point in my testing. 125 does not.
+void delay(){for (uint32_t i=0; i<150; i++);}
+#pragma GCC pop_options

 /*------------------------------------------------------------------------------
  * softswitch_barrier: Block until told to continue by the mothership
  *----------------------------------------------------------------------------*/
 void softswitch_barrier(ThreadCtxt_t* ThreadContext)
@@ -86,17 +92,23 @@
             // softswitch_alive(send_buf);
             // and once it's been received, process it as a startup packet
             softswitch_onReceive(ThreadContext, recv_buf);
             ThreadContext->ctlEnd = 0;
         }
+
+        // <!> If we receive a non-barrier message while we're waiting for a
+        // barrier message, we burn the envelope without opening it. This UART
+        // is the smoke coming out of the chimney.
+        else tinselUartTryPut(170); // "aa"
+
         tinselFree(recv_buf);
     }
+
+    delay();  // <!>
 }
 //------------------------------------------------------------------------------

-
-
 /*------------------------------------------------------------------------------
  * Two utility inline functions to reduce code duplication when configuring
  * loop execution order.
  *----------------------------------------------------------------------------*/
 inline void receiveInline(ThreadCtxt_t* ThreadContext, volatile void* recvBuffer)
```

and observe that my application completes successfully. This change also "fixes" tasks placed by simulated annealing, in that they now run as intended.

I naturally conclude that a short delay of some kind is needed between "the breaking of the barrier on the softswitch" and "running the application".
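
As an aside, a volatile loop counter would keep the compiler from eliding the loop without the pragma gymnastics; a minimal sketch, where the bound of 150 is just the empirical value from above rather than anything principled:

```c++
#include <cstdint>

// Sketch only: the volatile counter forces the compiler to keep the loop,
// so no "#pragma GCC optimize" is needed. 150 is the empirically-found
// bound from above, not a derived one.
static void delay(void)
{
    for (volatile uint32_t i = 0; i < 150; i++);
}
```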

How best to do this?

NB: I've assigned this to @heliosfa because I want his opinion, more than I necessarily want him to fix it.

NB: My modified softswitch_common.cpp.txt is attached for reference. Sorry for the .txt uploads, but hey, GH.

heliosfa commented 4 years ago

I was wondering whether this was going to turn out to be an issue and, if so, when it would rear its head (I was hoping not until we went multi-box). I am sure I have flagged this as a potential "gotcha" before.

A delay is one approach, but it may end up not being scalable in future, especially with multi-box. I would also expect us to need delays on the order of ms to s rather than µs.
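
If we did settle on a delay at that sort of scale, counting cycles would at least make the duration explicit rather than tuned. A rough sketch, assuming tinselCycleCount() is available to the softswitch and with the cycles-per-ms figure as a placeholder rather than a measured value:

```c++
#include <cstdint>
#include <tinsel.h>  // assumed to provide tinselCycleCount()

// Placeholder: derive from the real core clock, not this guess.
constexpr uint32_t CYCLES_PER_MS = 250000;  // assumes a 250MHz clock

// Busy-wait for roughly ms milliseconds (ms * CYCLES_PER_MS must fit in
// 32 bits). Unsigned subtraction copes with counter wrap-around.
static void delayMs(uint32_t ms)
{
    uint32_t start = tinselCycleCount();
    while (tinselCycleCount() - start < ms * CYCLES_PER_MS);
}
```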

Another possible option (if and when timeout support arrives) would be to use tinselWaitUntil(TINSEL_CAN_RECV) to block until a timeout fires or a packet arrives (the latter hopefully being an indicator that something else has already timed out).
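
To make that concrete, something like the sketch below; tinselWaitUntilTimeout() is invented here as a stand-in for whatever the timeout support ends up looking like, and does not exist today:

```c++
#include <cstdint>
#include <tinsel.h>  // for TINSEL_CAN_RECV

// Hypothetical: an invented stand-in for future timeout support. It does
// not exist today; only the blocking tinselWaitUntil() does. Assume it
// returns true when a packet can be received, false if the timeout fired.
bool tinselWaitUntilTimeout(int cond, uint32_t timeoutCycles);

// Block at the barrier waiting for the next packet, but give up after
// timeoutCycles, assuming everything else has already left the barrier.
static bool waitForBarrierPacket(uint32_t timeoutCycles)
{
    return tinselWaitUntilTimeout(TINSEL_CAN_RECV, timeoutCycles);
}
```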

A third option is to not throw away unexpected messages if they are for the correct task (still throw away things for a different task), and to buffer them instead. When the barrier is released, the unexpected messages can be replayed. The problem with this approach is the size and location of the buffer: we don't have dynamic memory allocation, so it will have to be a fixed size somewhere. Unless we do something nasty and punt the unexpected messages back into the network, possibly on a circuitous route...
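
A sketch of the shape it might take, with the stash sizes invented for illustration and the softswitch_onReceive() signature guessed from the call site in softswitch_barrier():

```c++
#include <cstdint>
#include <cstring>

struct ThreadCtxt_t;  // defined in the softswitch headers
// Guessed signature, from the call in softswitch_barrier() above.
void softswitch_onReceive(ThreadCtxt_t* ctx, volatile void* pkt);

constexpr unsigned STASH_SLOTS = 8;   // fixed size: no dynamic allocation
constexpr unsigned SLOT_BYTES  = 64;  // invented packet-size bound

static uint8_t  stash[STASH_SLOTS][SLOT_BYTES];
static unsigned stashUsed = 0;

// Call from the barrier loop instead of silently dropping the packet.
static void stashPacket(const volatile void* pkt, unsigned len)
{
    if (stashUsed < STASH_SLOTS && len <= SLOT_BYTES)
        std::memcpy(stash[stashUsed++], const_cast<const void*>(pkt), len);
    // else: the stash is full and we are dropping packets again, which is
    // exactly the sizing problem described above.
}

// Call once the barrier breaks: replay stashed packets in arrival order.
static void replayStash(ThreadCtxt_t* ctx)
{
    for (unsigned i = 0; i < stashUsed; i++)
        softswitch_onReceive(ctx, stash[i]);
    stashUsed = 0;
}
```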

Worth chatting about on Wednesday?

mvousden commented 4 years ago

> A delay is one approach, but it may end up not being scalable in future, especially with multi-box. I would also expect us to need delays on the order of ms to s rather than µs.

I don't understand why a delay doesn't scale. The delay would only happen at the start of the application, and would certainly be tiny compared to the duration of the application. It's a function of the backend, to be sure, but this is a softswitch-level change. Granted, you can never be sure what the duration of the delay should be to avoid dropping packets (that's the barrier problem), but there is no solution to that issue.

> Another possible option (if and when timeout support arrives) would be to use tinselWaitUntil(TINSEL_CAN_RECV) to block until a timeout fires or a packet arrives (the latter hopefully being an indicator that something else has already timed out).

I don't understand how this helps.

> A third option is to not throw away unexpected messages if they are for the correct task (still throw away things for a different task), and to buffer them instead. When the barrier is released, the unexpected messages can be replayed. The problem with this approach is the size and location of the buffer: we don't have dynamic memory allocation, so it will have to be a fixed size somewhere. Unless we do something nasty and punt the unexpected messages back into the network, possibly on a circuitous route...

This sounds horribly complicated, and offers little over the delay approach.

Worth chatting about on Wednesday?

Yes, let's.

mvousden commented 3 years ago

Resolved by the above commit, which has been merged in.