SerialP corrupts packets under heavy load (breaks interlayer signalling)

GoogleCodeExporter commented 8 years ago


This problem is present in the tinyos svn core as of 9/30/2010, rev 5194 
(git://hinrg.cs.jhu.edu/git/tinyos-2.x.svn, 562c381)

Under heavy receive loads the serial stack can corrupt and/or lose incoming 
packets.

SerialP is part of the tinyOS serial stack.  It collects bytes from underlying 
layers and signals the next higher layer when packets start, on byte 
availability, and packet end.

The signals for packet start and end are critical to the correct functioning of 
the next higher layer as these events are used for controlling buffer 
allocation and mutual exclusion of that packet data.  Allocation, mutual 
exclusion protection, and data storage are performed by the higher layer which 
in this case is SerialDispatcherP.

SerialDispatcherP implements a two slot (double buffer) receive path.  It 
allocates (locks) a slot (and its associated buffer) when a startPacket is 
received and it hands the buffer to the next higher layer for processing when 
it receives an endPacket(SUCCESS).  It unlocks and recycles a buffer when it 
receives an endPacket(FAIL).  This makes the buffer available for future 
incoming data.

For proper functioning, interlayer signaling should only allow sequences that 
have a successful startPacket followed by an endPacket.   endPacket without a 
preceeding valid startPacket are illegal and cause the upper layer to misbehave.

Due to problems with SerialP's implementation, solitary endPackets can be 
generated.  This can occur in the following cases:

1) Invalid serial protocol.  The first byte following the HDLC start of packet 
delimiter is the low level serial protocol.  Anything other than 
SERIAL_PROTO_PACKET_ACK will cause an endPACKET(FAIL) to be signalled.

2) If startPacket() (the upper layer signal) returns FALSE, endPACKET(FAIL) is 
signalled.

3) At the termination of a valid packet, endPacket(SUCCESS) immediately 
followed by endPacket(FAIL) is signalled.

(Note: #3 above has been fixed by a fix from Phil Levis, rev 5204, hinrg git: 
c143598ae)

This problem will only cause a failure iff there are two buffers at the 
SerialDispatcherP layer waiting to be delivered.  This delivery is done via a 
task.  This situation will be more likely to occur if there is heavy receive 
traffic and/or heavy task loading (delaying the processing of incoming packets).

When the faulty endPacket(FAIL) occurs, the result is a valid buffer is freed 
and SerialDispatcherP will now consider it available for any new incoming 
packets.  The failure scenerio will only be seen if new data does arrive.   One 
of the following will be seen:

1) no problem.   No new data has come in before the packet has been processed 
by the delivery task and the intended receipient.  Full processing has to be 
completed before new incoming data shows up.

2) New incoming data has started and the packet buffer is partially 
overwritten.   Beginning of the buffer is the new packet while the trailer is 
the old packet.

3) New incoming data completes and the buffer now contains the state for the 
new packet.  The previous packet is lost.

Note that in all of these cases, the dispatcher task has been launched and the 
buffer that is now marked free will be delivered when the dispatcher task runs. 
 What will be delivered is indeterminate and depends on external arrival of 
incoming data.

What steps will reproduce the problem?

Demonstrating this problem is difficult because it is a series of race 
conditions.  

Heavy receive traffic is needed.  When the original error occurs it opens a 
window that closes when a packet has completed processing.  During the window 
time, new receive data will cause the problem to manifest.

Heavy task loading will delay the delivery of a completed buffer which causes 
the window to stay open longer.

Actual failures have been observed in a system being debugged using JTAG and 
GDB using a hand built scenerio.

Original issue reported on code.google.com by cire...@gmail.com on 17 Oct 2010 at 11:43

GoogleCodeExporter commented 8 years ago

Attached is the proposed fix.

This fix closes all holes that could cause erroneous endPacket signals from 
being generated.  startPacket is only issued on the transition to SEQNO and 
endPacket is only signalled if the state machine is in either SEQNO or INFO.

This code has been tested in its corner cases using gdb/jtag on a msp430f2618 
based mote.  

SerialP needs to be used with the modified SerialDispatchP that is part of 
Issue 2.  Both files need to be replaced to fix bugs in both that can cause 
packet lossage.

Original comment by cire...@gmail.com on 20 Oct 2010 at 10:02

Attachments:

SerialP.nc

GoogleCodeExporter commented 8 years ago

[deleted comment]

GoogleCodeExporter commented 8 years ago

Generally, it's bad to use enums as a type (e.g. rxstate_t). The C compiler 
defaults enums to ints; if the int is 16 bits, you waste a byte. These are the 
kinds of little stylistic changes, which, when mixed in with functional 
changes, lead code to bitrot.

Same as the reversal of the RXSTATE_NOSYNC case. Is there a functional problem 
here, or a stylistic one? The difference seems to be return vs. break. Why 
reverse the statement? Was there a problem?

The RXSTATE_PROTO has a similar reversal, although now it looks functionally 
different. Generally, the entire receive switch has been rewritten. It's not 
clear which of the changes are intended to fix bugs 1 and 2, and which are just 
stylistic. I'd feel much more comfortable if we had a version which fixes the 
bugs, then one which stylistically refactors the code. In the former, it's 
possible to through thought experiments and testing to determine if the bug is 
fixed. In the latter, it's possible through thought experiments and testing to 
determine if the code is equivalent. Since the two are mixed, it's very hard to 
do either.

Why is #2 a bug? Is the desired protocol that endPacket is only paired on a 
successful startPacket? The text above seems to suggest that this should be the 
case, as do the interface comments. But #2 suggests there shouldn't be. Which 
is right?

#1 seems like a definite bug, which requires only a tiny change (the case in 
RXSTATE_PROTO not going to nosync but rather resetting the state machine). Why 
is this not sufficient?

Original comment by philip.l...@gmail.com on 27 Oct 2010 at 9:50

GoogleCodeExporter commented 8 years ago

With respect to the use of ENUMs:

It is not stylistic sugar but rather an issue of trade offs.

On the MSP430 the toolchain allocates the native size which is 16 bits.  The 
trade off is code vs. ram.

If an ENUM is used it uses an extra byte of RAM but generates better code.

Using an uint8_t on the other hand saves the byte of RAM but everywhere the 
object is referenced extra code is generated that maps the 8 bit datum to the 
underlying machine size.  (in this case 16 bits).

So the question is where does one want to burn the memory, code or data?  1 
extra byte of RAM vs. multiple code bytes.

An associated impact is how does the object interact with symbolic debuggers 
and debugging this code.  The majority of time getting code working properly is 
debugging and symbolic debuggers greatly aid in that.  Using enums works better 
with the debugger and is symbolic while uint8_t is just a number.

Original comment by cire...@gmail.com on 28 Oct 2010 at 5:03

GoogleCodeExporter commented 8 years ago

Same as the reversal of the RXSTATE_NOSYNC case. Is there a functional problem 
here, or a stylistic one? The difference seems to be return vs. break. Why 
reverse the statement? Was there a problem?

----------------

The rx_state_machine was rewritten to make it simpler and easier to understand. 
 Being easy to understand and clean to look at makes it much more likely that 
the state machine will do the right thing.  If it is easy to look at and 
understand it should be easy to see its correctness.  That was the thinking 
anyway.

States that are complete immediately return as opposed to breaking and then 
flowing through the end of the code either via the break or a goto.    States 
that needed to abort would do so via a break and the abort code immediately 
followed the switch.

This seems to be much cleaner, easier to understand.  Okay subjective.

The reversal of the RXSTATE_NOSYNC case is stylistic.  Minimize indent coupled 
with early processing if possible.

Original comment by cire...@gmail.com on 28 Oct 2010 at 5:47

GoogleCodeExporter commented 8 years ago

1) Invalid serial protocol.  The first byte following the HDLC start of packet 
delimiter is the low level serial protocol.  Anything other than 
SERIAL_PROTO_PACKET_ACK will cause an endPACKET(FAIL) to be signalled.

---

#1 seems like a definite bug, which requires only a tiny change (the case in 
RXSTATE_PROTO not going to nosync but rather resetting the state machine). Why 
is this not sufficient?

+++++++++++++++

The intent of the label "nosync:" and the goto nosync is to reset the state 
machine.  The label "nosync" is a place in the code that forces the rx state 
machine to the NOSYNC state while also resetting whatever is necessary to fully 
reset the machine.  Counters and other state variables.  It also tells the 
underlying layer to reset.

So the intent is indeed to reset the state machine in a well defined common 
manner.

One could set the rx_state to RXSTATE_NOSYNC but this special cases things 
without clearing out underlying state variables.  Is that what you meant by 
"resetting the state machine"?

Seemed like having a consistent mechanism for "processing complete" and for 
"abort" was a good idea.

+++++++++++++++

Due to problems with SerialP's implementation, solitary endPackets can be 
generated.  This can occur in the following cases:

2) If startPacket() (the upper layer signal) returns FALSE, endPACKET(FAIL) is 
signalled.

---

Why is #2 a bug? Is the desired protocol that endPacket is only paired on a 
successful startPacket? The text above seems to suggest that this should be the 
case, as do the interface comments. But #2 suggests there shouldn't be. Which 
is right?

+++++++++++++++

Yes, endPacket should only be allowed to follow an startPacket that returns 
SUCCESS.   #2 violates this and is a bug.   The endPacket(FAIL) is signalled 
and is a bug.

Premise is startPacket returns FALSE, ==> upper layer buffer is locked.   Then 
SerialP signals endPacket(FAIL), this causes the current buffer to be unlocked 
and made available for a future receive.

Because this buffer was originally locked, it was handed off to the 
receiveTask.  The unlock violates mutual exclusion of the packet data.

That is why #2 is a bug.  Is there some interaction with SerialDispatcherP such 
that the premise would be invalid?

Original comment by cire...@gmail.com on 28 Oct 2010 at 6:31

GoogleCodeExporter commented 8 years ago

[deleted comment]

GoogleCodeExporter commented 8 years ago

Can we close on this? Eric, your most recent emails suggest that fixing the 
start/end problem is sufficient.

Original comment by philip.l...@gmail.com on 13 Nov 2010 at 1:12

Changed state: Started

GoogleCodeExporter commented 8 years ago

This issue can be closed.   The analysis of the problem involved a premise that 
isn't true:

"This problem will only cause a failure iff there are two buffers at the 
SerialDispatcherP layer waiting to be delivered."

SerialDispatcherP does not allow for two packets to be pending at the same time 
rendering the analysis moot.

Original comment by cire...@gmail.com on 13 Nov 2010 at 1:42

GoogleCodeExporter commented 8 years ago

Original comment by philip.l...@gmail.com on 13 Nov 2010 at 2:22

Changed state: Fixed

135u5 / tinyos-main

SerialP corrupts packets under heavy load (breaks interlayer signalling) #1