contiki-os / contiki

The official git repository for Contiki, the open source OS for the Internet of Things
http://www.contiki-os.org/

CC2650 - Bad CRC after specific packet transfer count #1175

Open bkozak-scanimetrics opened 9 years ago

bkozak-scanimetrics commented 9 years ago

Hello everyone,

This is a somewhat complicated problem with many details so I apologize in advance if my explanation is difficult to follow.

I have observed a bug with the CC2650 (srf06 board) where the radio reports several consecutive bad CRCs after approximately 130 packets have been exchanged following a device reset (usually after exactly 130 TCP packets, and always close to 130). After the problem occurs once, it does not seem to occur again until the device is reset, but it very consistently occurs after ~130 packets are sent and received. This behaviour occurs with ContikiMAC and a channel check rate of 8 Hz and does not occur if radio duty cycling is disabled.
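To pin down the "about 130 packets, then several bad CRCs in a row" pattern across resets, it can help to record when the first bad CRC appears and how long the longest run is. The following is a minimal sketch, not Contiki code: `crc_burst_tracker`, `crc_tracker_reset` and `crc_tracker_on_frame` are hypothetical names, and it assumes the driver can call a hook once per received frame with a flag saying whether the CRC was valid. It counts received frames only, so the number it reports will not necessarily equal the TCP packet count quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical tracker; none of these names exist in Contiki. */
struct crc_burst_tracker {
  unsigned long good_since_reset; /* frames received with a valid CRC */
  unsigned long first_bad_at;     /* good-frame count when the first bad CRC appeared */
  unsigned current_run;           /* length of the bad-CRC run in progress */
  unsigned longest_run;           /* longest run of consecutive bad CRCs so far */
  int seen_bad;                   /* non-zero once any bad CRC has been recorded */
};

/* Call once at boot, i.e. after every device reset. */
void
crc_tracker_reset(struct crc_burst_tracker *t)
{
  assert(t != NULL);
  t->good_since_reset = 0;
  t->first_bad_at = 0;
  t->current_run = 0;
  t->longest_run = 0;
  t->seen_bad = 0;
}

/* Call once per received frame; crc_ok is non-zero for a valid CRC. */
void
crc_tracker_on_frame(struct crc_burst_tracker *t, int crc_ok)
{
  assert(t != NULL);
  if(crc_ok) {
    t->good_since_reset++;
    t->current_run = 0; /* a good frame ends any run of bad CRCs */
    return;
  }
  if(!t->seen_bad) {
    t->seen_bad = 1;
    t->first_bad_at = t->good_since_reset;
  }
  t->current_run++;
  if(t->current_run > t->longest_run) {
    t->longest_run = t->current_run;
  }
}
```

Printing `first_bad_at` and `longest_run` after each test run would show whether the first burst really lands at a fixed frame count after reset.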

To produce this bug I am starting a TCP upload from a CC2650 device through a border-router (therefore packets are both sent and received by the uploading node). After ~130 TCP packets are sent (and 130 TCP ACKs received) I see several bad CRCs in a row on the uploading node and bad CRC packets continue to be seen very frequently for some time after. The uploader will be unable to receive TCP ACKs throughout this event. If I turn off the bad CRC filter and examine the packet, the bad packet will usually (>95% of the time) have a length of 40 (I am sending packets of length 48) and will start with the same 4 bytes. Furthermore, when I watch the communications with a sniffer I can see no sign of any actual corrupted packets.
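The observation that most bad frames share a length and the same first 4 bytes can be checked mechanically by tallying a "signature" for each frame that fails the CRC. The sketch below is one way to do that under stated assumptions: `sig_table`, `sig_table_record` and `sig_table_most_common` are hypothetical helpers, not part of Contiki or the TI driver, and it assumes the bad-CRC filter has been turned off so the raw frame bytes and length are available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SIG_PREFIX_LEN 4  /* compare the first 4 bytes, as in the report */
#define MAX_SIGS 16       /* arbitrary table size for this sketch */

/* Hypothetical signature of a bad-CRC frame: its length and leading bytes. */
struct frame_sig {
  uint8_t len;
  uint8_t prefix[SIG_PREFIX_LEN];
  unsigned count;
};

struct sig_table {
  struct frame_sig sigs[MAX_SIGS];
  size_t used;
  unsigned dropped; /* frames not recorded because the table was full */
};

void
sig_table_init(struct sig_table *t)
{
  assert(t != NULL);
  memset(t, 0, sizeof(*t));
}

/* Record one bad-CRC frame. Returns the updated count for its signature,
 * or 0 if the signature is new and the table is already full. */
unsigned
sig_table_record(struct sig_table *t, const uint8_t *frame, uint8_t len)
{
  uint8_t prefix[SIG_PREFIX_LEN] = {0};
  size_t n = len < SIG_PREFIX_LEN ? len : SIG_PREFIX_LEN;
  size_t i;

  assert(t != NULL && frame != NULL);
  memcpy(prefix, frame, n); /* short frames are zero-padded */

  for(i = 0; i < t->used; i++) {
    if(t->sigs[i].len == len &&
       memcmp(t->sigs[i].prefix, prefix, SIG_PREFIX_LEN) == 0) {
      return ++t->sigs[i].count;
    }
  }
  if(t->used == MAX_SIGS) {
    t->dropped++;
    return 0;
  }
  t->sigs[t->used].len = len;
  memcpy(t->sigs[t->used].prefix, prefix, SIG_PREFIX_LEN);
  t->sigs[t->used].count = 1;
  t->used++;
  return 1;
}

/* Return the most frequent signature, or NULL if nothing was recorded. */
const struct frame_sig *
sig_table_most_common(const struct sig_table *t)
{
  const struct frame_sig *best = NULL;
  size_t i;

  assert(t != NULL);
  for(i = 0; i < t->used; i++) {
    if(best == NULL || t->sigs[i].count > best->count) {
      best = &t->sigs[i];
    }
  }
  return best;
}
```

If one signature dominates the table (for example, length 40 with a fixed prefix), that supports the view that the bad frames are not random corruption.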

After spending a lot of effort debugging this issue I am beginning to suspect that this may be an issue with the RF module and not with Contiki, but I need to know if anyone else is seeing something similar.

If anyone is interested, I can provide code that can be used to reproduce this problem.

Thanks.

g-oikonomou commented 9 years ago

Never seen this. Does it happen if you slow the rate down? Does it happen over UDP?

bkozak-scanimetrics commented 9 years ago

> Does it happen if you slow the rate down?

I take this to mean the CHANNEL_CHECK_RATE. After slowing down from 8 to 4 I still get this issue and, although you didn't ask, it still occurs after speeding up to 16. In each case the event still occurs at around 130 TCP packets exchanged (the radio counters will have around 0x8D nRxData and 0x8D nRxAck if I print from RX_NOK).
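Because the counters above are quoted in hex, it can be easier to compare them with the packet count if they are printed in both hex and decimal (0x8D is 141 in decimal). The sketch below assumes a hypothetical `rx_counters` snapshot holding copies of the two values; the struct, its field names and widths, and `rx_counters_format` are illustrative and are not the radio's actual output structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical copy of two receive counters; field widths are an assumption. */
struct rx_counters {
  uint8_t n_rx_data; /* copy of the nRxData value */
  uint8_t n_rx_ack;  /* copy of the nRxAck value */
};

/* Format the snapshot as hex and decimal. Returns snprintf()'s result:
 * the number of characters that the full string requires. */
int
rx_counters_format(char *buf, size_t size, const struct rx_counters *c)
{
  assert(buf != NULL && c != NULL);
  return snprintf(buf, size, "nRxData=0x%02X (%u) nRxAck=0x%02X (%u)",
                  (unsigned)c->n_rx_data, (unsigned)c->n_rx_data,
                  (unsigned)c->n_rx_ack, (unsigned)c->n_rx_ack);
}
```

Logging this string from the point where the bad CRC is reported would make it straightforward to compare runs.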

> Does it happen over UDP?

I don't know about UDP yet, but you gave me the idea to just try pinging through the border-router, which unfortunately does not seem to reproduce the problem. I suspect that it could take me some time to find a way to reproduce this with just UDP packets, but I'll see if I can come up with something.

Unfortunately it's looking like this issue is going to be a little more elusive than I was hoping.

g-oikonomou commented 9 years ago

> I take this to mean the CHANNEL_CHECK_RATE

I was actually referring to slowing down the TCP upload at the TCP (or application) layer. Not the RDC channel check rate.

bkozak-scanimetrics commented 9 years ago

> I was actually referring to slowing down the TCP upload at the TCP (or application) layer. Not the RDC channel check rate.

I've throttled the packet rate down to 1 every half second and still see the problem.

bkozak-scanimetrics commented 9 years ago

I've tried to reproduce this with a simple ping-pong application using UDP and also using Rime, but I'm still not seeing the same problem as with TCP. It would seem that the conditions under which this bug occurs are very particular.

The problem still occurs every time I use TCP.

bkozak-scanimetrics commented 8 years ago

I've just created GitHub repos for a server (for Linux) and a client which can be used to reproduce this problem.

So far, all of my tests have used the CC2650EM. Although I haven't tried it, I don't see why this shouldn't happen on the SensorTag as well (and the software does work with the SensorTag). Note also that I have conducted most of these tests in a 2-node network. As far as I know, adding more nodes could give you a different result (but I kind of doubt it).

To reproduce the bug you will need to:

  1. Compile the rpl-border-router example for CC2650 and connect to it via tunslip6 (LBR might work as well but I haven't tried this).
  2. Enable debugging in rf-core.c so you can see the "Bad CRC" messages (optional).
  3. Compile the tpt client and flash it to a CC2650. By default it runs a TCP-based throughput test.
  4. Compile the server.
  5. Disable or reconfigure your firewall (if necessary) to accept connections through port 3000 via the tunslip6 connection.
  6. Execute the server as ./bin/testServer -ts --port=3000
  7. Press the select button on the CC2650 EM to start the test.

Output from the test server will show the test progress. After approximately 130 packets you will see the progress stop for a few seconds. If you've enabled debugging in rf-core.c you will see many Bad CRC messages at this time. This same problem will probably repeat itself a few times throughout the test (with varying degrees of severity) but will then not happen again until the client is reset (at least as far as I know).

While this issue does not happen every time, it should happen almost every time.