RIOT-OS / RIOT

RIOT - The friendly OS for IoT
https://riot-os.org
GNU Lesser General Public License v2.1
4.92k stars 1.99k forks source link

LoRaWan node ISR stack overflowed #14962

Open MyCodeMyMoney opened 4 years ago

MyCodeMyMoney commented 4 years ago

Description

LoRaWan node ISR stack overflowed, when receiving error

Steps to reproduce the issue

Set the LoRaWan node to sleep mode, wake up the node every 20s, join and send data after wake-up, and then sleep. If the node repeats this process for a long time, When LoRaWan receives error, it will cause ISR stack overflowed

Actual results

This is the context information after node failure

[semtech-loramac] get dr
[semtech-loramac] MCPS request: confirmed TX
[semtech-loramac] MCPS request: OK
[semtech-loramac] Transmission completed
[semtech-loramac] MAC timer timeout
[semtech-loramac] MAC timer timeout
[semtech-loramac] RX started
[semtech-loramac] RX CRC error

Context before hardfault:
r0: 0x00000033
r1: 0x20012eab
r2: 0x200032c0
r3: 0x00000000
r12: 0x0493e000
lr: 0x08005347
pc: 0x0800538c
psr: 0x01000000

FSR/FAR:
CFSR: 0x00008200
HFSR: 0x40000000
DFSR: 0x00000000
AFSR: 0x00000000
BFAR: 0x20012ea9
Misc
EXC_

Cause Analysis

I found that in the following code, the length of the array LoRaMacRxPayload[LORAMAC_PHY_MAXPAYLOAD] is 255, but the parameter size may have an error value greater than 255

static uint8_t LoRaMacRxPayload[LORAMAC_PHY_MAXPAYLOAD];
...
static void OnRadioRxDone( uint8_t *payload, uint16_t size, int16_t rssi, int8_t snr )
{
    ...
        micRx |= ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN];
        micRx |= ( ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN + 1] << 8 );
        micRx |= ( ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN + 2] << 16 );
        micRx |= ( ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN + 3] << 24 );
}

So I added the following code, when the function is called.

if (size > 255)
{
    size = 255
}

But when I continued to test under the above conditions, I found that when running the test for a long time, my LoRaWan node sometimes fell into an infinite loop. The abnormal phenomenon is that the sleep mode cannot be entered again, and there is no debugging information output.

Versions

RIOT Branch

2020.01-branch

Operating System Environment

     Operating System: "Ubuntu" "20.04.1 LTS (Focal Fossa)"
               Kernel: Linux 5.4.0-45-generic x86_64 x86_64
         System shell: /usr/bin/dash (probably dash)
         make's shell: /usr/bin/dash (probably dash)

Installed compiler toolchains

           native gcc: gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0
    arm-none-eabi-gcc: arm-none-eabi-gcc (GNU Tools for Arm Embedded Processors 9-2019-q4-major) 9.2.1 20191025 (release) [ARM/arm-9-branch revision 277599]
              avr-gcc: missing
     mips-mti-elf-gcc: missing
           msp430-gcc: missing
 riscv-none-embed-gcc: missing
 xtensa-esp32-elf-gcc: missing

xtensa-esp8266-elf-gcc: missing clang: missing

Installed compiler libs

 arm-none-eabi-newlib: "3.1.0"
  mips-mti-elf-newlib: missing

riscv-none-embed-newlib: missing xtensa-esp32-elf-newlib: missing xtensa-esp8266-elf-newlib: missing avr-libc: missing (missing)

Installed development tools

               ccache: missing
                cmake: missing
             cppcheck: missing
              doxygen: missing
                  git: git version 2.25.1
                 make: GNU Make 4.2.1
              openocd: missing
               python: Python 3.8.2
              python2: Python 2.7.18rc1
              python3: Python 3.8.2
               flake8: 3.8.3 (mccabe: 0.6.1, pycodestyle: 2.6.0, pyflakes: 2.2.0) CPython 3.8.2 on
           coccinelle: missing
aabadie commented 4 years ago

Cause Analysis

The code you are mentioning is from Loramac-node, right ?

So I added the following code, when the function is called.

Where do you add that code ? in $RIOTBASE/pkg/semtech-loramac/contrib/semtech-loramac.c ? It seems to me that this is a bug in the loramac-node code. Note that RIOT is still using version 4.4.1 of this code and the package needs some refresh (the current version is 4.4.4 from may 2020).

MyCodeMyMoney commented 4 years ago

Where do you add that code ? in $RIOTBASE/pkg/semtech-loramac/contrib/semtech-loramac.c ?

I did add the code in $RIOTBASE/pkg/semtech-loramac/contrib/semtech-loramac.c

It seems to me that this is a bug in the loramac-node code. Note that RIOT is still using version 4.4.1 of this code and the package needs some refresh (the current version is 4.4.4 from may 2020).

Can I use LoRaMac-node version 4.4.4 code in RIOT?I modified PKG_VERSION to 4.4.4, but the compilation failed.The branch I use is 2020.01-branch.

aabadie commented 4 years ago

Can I use LoRaMac-node version 4.4.4 code in RIOT?I modified PKG_VERSION to 4.4.4, but the compilation failed.The branch I use is 2020.01-branch.

Adapting the package to loramac-node 4.4.4 is quite some work since this project doesn't use a standard numbering for their release (4.4.1 => 4.4.2 reorganizes a lot of files, refactor some large part of the code, etc). So they are not patch versions...

MyCodeMyMoney commented 4 years ago

Adapting the package to loramac-node 4.4.4 is quite some work since this project doesn't use a standard numbering for their release (4.4.1 => 4.4.2 reorganizes a lot of files, refactor some large part of the code, etc). So they are not patch versions...

Since my node needs to enter the sleep mode during the working process, I cannot solve the above problems through watchdog. Do you have any good suggestions for the information I provided above?

aabadie commented 4 years ago

What is the platform you are using ? STM32,SAM0 ?

MyCodeMyMoney commented 4 years ago

What is the platform you are using ? STM32,SAM0 ?

I am using stm32l151cc platform

aabadie commented 4 years ago

@fjmolinas has a good knowledge on L1 so maybe we has an intuition on what could be the cause ?

fjmolinas commented 4 years ago

2020.01-branch

Can you test in a newer branch to see if the issue is still present?

But when I continued to test under the above conditions, I found that when running the test for a long time, my LoRaWan node sometimes fell into an infinite loop. The abnormal phenomenon is that the sleep mode cannot be entered again, and there is no debugging information output.

Can you elaborate?

I found that in the following code, the length of the array LoRaMacRxPayload[LORAMAC_PHY_MAXPAYLOAD] is 255, but the parameter size may have an error value greater than 255

That means the issue is upstream right? @aabadie you mention updating the pkg is not a short term option, can you confirm this bug? Should we provide a patch in the meantime?

MyCodeMyMoney commented 4 years ago

Can you test in a newer branch to see if the issue is still present?

I tested 2020.01-branch and 2020.04-branch, and they both have the same problem.

Can you elaborate?

There are the following fragments in /bin/pkg/l151-lorawan/semtech-loramac/src/mac/LoRaMac.c. This function calls LoRaMacRxPayload[size-LORAMAC_MFR_LEN], there is a hidden danger of crossing the boundary, because the length of the array definition is 255

define LORAMAC_PHY_MAXPAYLOAD 255

... static uint8_t LoRaMacRxPayload[LORAMAC_PHY_MAXPAYLOAD]; ... static void OnRadioRxDone( uint8_t *payload, uint16_t size, int16_t rssi, int8_t snr ) { ... micRx |= ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN]; micRx |= ( ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN + 1] << 8 ); micRx |= ( ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN + 2] << 16 ); micRx |= ( ( uint32_t )LoRaMacRxPayload[size - LORAMAC_MFR_LEN + 3] << 24 ); }

There are the following fragments in $RIOTBASE/pkg/semtech-loramac/contrib/semtech-loramac.c The function OnRadioRxDone is called here, and the length passed to LoRaMacRxPayload [LORAMAC_PHY_MAXPAYLOAD] is still at risk of greater than the maximum length of 255.I found this problem in the actual application caused ISR stack overflowed error in my code.But when I limit the length of len to 255 here, when an exception occurs, my code seems to fall into an infinite loop, unable to enter sleep mode, and no debugging information is output.

static void _semtech_loramac_event_cb(netdev_t *dev, netdev_event_t event) { netdev_lora_rx_info_t packet_info; msg_t msg; msg.content.ptr = dev switch (event) { ... case NETDEV_EVENT_RX_COMPLETE: { size_t len; uint8_t radio_payload[SX127X_RX_BUFFER_SIZE]; len = dev->driver->recv(dev, NULL, 0, 0); dev->driver->recv(dev, radio_payload, len, &packet_info); semtech_loramac_radio_events.RxDone(radio_payload, len, packet_info.rssi, packet_info.snr); break; } ... }

jia200x commented 4 years ago

@MyCodeMyMoney

Could you please try the following patch (untested)?

diff --git a/pkg/semtech-loramac/contrib/semtech_loramac.c b/pkg/semtech-loramac/contrib/semtech_loramac.c
index 0cd3a46b4f..368287f517 100644
--- a/pkg/semtech-loramac/contrib/semtech_loramac.c
+++ b/pkg/semtech-loramac/contrib/semtech_loramac.c
@@ -539,10 +539,16 @@ static void _semtech_loramac_event_cb(netdev_t *dev, netdev_event_t event)

         case NETDEV_EVENT_RX_COMPLETE:
         {
-            size_t len;
+            int len;
             uint8_t radio_payload[SX127X_RX_BUFFER_SIZE];
-            len = dev->driver->recv(dev, NULL, 0, 0);
-            dev->driver->recv(dev, radio_payload, len, &packet_info);
+            len = dev->driver->recv(dev, radio_payload, SX127X_RX_BUFFER_SIZE, &packet_info);
+            if (len < 0) {
+                /* the RxError call will be handled by the NETDEV_EVENT_CRC_ERROR called inside the recv function */
+                break;
+            }
             semtech_loramac_radio_events.RxDone(radio_payload,
                                                 len, packet_info.rssi,
                                                 packet_info.snr);

Let me know if that works.

jia200x commented 4 years ago

I actually found the issue. We have 2 problems: 1) The NETDEV_EVENT_CRC_ERROR is called inside the dev->driver->recv() function of netdev. As expected, the netdev recv function is called on NETDEV_EVENT_RX_COMPLETE (see here)

The semtech pkg expects the NETDEV_EVENT_RX_COMPLETE event to be "a frame was received and the CRC was OK", and the NETDEV_EVENT_CRC_ERROR to be "a frame was received but the CRC failed. So, the glue code should either call semtech_loramac_radio_events.RxDone or semtech_loramac_radio_events.RxError. Calling both might have undefined consequences.

2) The recv function return -EBADMSG after a CRC error. This means, the size_t len variable will get very high value that will make the buffer overflow.

Overall, the snippet in https://github.com/RIOT-OS/RIOT/issues/14962#issuecomment-689641486 should solve the issue. For mid term, we should probably move out the NETDEV_EVENT_CRC_ERROR from the recv function.

jia200x commented 4 years ago

Still, the RxDone function of the current semtech pkg is unsafe as hell... It might be possible to send a malformed packet and make it crash if the first byte is a JoinAccept frame and the CRC still passes... Maybe this was fixed in the upstream

MyCodeMyMoney commented 4 years ago

@jia200x I tested this patch file with two boards and they worked for about four days. One of them worked fine, but the other one was working abnormally. The exception log is as follows, LoRaMac repeatedly prints [semtech-loramac] MAC timer timeoutforever.

[sx127x] Set sleep [sx127x] Set op mode: SLEEP [sx127x] Change state: IDLE [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [sx127x] Set channel: 486700000 [sx127x] set modem: 1 [sx127x] already using modem: 1 [sx127x] Set freq hop: 0 [sx127x] Set bandwidth: 0 [sx127x] Set coding rate: 1 [sx127x] Set spreading factor: 7 [sx127x] Set CRC: 1 [sx127x] Set freq hop: 0 [sx127x] Set Hop period: 0 [sx127x] Set fixed header length: 0 [sx127x] Set IQ invert: 0 [sx127x] Set payload len: 0 [sx127x] Set power: 17 [sx127x] Set preamble length: 8 [sx127x] Set RX single: 0 [sx127x] Set TX timeout: 3000000 [sx127x] Set max payload len: 22 [sx127x] Set payload len: 22 [sx127x] Set standby [sx127x] Set op mode: STANDBY [sx127x] Change state: IDLE [sx127x] Change state: TX [sx127x] Set op mode: TRANSMITTER [sx127x] Change state: IDLE [sx127x] Set sleep [sx127x] Set op mode: SLEEP [sx127x] Change state: IDLE [sx127x] Set sleep [sx127x] Set op mode: SLEEP [sx127x] Change state: IDLE [semtech-loramac] Transmission completed [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [sx127x] Set channel: 507100000 [sx127x] set modem: 1 [sx127x] already using modem: 1 [sx127x] Set bandwidth: 0 [sx127x] Set spreading factor: 7 [sx127x] Set coding rate: 1 [sx127x] Set preamble length: 8 [sx127x] Set fixed header length: 0 [sx127x] Set payload len: 0 [sx127x] Set CRC: 0 [sx127x] Set freq hop: 0 [sx127x] Set Hop period: 0 [sx127x] Set IQ invert: 1 [sx127x] Set symbol timeout: 114 [sx127x] Set RX single: 1 [sx127x] Set max payload len: 235 [sx127x] Set RX timeout: 3000000 [sx127x] Set RX [sx127x] Change state: RX [sx127x] Set op mode: RECEIVER SINGLE [semtech-loramac] RX started [sx127x] Change state: IDLE [sx127x] Set sleep [sx127x] Set op mode: SLEEP [sx127x] Change state: IDLE [semtech-loramac] MAC timer timeout [semtech-loramac] MCPS indication event [semtech-loramac] MCPS indication msg received [semtech-loramac] MCPS indication Proprietary [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout [semtech-loramac] MAC timer timeout

jia200x commented 4 years ago

@jia200x I tested this patch file with two boards and they worked for about four days. One of them worked fine, but the other one was working abnormally. The exception log is as follows, LoRaMac repeatedly prints [semtech-loramac] MAC timer timeout forever.

It seems the MAC got stuck when receiving a Proprietary frame (or something went wrong that made the stack believe that there was a Proprietary frame). I will dig deeper.

FYI, RIOT also has it's own implementation of LoRAWAN (GNRC LoRaWAN). @MichelRottleuthner has been running a low power application for several months with that one. Maybe you could try that one in the meantime? Check examples/gnrc_lorawan.

Anyway, I will investigate this problem.

MyCodeMyMoney commented 4 years ago

@jia200x Okay, if you make any progress, I would be happy to assist you with the test.

aabadie commented 4 years ago

@jia200x any progress on this one ?

jia200x commented 3 years ago

We should still keep this one open

jia200x commented 3 years ago

@jia200x any progress on this one ?

Not yet, I will try to go back to this ASAP

maribu commented 1 year ago

Ping?