Crash while ping-flooding the Thread interface

pekka-saastamoinen-etteplan commented 6 years ago

The nanostack-border-router can be crashed by pinging it heavily from about 8 Thread nodes. We first noticed this on our fork of the code and decided to test it on the reference hardware as well.

HW: Raspberry Pi 3 + 6lowpan shield

Raspberry Pi image from Mbed access point: https://github.com/ARMmbed/mbed-access-point/blob/master/binaries/openwrt-mbedap-v4.0.1-brcm2708-bcm2710-rpi-3-ext4-sdcard.img.gz

nanostack-border-router binary built from https://github.com/ARMmbed/nanostack-border-router/commit/fa34a9d474adba58ea31752465ba39c6704995d2 with the GCC_ARM toolchain and an almost stock SLIP config (channel, PAN ID and keys matched to the node setup).

Connected 8 Thread nodes. Each node joins the network, waits 30 seconds, and then pings the BR every 50 ms. The BR Mbed app crashes within about 30 seconds.
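To put the load in numbers, the arithmetic below works out the aggregate echo-request rate from the figures quoted above (8 nodes, one ping per 50 ms each, crash within about 30 seconds). It is only a back-of-the-envelope sketch: the function name is made up for illustration, and it assumes every ping actually reaches the BR, which is not guaranteed on a congested radio link.

```python
def aggregate_ping_rate(nodes: int, interval_ms: float) -> float:
    """Echo requests per second arriving at the BR, assuming no loss."""
    return nodes * 1000.0 / interval_ms


# Figures taken from the report above.
NODES = 8
INTERVAL_MS = 50
TIME_TO_CRASH_S = 30  # "within about 30 seconds", so an approximate upper bound

rate = aggregate_ping_rate(NODES, INTERVAL_MS)
pings_before_crash = rate * TIME_TO_CRASH_S

print(f"{rate:.0f} echo requests/s, about {pings_before_crash:.0f} before the crash")
```

So the BR sees on the order of 160 echo requests per second in this setup.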

Crash variant 1:

++ MbedOS Error Info ++
Error Status: 0x80FF013D Code: 317 Module: 255
Error Message: Fault exception
Location: 0x4B5D7
Error Value: 0x3F10
Current Thread: Id: 0x20004048 Entry: 0x14735 StackSize: 0x1800 StackMem: 0x20002848 SP: 0x2002FF40
-- MbedOS Error Info --

Crash Info:
Crash location = mbed::TimerEvent::irq(unsigned long) [0x00003F10] (based on PC value)
Caller location = ticker_irq_handler [0x0004A6B3] (based on LR value)
Stack Pointer at the time of crash = [2002FFC8]

Target and Fault Info:
Processor Arch: ARM-V7M or above
Processor Variant: C24
Forced exception, a fault with configurable priority has been escalated to HardFault
A precise data access error has occurred. Faulting address: 20030008

Crash variant 2:

++ MbedOS Error Info ++
Error Status: 0x80FF013D Code: 317 Module: 255
Error Message: Fault exception
Location: 0x4B5D7
Error Value: 0xC4BF00BC
Current Thread: Id: 0x20004048 Entry: 0x14735 StackSize: 0x1800 StackMem: 0x20002848 SP: 0x2002FF28
-- MbedOS Error Info --

Crash Info:
Crash location = __init_array_end [0xC4BF00BC] (based on PC value)
Caller location = mbed::Timeout::handler() [0x00003E3B] (based on LR value)
Stack Pointer at the time of crash = [2002FFB0]

Target and Fault Info:
Processor Arch: ARM-V7M or above
Processor Variant: C24
Forced exception, a fault with configurable priority has been escalated to HardFault
MPU or Execute Never (XN) default memory map access violation on an instruction fetch has occurred
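The numbers in both dumps can be cross-checked with a few lines of arithmetic. The sketch below splits the error status into its code and module fields and tests whether the reported SP falls inside the stack region of the listed thread. The bit layout (code in the low 16 bits, module in bits 16 to 23) is an assumption based on the Mbed OS error status encoding, and the helper names are made up for illustration; the decoded values can be compared against the `Code:` and `Module:` fields that the dumps print themselves.

```python
# Assumed field layout of an Mbed OS error status word (see lead-in).
CODE_MASK = 0x0000FFFF
MODULE_MASK = 0x00FF0000
MODULE_SHIFT = 16


def decode_status(status: int) -> tuple[int, int]:
    """Return (code, module) extracted from an error status word."""
    code = status & CODE_MASK
    module = (status & MODULE_MASK) >> MODULE_SHIFT
    return code, module


def sp_in_thread_stack(sp: int, stack_mem: int, stack_size: int) -> bool:
    """True if sp lies inside [stack_mem, stack_mem + stack_size]."""
    return stack_mem <= sp <= stack_mem + stack_size


# Values copied from the two crash dumps.
STATUS = 0x80FF013D
STACK_MEM = 0x20002848
STACK_SIZE = 0x1800
SP_VARIANT_1 = 0x2002FF40
SP_VARIANT_2 = 0x2002FF28

code, module = decode_status(STATUS)
print(f"code={code} module={module}")
print(f"thread stack: {STACK_MEM:#x}..{STACK_MEM + STACK_SIZE:#x}")
for name, sp in (("variant 1", SP_VARIANT_1), ("variant 2", SP_VARIANT_2)):
    print(name, "SP inside thread stack:", sp_in_thread_stack(sp, STACK_MEM, STACK_SIZE))
```

In both variants the reported SP lies outside the listed thread's stack region (0x20002848 to 0x20004048). One possible reading, not confirmed by the dumps alone, is that the fault was taken on a different stack, such as the one used for interrupt handling, which would fit the timer and ticker functions named as crash and caller locations.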

Logs for the first variant are attached in the zip: bug_on_arm_hw_and_sw.zip

FYI: @karsev

ciarmcom commented 6 years ago

ARM Internal Ref: IOTTHD-2778

markus-becker-tridonic-com commented 6 years ago

@karsev @MarceloSalazar

MarceloSalazar commented 6 years ago

@pekka-saastamoinen-tridonic-com @markus-becker-tridonic-com Thanks for raising this - let us know once you have an application that can be used to reproduce this behavior. We've been investigating based on the information you've shared, but haven't seen the BR crashing so far. Thanks!

pekka-saastamoinen-etteplan commented 6 years ago

This branch should show the changes to mbed-os we applied to have the nodes ping the BR:

https://github.com/pekka-saastamoinen-tridonic-com/mbed-os/tree/pekka-saastamoinen-tridonic-com-ping

MarceloSalazar commented 6 years ago

@pekka-saastamoinen-tridonic-com what node application and configuration are you referring to? Can you fork the app, make the changes and point us at that, so we can just clone and reproduce the issue?

artokin commented 6 years ago

Not able to reproduce the hard-fault. BR runs out of memory during heavy ping testing but recovers at the end.

artokin commented 6 years ago

Error reproduced, there is a stack overflow in Nanostack code that is causing the hard fault. A fix will be released once it passes testing.

artokin commented 6 years ago

There was a recursive loop in the MAC error handling that caused a stack overflow. The hard fault happened because the Mbed timers were using the corrupted stack.
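The general failure pattern can be illustrated with a small model. This is not the Nanostack code (the fixes live in a private repository), and every name below is hypothetical. It only shows why an error handler that immediately retries the next queued frame makes the call depth grow with the backlog, while a loop that drains the queue keeps the depth constant.

```python
from collections import deque


def drain_recursive(frames) -> int:
    """Error handler re-enters the send path; returns the peak nesting depth."""
    pending = deque(frames)
    stats = {"depth": 0, "max_depth": 0}

    def send_next():
        if pending:
            on_tx_error(pending.popleft())  # model: every transmission fails

    def on_tx_error(frame):
        stats["depth"] += 1
        stats["max_depth"] = max(stats["max_depth"], stats["depth"])
        send_next()  # recursion: the next frame is sent from inside the handler
        stats["depth"] -= 1

    send_next()
    return stats["max_depth"]


def drain_iterative(frames) -> int:
    """Error handler returns to a loop that sends the next frame."""
    pending = deque(frames)
    stats = {"depth": 0, "max_depth": 0}

    def on_tx_error(frame):
        stats["depth"] += 1
        stats["max_depth"] = max(stats["max_depth"], stats["depth"])
        stats["depth"] -= 1  # no re-entry: the handler just returns

    while pending:
        on_tx_error(pending.popleft())
    return stats["max_depth"]


backlog = range(100)  # hypothetical backlog built up by the ping flood
print("recursive peak depth:", drain_recursive(backlog))
print("iterative peak depth:", drain_iterative(backlog))
```

On an embedded target with a fixed-size stack, the first pattern uses stack in proportion to the number of queued frames, which is the kind of growth a ping flood would provoke.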

Two PRs have now been merged to the master branch to remove the recursion: https://github.com/ARMmbed/sal-stack-nanostack-private/pull/1826 and https://github.com/ARMmbed/sal-stack-nanostack-private/pull/1830.

PelionIoT / nanostack-border-router

Crash while ping-flooding the Thread interface #135