EFR32: Mutex error in nanostack with EFR32 Phy

ryankurte commented 7 years ago

Description

Type: Bug & (Enhancement & Question)
Priority: In the way of our using mbed-os

Calling the mesh.connect() method causes Mutex 0x2000157c error -6: Not allowed in ISR context.

I'm working on porting mbed-os and getting the network stack working on EFR32FG devices, based on the existing EFR32MG support, but having some troubles with ISRs trying to use the platform critical section mutex in what appears to be the precompiled network stack.

Bug / Question

Target EFR32FG12

Toolchain: GCC_ARM

Toolchain version: arm-none-eabi-gcc (GNU Tools for ARM Embedded Processors 6-2017-q2-update) 6.3.1 20170620 (release) [ARM/embedded-6-branch revision 249437]

mbed-cli version: 1.2.0

mbed-os version: mbed-os-5.4.0-rc1-2137-g9b01a67

Expected behavior Building with nanostack should hopefully work.

Actual behavior Error in mutex access from ISR in NanostackRfPhyEfr32.cpp caused by underlying call to platform_enter_critical embedded in libnanostack.a.

Causes error output: Mutex 0x2000157c error -6: Not allowed in ISR context when compiled in debug mode, and nasty startup failures if not.

Steps to reproduce Build and run ryankurte/node-mbed on an EFR32FG12 target (make f to use JLink adaptor as board does not have mbed file system support).

Optionally add a breakpoint instruction __asm__("#BKPT 01") here to assist with catching the error.

Run debug server and gdb with make ds and make d.

Details

Building with the debug profile (mbed-cli compile --profile mbed-os/tools/profiles/debug.json)

Main looks like:

LoWPANNDInterface mesh;
NanostackRfPhyEfr32 rf_phy;

int main(int argc, char **argv) {
  int res;

  printf("\nInitialising mbed-os\n");

  res = mesh.initialize(&rf_phy);
  if (res < 0) {
    printf("Mesh error: %d\n", res);
    led2 = 1;
  }

  printf("Mesh initialised, connecting...\n");

  res = mesh.connect();
  if (res < 0) {
    printf("Connect error: %d\n", res);
    led2 = 1;
  }

  printf("Awaiting connection...\n");

  while (1) {
    led1 = !led1;
    wait(0.5);
  }
}

Which gives the output:

Initialising mbed-os
Mesh initialised, connecting...
Mutex 0x2000157c error -6: Not allowed in ISR context

Dropping a n __asm__("BKPT #X); into error_msg in rtx_handlers.c to catch that error gives me the following stack trace.

   │73      static const char* error_msg(int32_t status)                                                                                             │
   │74      {                                                                                                                                        │
  >│75          __asm__("BKPT #9");                                                                                                                  │
   │76          switch (status) {

#0  error_msg (status=-6) at ./mbed-os/rtos/TARGET_CORTEX/mbed_rtx_handlers.c:75
#1  0x0000b93e in EvrRtxMutexError (mutex_id=0x2000157c <critical_mutex>, status=-6) at ./mbed-os/rtos/TARGET_CORTEX/mbed_rtx_handlers.c:116
#2  0x0000e2aa in osMutexAcquire (mutex_id=0x2000157c <critical_mutex>, timeout=4294967295) at ./mbed-os/rtos/TARGET_CORTEX/rtx5/rtx_mutex.c:461
#3  0x00002508 in platform_enter_critical () at ./mbed-os/features/FEATURE_COMMON_PAL/nanostack-hal-mbed-cmsis-rtos/arm_hal_interrupt.c:30
#4  0x0003687c in timer_mac_stop ()
#5  0x00035e2e in mac_data_interface_tx_done_cb ()
#6  0x000360fa in mac_pd_sap_data_cb ()
#7  0x0001af60 in sw_mac_net_phy_tx_done ()
#8  0x00008930 in RAILCb_TxPacketSent (txPacketInfo=0x2003ff94)
    at ./mbed-os/features/nanostack/FEATURE_NANOSTACK/targets/TARGET_SL_RAIL/NanostackRfPhyEfr32.cpp:706
#9  0x00016be8 in pktTxDoneEvt ()
#10 0x0001868e in GENERIC_PHY_FRC_IRQCallback ()
#11 <signal handler called>
#12 0x0000cc4c in __svcKernelStart () at ./mbed-os/rtos/TARGET_CORTEX/rtx5/rtx_kernel.c:79
#13 osKernelStart () at ./mbed-os/rtos/TARGET_CORTEX/rtx5/rtx_kernel.c:554
#14 0x0000b66e in mbed_start_main () at ./mbed-os/rtos/TARGET_CORTEX/mbed_boot.c:332
#15 0x0000b766 in software_init_hook () at ./mbed-os/rtos/TARGET_CORTEX/mbed_boot.c:533
#16 0x000001a6 in _start ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
(gdb)

It appears to me that the EFR32 nanostack PHY TX complete callback is calling the nanostack virtual mac, which is then calling the underlying platform_enter_critical function, which attempts to grab the critical section mutex and fails because it's from an ISR. Which afaik seems like a sensible thing to do with a tx complete interrupt.

As far as I can tell, the issue is kindof a duplicate of mbed-os-example-client#284 mbed-os-example-client#303, #4834 and #4904 but as applied to libnanostack which as a closed source blob we cannot reasonable fix :-/

Any ideas on how to solve this? I am working on an update to 5.5, but it appears this issue exists across mbed-os versions. One of the other issues drops the mutex attempt from the driver, except it's not the driver layer that is attempting it in this case. Other option I have come up with is to implement a radio thread to wrap the interrupt based phy, but it seems like a bit of a bodge.

Also as a sidenote, this error appearing as a hardfault / causing startup failure in non-debug builds is (imo) very difficult to debug, it might be interesting to wrap this error in a more useful manner for other build types as well as debug.

Thanks,

Ryan

ryankurte commented 7 years ago

On further investigation it seems that the NCS36510 driver avoids this by running a thread to delegate calls to the underlying driver, which is a little more difficult (lots of data to communicate back from interrupts) but probably achievable. It might also be interesting to make the mac callbacks thread / interrupt safe to avoid the need for this.

Implementing the bare minimum to continue the choose your own adventure nets me:

#0  mesh_system_heap_error_handler (event=NS_DYN_MEM_ALLOCATE_SIZE_NOT_VALID)
    at ./mbed-os/features/nanostack/FEATURE_NANOSTACK/mbed-mesh-api/source/mesh_system.c:58
#1  0x00002ff4 in heap_failure (book=0x20003084 <app_stack_heap>, reason=NS_DYN_MEM_ALLOCATE_SIZE_NOT_VALID)
    at ./mbed-os/features/FEATURE_COMMON_PAL/nanostack-libservice/source/nsdynmemLIB/nsdynmemLIB.c:64
#2  0x00003212 in convert_allocation_size (book=0x20003084 <app_stack_heap>, requested_bytes=65462)
    at ./mbed-os/features/FEATURE_COMMON_PAL/nanostack-libservice/source/nsdynmemLIB/nsdynmemLIB.c:174
#3  0x0000329a in ns_mem_internal_alloc (book=0x20003084 <app_stack_heap>, alloc_size=65462, direction=-1)
    at ./mbed-os/features/FEATURE_COMMON_PAL/nanostack-libservice/source/nsdynmemLIB/nsdynmemLIB.c:210
#4  0x00003490 in ns_mem_alloc (heap=0x20003084 <app_stack_heap>, alloc_size=65462)
    at ./mbed-os/features/FEATURE_COMMON_PAL/nanostack-libservice/source/nsdynmemLIB/nsdynmemLIB.c:300
#5  0x000034da in ns_dyn_mem_alloc (alloc_size=65462)
    at ./mbed-os/features/FEATURE_COMMON_PAL/nanostack-libservice/source/nsdynmemLIB/nsdynmemLIB.c:310
#6  0x00034d96 in mac_mlme_data_base_allocate ()
#7  0x0001b3ea in ns_sw_mac_create ()
#8  0x00006b7c in nd_tasklet_network_init (device_id=0 '\000')
    at ./mbed-os/features/nanostack/FEATURE_NANOSTACK/mbed-mesh-api/source/nd_tasklet.c:431
#9  0x00005f68 in LoWPANNDInterface::init (this=0x20007954 <mesh>)
    at ./mbed-os/features/nanostack/FEATURE_NANOSTACK/mbed-mesh-api/source/LoWPANNDInterface.cpp:67
#10 0x00005ea6 in LoWPANNDInterface::connect (this=0x20007954 <mesh>)
    at ./mbed-os/features/nanostack/FEATURE_NANOSTACK/mbed-mesh-api/source/LoWPANNDInterface.cpp:27
#11 0x000156aa in main (argc=0, argv=0x0 <osRegisterForOsEvents>) at ./source/main.cpp:34
(gdb)

Which looks a tad like the stack is trying to allocate 65K to ndp, off to work out how / where to configure that.

0xc0170 commented 7 years ago

cc @SeppoTakalo @stevew817 @akselsm

betzw commented 7 years ago

Seems as if in both this issue and in issue #303 for mbed-os-example-client the problem is due to calling platform_enter_critical() in interrupt context. Unfortunately, the two call stacks which lead to calling platform_enter_critical() in interrupt context are completely different (seems even that they have nothing in common until platform_enter_critical() ) . The call stack trace for issue #303 can be found here.

ryankurte commented 7 years ago

So as a follow up to that issue, have poorly hacked the driver to pass signals back to a thread and juggled stack sizes and we're all working.

@betzw it just comes back to not being able to use mutexes in ISRs I guess, as a function of mbed-os not supporting nested interrupts, which is fine if documented / not needed / not used.

Needs to be polished / PR'd, but the solution is in sight. It would be super cool if nanostack was ok with being called from an ISR, and also if it didn't crash horribly if you do do it in any mode but debug, but we're on the way up ^_^

SeppoTakalo commented 7 years ago

Nanostack cannot be OK when called from ISR context as it used mutexes for protection. We don't have interrupt safe mutexes in Mbed OS.

Requirement for using the worker thread is documented in Nanostack's driver API https://docs.mbed.com/docs/arm-ipv66lowpan-stack/en/latest/driver_api/

stevew817 commented 7 years ago

A fix for this is in the works. I'll PR it when ready.

danielklioc commented 7 years ago

Getting same error on NUCLEOF767ZI

stevew817 commented 7 years ago

@danielklioc Please open a new ticket for NUCLEO since this is a platform-specific issue. NUCLEOF767ZI doesn't contain a radio, so it is highly unlikely it is the same root cause.

ARMmbed / mbed-os