mbed os crash on connecting ethernet adapter connect K64

NirSonnenschein commented 6 years ago

Description

Type: Bug
Priority: Major

We see an error when connecting the Ethernet adaptor on a K64F running Mbed OS compiled under ARMCC. This doesn't always happen, but it occurs quite often. When it occurs we see the following print: Thread 00000000 error -4: Parameter error

Bug

Target K64F

Toolchain: ARM (mostly on armcc)

Toolchain version:

mbed-cli version: 5.6 and 5.7

mbed-os sha: (git log -n1 --oneline)

DAPLink version:

Expected behavior: our code creates an Ethernet interface object and calls the connect function. The connect either succeeds or returns an error.

Actual behavior: the test hangs and we see the following print: Thread 00000000 error -4: Parameter error

Steps to reproduce: we run the following code as part of the network initialization for the K64F:

netInterface = new EthernetInterface();
printf("new interface created\r\n");
status = netInterface->connect();
if (NSAPI_ERROR_OK == status)
{
    printf("interface registered : OK \r\n");
}

0xc0170 commented 6 years ago

@NirSonnenschein somewhere in the code osErrorParameter is captured, but it is not clear where it comes from. Were you able to at least find the function that is causing this parameter error? From the code snippet you shared above, I would guess connect(), or is anything missing there?

@kjbracey-arm @SeppoTakalo

kjbracey commented 6 years ago

I assume this is a debug profile build? The message comes from EvrRtxThreadError, which is a hook to catch RTX errors.

If you could stick a breakpoint on that to get a stack backtrace, it would help - I can see about a dozen places that could possibly call it with (NULL, osErrorParameter), and it's not obvious what the culprit would be. Most of them are simply calling functions with a NULL thread pointer, plus a couple of others.

This is during the connect call, right?

NirSonnenschein commented 6 years ago

Hi @0xc0170 and @kjbracey-arm, just to clarify, the error almost definitely happens in the connect call. We have a print before and after the call (in cases of success or failure), and when this happens we don't see any of the prints after connect.

As general background, this happens occasionally on our nightly tests (e.g. the nightly from two nights ago had it, but last night's didn't). The particular configuration which failed in this case was Mbed OS compiled with armcc in debug mode. This issue doesn't seem to reproduce cleanly when testing locally. I'll try this again today; if I'm able to reproduce locally I can try to use a breakpoint. I can also provide the bin / elf for the image in question if that will help.

kjbracey commented 6 years ago

It seems moderately likely it might be the consequence of connection failure - some sort of teardown when giving up not going cleanly. Maybe you could encourage it by provoking a connect failure - yank the cable at the crucial moment...

NirSonnenschein commented 6 years ago

Hi @kjbracey-arm, thanks for the tip. I'll try this if I'm not able to reproduce the issue locally by normal means.

NirSonnenschein commented 6 years ago

I've tried reproducing locally (including disconnecting the Ethernet wire during testing) and so far I have not been able to reproduce the issue. This seems to be more readily reproducible in the Jenkins test environment. When disconnecting the cable during the connect step, the tests halt for a while (presumably waiting for DHCP to complete) and then fail (no crash observed).

kjbracey commented 6 years ago

Any chance it's this bug? https://github.com/ARMmbed/mbed-os/pull/5587

Can't immediately see why we'd hit it, but it is the same error printout.

alekshex commented 6 years ago

Small update: this also happens on GCC ARM (caught in debug): new interface created Thread 0x0 error -4: Parameter error

NirSonnenschein commented 6 years ago

Yes, the issue seems to reproduce more easily in the Jenkins lab environment (it happens there pretty often, but I was not able to reproduce it on the local network).

ryankurte commented 6 years ago

I'm having a similar / possibly the same issue during network stack init with mbed commit 4d81eadb2 using gcc-arm on the EFR32FG12_BRD4254A target. Mentioned in #5579 and manually applied the patch from #5587 with no effect.

Error occurs at rtos/TARGET_CORTEX/rtx5/RTX/Source/rtx_thread.c:1349 in uint32_t svcRtxThreadFlagsSet (osThreadId_t thread_id, uint32_t flags)

    1346      // Check parameters
    1347      if ((thread == NULL) || (thread->id != osRtxIdThread) ||
    1348          (flags & ~((1U << osRtxThreadFlagsLimit) - 1U))) {
B+> 1349        EvrRtxThreadError(thread, osErrorParameter);
    1350        return ((uint32_t)osErrorParameter);
    1351      }

Serial output:

[INFO][brro]: PANID: 691
[INFO][brro]: NET_IPV6_BOOTSTRAP_AUTONOMOUS
[WARN][brro]: Security NOT enabled
[DBG ][core]: NS Root task Init

[DBG ][sck ]: Socket Tasklet Generated
[sck ]: Socket Task
Thread 0x0 error -4: Parameter error

Backtrace:

Breakpoint 3, svcRtxThreadFlagsSet (thread_id=0x0 <osRegisterForOsEvents>, flags=512) at ./mbed-os/rtos/TARGET_CORTEX/rtx5/RTX/Source/rtx_thread.c:1349
(gdb) bt
#0  svcRtxThreadFlagsSet (thread_id=0x0 <osRegisterForOsEvents>, flags=512) at ./mbed-os/rtos/TARGET_CORTEX/rtx5/RTX/Source/rtx_thread.c:1349
#1  0x0004f324 in SVC_Handler () at irq_cm4f.S:59
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) p thread
$9 = (osRtxThread_t *) 0x0 <osRegisterForOsEvents>
(gdb) p *thread
$10 = {id = 0 '\000', state = 0 '\000', flags = 4 '\004', attr = 32 ' ',
  name = 0x53f41 <Reset_Handler> "H\200G\006I\aJ\aK\232B\276\277Q\370\004\vB\370\004\v\370\347\254\367", <incomplete sequence \371\135\100\005>, thread_next = 0x53f6d <WTIMER1_IRQHandler>,
  thread_prev = 0x58c09 <HardFault_Handler()>, delay_next = 0x53f6d <WTIMER1_IRQHandler>, delay_prev = 0x53f6d <WTIMER1_IRQHandler>, thread_join = 0x53f6d <WTIMER1_IRQHandler>,
  delay = 343917, priority = 109 'm', priority_base = 63 '?', stack_frame = 5 '\005', flags_options = 0 '\000', wait_flags = 343917, thread_flags = 343917,
  mutex_list = 0x4f311 <SVC_Handler>, stack_mem = 0x53f6d <WTIMER1_IRQHandler>, stack_size = 343917, sp = 324519, thread_addr = 324535, tz_memory = 343917,
  context = 0x5eb55 <FRC_PRI_IRQHandler>}
(gdb)

It appears something is triggering an SVC call with an invalid thread ID, but I'm not sure how, and I haven't worked out how to catch it prior to execution yet.

kjbracey commented 6 years ago

Ta for the info!

That was enough to pin it down. (Despite the annoyance that debuggers keep failing to get through exception stack frames.)

It's an ordering error in the K64F driver - it's installing its interrupt handler in low_level_init via

ENET_SetCallback(&g_handle, ethernet_callback, netif);

ethernet_callback calls osThreadFlagsSet(k64f_enetdata.thread).

k64f_enetdata.thread isn't initialised until later, so there's a brief window where a receive interrupt can happen and ethernet_callback will use a null thread ID.

This is not terribly harmful, but the "trap errors" thing in the debug build intercepts it, reasonably enough.

Possible fixes:

change the start-up order so the thread is initialised first (should probably kill it again if low_level_init errors)
delay the ENET_SetCallback until after thread init (means you might process packets received during init much later - effectively existing behaviour)
make ethernet_callback check for thread id being NULL (same effect as previous)
don't use thread flags, use event flags, which means the callback can start setting them before the thread is initialised, and the thread will consume as soon as it starts

SeppoTakalo commented 6 years ago

Which Ethernet driver? The LwIP one or the Nanostack one, or both?

That later debug print looks like border router so I'm assuming it is Nanostack's driver or both.

kjbracey commented 6 years ago

Hang on, your #5579 is actually about a Nanostack issue. Not K64F at all. Oh well, you've helped solve this issue.

So it seems that both pieces of code probably have the same flaw - calling osThreadFlagsSet before the thread is ready. Not identified the path to it with Nanostack yet.

ryankurte commented 6 years ago

Yep, yep, different cause but suspect it's the same flaw. I can open another issue if you'd like?

Looks like in NanostackRfPhyEfr32.cpp callbacks are enabled at NanostackRfPhyEfr32.cpp#L374 and the thread isn't started until NanostackRfPhyEfr32.cpp#L468, will have a shot at reordering it and see if that helps.

I wonder what changed that this is now a runtime error / how many other things it is likely to affect.

kjbracey commented 6 years ago

This is only a runtime error with the RTX error trapping on, which is only in debug builds since 5.6 I think, unless that's changed. More people testing debug builds now?

ryankurte commented 6 years ago

The silent / nearly impossible to debug runtime error handling in release builds cost me almost a month of head bashing before I worked out #5155, I wouldn't be surprised at all if / hope that is the case.

ARMmbed / mbed-os

mbed os crash on connecting ethernet adapter connect K64 #5680
