Realtek RTL8195AM - CMSIS-RTOS error: ISR Queue overflow (status: 0x2, task ID: 0x0, object ID: 0x30051484)

JanneKiiskila commented 6 years ago

Description

Type: Bug
Priority: Blocker for releasing support for Mbed Cloud Client / mbed-os-example-client

Bug

Target REALTEK_RTL8195AM

Toolchain: GCC_ARM

Toolchain version: gcc_arm from the mbed cli Windows installer 0.43 - same on Linux as well.

mbed-cli version: (mbed --version) 1.2.2

mbed-os sha: (git log -n1 --oneline)

2e1c2a1cd (HEAD -> master, origin/master, origin/feature-lorawan, origin/HEAD) Merge pull request #5538 from geky/littlefs-staging
41591eb83 Merge pull request #5602 from artokin/nanostack_release_v704

DAPLink version:

241 - this one is also a bit old. It would be nice if Realtek could update the official DAPLink image that can be downloaded from their website.

=========================================================

ROM Version: 0.3

Build ToolChain Version: gcc version 4.8.3 (Realtek ASDK-4.8.3p1 Build 2003)

=========================================================
Check boot type form eFuse
SPI Initial
Image1 length: 0x3308, Image Addr: 0x10000bc8
Image1 Validate OK, Going jump to Image1

Expected behavior

mbed-os-example-client can run for a very long time. For example, reference testing was done on K64F, where it works fine.

Actual behavior

...
simulate button_click, new value of counter is 280
simulate button_click, new value of counter is 281
simulate button_click, new value of counter is 282
simulate button_click, new value of counter is 283
CMSIS-RTOS error: ISR Queue overflow (status: 0x2, task ID: 0x0, object ID: 0x30051484)

[mbed_die]  0x0 die here

Steps to reproduce

git clone mbed-os-example-client
modify mbed_app.json with a valid SSID/WIFI-passphrase.
set connectivity method to `WIFI_RTW`
mbed compile -m REALTEK_RTL8195AM -t GCC_ARM

(Have not tried other compilers, though).

JanneKiiskila commented 6 years ago

@tung7970 @Archcady

[Mirrored to Jira]

samchuarm commented 6 years ago

@ARMmbed/team-realtek [Mirrored to Jira]

Archcady commented 6 years ago

looking into this issue, will update ASAP. [Mirrored to Jira]

samchuarm commented 6 years ago

@Archcady [Mirrored to Jira]

Archcady commented 6 years ago

I am trying to test and debug this issue, but it takes a long time to reproduce. Is there a way to make the handle_timer_click trigger faster so that I can test and debug the point where it fails? [Mirrored to Jira]

tung7970 commented 6 years ago

@Archcady You can try to reduce the wait period in main.cpp. Currently it's 25 secs per loop.

updates.wait(25000);

[Mirrored to Jira]

Archcady commented 6 years ago

I updated this to 1000, but it still takes a long time to trigger.

[Mirrored to Jira]

JanneKiiskila commented 6 years ago

We are talking about hours, not days, even with the default value.

[Mirrored to Jira]

Archcady commented 6 years ago

I have debugged this issue. The ISR queue overflow happens as a result of the call to "release()" in Semaphore.cpp, which in turn calls "osSemaphoreRelease". I checked our driver on the Realtek side, and we are not calling "release()" anywhere in our code. It would be great if anyone from the ARM side could help me understand this issue. These are my findings so far.

P.S. We are unable to debug with pyOCD because it gives an error, so debugging is taking an extended amount of time. [Mirrored to Jira]

JanneKiiskila commented 6 years ago

Hei,

Are you 100% sure you are not using semaphores anywhere else? I can see at least these calls using git grep under TARGET_Realtek.

TARGET_AMEBA/sdk/os/rtx2/rtx2_service.c:                osStatus_t status = osSemaphoreRelease(p_sem->id);
TARGET_AMEBA/sdk/os/rtx2/rtx2_service.c:                osStatus_t status = osSemaphoreRelease(p_sem->id);
T

Any imbalance between acquiring and releasing those semaphores could potentially cause an issue, right?

[Mirrored to Jira]

MarceloSalazar commented 6 years ago

@Archcady is there any update on this? [Mirrored to Jira]

samchuarm commented 6 years ago

Hi Marcelo, the Realtek team is still trying to narrow down where the semaphore mismatch might come from. [Mirrored to Jira]

bkht commented 6 years ago

Hi, I have reproduced the same problem. Using the online compiler, I have successfully run "Getting started with mbed Client on mbed OS" on a NUCLEO-F746ZG: https://os.mbed.com/teams/mbed-os-examples/code/mbed-os-example-client/
More info: https://os.mbed.com/questions/80121/mbed-Client-on-mbed-OS-CMSIS-RTOS-error-/

I found that using a library (for a temperature sensor) to get some real-world data causes this problem as soon as that library is called, for example mcp9808.readTemp(). The library works fine in a simple program.

[Mirrored to Jira]

samchuarm commented 6 years ago

So does this mean the ISR queue overflow is not platform dependent? @JanneKiiskila @Archcady any progress on this issue? [Mirrored to Jira]

prashantrar commented 6 years ago

From the Realtek side, we are still debugging the issue, but I'll share some of my findings here in case they provide some pointers.

The ISR queue overflow always happens after a fixed amount of time: it takes approximately 62-63 minutes to occur, every single time.
The issue originates in the function "osRtxPostProcess" in rtx_system.c, where "osRtxErrorNotify" is called and the program terminates:

    void osRtxPostProcess (os_object_t *object) {
      if (isr_queue_put(object) != 0U) {
        if (osRtxInfo.kernel.blocked == 0U) {
          SetPendSV();
        } else {
          osRtxInfo.kernel.pendSV = 1U;
        }
      } else {
        osRtxErrorNotify(osRtxErrorISRQueueOverflow, object);
      }
    }

"osRtxErrorNotify" is called because, just before the crash, inside the function "" the following "if" condition gets executed 16 times:

    if (isr_queue_put(object) != 0U) {
      if (osRtxInfo.kernel.blocked == 0U) {
        SetPendSV();
      }

The "kernel.blocked" check fails, so the same condition is hit 16 times; the ISR queue is defined with a size of 16, and hence the queue overflows.
Surprisingly, if I comment out the call to "osRtxErrorNotify" in the function "osRtxPostProcess", the program runs forever without issues.
Also, if I modify the example code so that the semaphore release is done from a software timer rather than the ticker, this issue doesn't happen. The issue is reproducible only when the ticker is used to release the semaphore.
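
The overflow described in these findings can be modeled on a host machine. What follows is a minimal sketch, not RTX code: `IsrQueueModel`, `put`, `drain`, and `post_without_drain` are hypothetical names chosen for illustration. Only the capacity of 16 and the "0 means the queue is full" return convention are taken from the observations above. It illustrates that 16 back-to-back posts are accepted, and the next one is rejected if nothing drains the queue in between.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side model of a fixed-size post-processing queue.
// Only the capacity (16) and the "0 means full" return convention come
// from the findings above; everything else is illustrative.
class IsrQueueModel {
public:
    static constexpr std::size_t kCapacity = 16;

    // Called from (simulated) interrupt context. Returns 1 if the object
    // was queued, 0 if the queue is already full.
    std::uint32_t put(void *object) {
        assert(object != nullptr);  // precondition of this model only
        if (count_ == kCapacity) {
            return 0U;
        }
        slots_[(head_ + count_) % kCapacity] = object;
        ++count_;
        return 1U;
    }

    // Called from (simulated) thread context: empties the queue and
    // returns how many entries were processed.
    std::size_t drain() {
        const std::size_t processed = count_;
        head_ = (head_ + count_) % kCapacity;
        count_ = 0;
        return processed;
    }

    std::size_t size() const { return count_; }

private:
    std::array<void *, kCapacity> slots_{};
    std::size_t head_ = 0;
    std::size_t count_ = 0;
};

// Posts `n` objects back to back without draining and returns how many
// were rejected. Each rejection corresponds to one overflow error.
inline std::size_t post_without_drain(IsrQueueModel &queue, std::size_t n) {
    static int dummy_object = 0;
    std::size_t rejected = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (queue.put(&dummy_object) == 0U) {
            ++rejected;
        }
    }
    return rejected;
}
```

In this model, calling `post_without_drain(queue, 17)` on an empty queue rejects exactly one post, while calling `drain()` between bursts keeps every post accepted. That is consistent with the overflow appearing only when interrupt-context releases pile up faster than they are processed.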

P.S. I am still debugging the issue; these are just my findings. If anyone from the Arm team can get some pointers from these findings and highlight anything that I am missing, please kindly help out. [Mirrored to Jira]

JanneKiiskila commented 6 years ago

There is something that's board specific (or driver specific) - K64F does not have this issue.

But, clearly it's now something that's impacting more than one board, if this happens also with NUCLEO-F746ZG.

[Mirrored to Jira]

samchuarm commented 6 years ago

Hi @JanneKiiskila, do you think commenting out osRtxErrorNotify in osRtxPostProcess, or switching to a software timer for the semaphore release, could be the fix? [Mirrored to Jira]

JanneKiiskila commented 6 years ago

I will admit my own limited knowledge at this stage and say I don't know. @kjbracey-arm, @geky, @sg- , or other Mbed OS team members would know better.

[Mirrored to Jira]

kjbracey commented 6 years ago

This is something it's quite easy to hit in RTX when using any RTOS operations from interrupt context. The RTOS work is always deferred onto this queue, so if you do 16 consecutive RTOS operations from interrupt before returning to thread context, it overflows.

I've raised one issue here for RTX suggesting how this could be improved, at least for flags. Not sure if the same logic could apply to semaphores. Maybe?

https://github.com/ARM-software/CMSIS_5/issues/283

Pending any RTX improvement, it's usually best to work around the issue by including logic to make sure you don't signal multiple consecutive times from interrupt. Some sort of "pending" flag which is cleared by the person who is monitoring the semaphore.
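
A "pending" flag of this kind could look roughly like the following. It is a host-side sketch under stated assumptions: `CoalescedSignal`, `signal_from_isr`, `acknowledge`, and `count_release` are hypothetical names, and the `release_fn` callback stands in for whatever actually performs the semaphore release (for example, a wrapper around `osSemaphoreRelease`). The interrupt handler issues a release only when no earlier signal is still outstanding, so repeated interrupts cannot queue more than one RTOS operation.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: coalesces repeated interrupt-context signals into a
// single release until the consumer acknowledges it.
class CoalescedSignal {
public:
    using ReleaseFn = void (*)(void *context);

    CoalescedSignal(ReleaseFn release_fn, void *context)
        : release_fn_(release_fn), context_(context) {
        assert(release_fn_ != nullptr);
    }

    // Call from the interrupt handler. Returns true if a release was
    // actually issued, false if one was already pending.
    bool signal_from_isr() {
        if (pending_.exchange(true, std::memory_order_acq_rel)) {
            return false;  // consumer has not handled the previous signal yet
        }
        release_fn_(context_);
        return true;
    }

    // Call from the thread that waits on the semaphore, after it wakes up
    // and before it starts processing, so later interrupts can signal again.
    void acknowledge() {
        pending_.store(false, std::memory_order_release);
    }

private:
    ReleaseFn release_fn_;
    void *context_;
    std::atomic<bool> pending_{false};
};

// Stand-in for the real semaphore release: it only counts invocations.
inline void count_release(void *context) {
    ++*static_cast<int *>(context);
}
```

In the waiting thread, the order would be: wait on the semaphore, call `acknowledge()`, then do the work. Clearing the flag before processing means an interrupt that arrives during processing still produces one more wake-up rather than being lost.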

Do we really have no information about where the interrupt-context semaphore release triggering this is coming from? No backtrace?

[Mirrored to Jira]

samchuarm commented 6 years ago

@prashantrar @ARMmbed/team-realtek [Mirrored to Jira]

prashantrar commented 6 years ago

We are having difficulty taking backtraces because the stack is corrupt the moment the crash happens. The crash always originates from a semaphore release, but beyond that the backtrace cannot point to specific functions and usually just shows " ?? ()". I will try to get proper backtraces once again tomorrow and update this ticket. [Mirrored to Jira]

prashantrar commented 6 years ago

@kjbracey-arm I am updating the latest backtrace with all the latest mbed-os components.

#0  osRtxErrorNotify () at .\mbed-os\rtos\TARGET_CORTEX\mbed_rtx_handlers.c  
#1  0x3001bb74 in isrRtxSemaphoreRelease ()  
at .\mbed-os\rtos\TARGET_CORTEX\rtx5\RTX\Source\rtx_semaphore.c:414  
#2  osSemaphoreRelease ()  
at .\mbed-os\rtos\TARGET_CORTEX\rtx5\RTX\Source\rtx_semaphore.c:461  
#3  0x300193f2 in ticker_irq_handler () at .\mbed-os\hal\mbed_ticker_api.c:  
#4  0x30022ff4 in HalTimerIrq2To7Handle_Patch (Data=<optimized out>)  
at ../../TARGET_Realtek/TARGET_AMEBA/TARGET_RTL8195A/device/rtl8195a_ti  
:45  
#5  0x000035de in ?? ()  
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

[Mirrored to Jira]

0xc0170 commented 6 years ago

@ARMmbed/team-realtek @JanneKiiskila Is this still a blocker, and has the issue not been fixed yet? [Mirrored to Jira]

samchuarm commented 6 years ago

@M-ichae-l , can you confirm if this issue has been addressed? [Mirrored to Jira]

adbridge commented 6 years ago

Internal Jira reference: https://jira.arm.com/browse/IOTPART-5928

MarceloSalazar commented 4 years ago

Closing as target won't be supported in Mbed 6 - https://github.com/ARMmbed/mbed-os/pull/12775

ARMmbed / mbed-os

Realtek RTL8195AM - CMSIS-RTOS error: ISR Queue overflow (status: 0x2, task ID: 0x0, object ID: 0x30051484) #5640
