Closed andyturk closed 5 years ago
Hi @andyturk,
thanks for reporting this issue.
osRtxThreadListPut
is an internal function and calling it with a thread already in the list should never happen. For sure one could catch that but not without additional costs.
We need to analyse this issue to track it down to the root cause. Is it feasible for you to provide a reproducer example?
Thanks again, Jonatan
At the moment, I'm only able to reproduce it inside the application I'm working on, but if I can reduce it to a smaller example, I'll definitely share that.
I instrumented osRtxThreadListPut
to break instead of going into an infinite loop. In this particular example, the call came from osRtxTick_Handler
where it deals with the round robin implementation:
if (osRtxKernelGetState() == osRtxKernelRunning) {
thread = osRtxInfo.thread.ready.thread_list;
if ((thread != NULL) && (thread->priority == osRtxInfo.thread.robin.thread->priority)) {
osRtxThreadListRemove(thread);
osRtxThreadReadyPut(osRtxInfo.thread.robin.thread); <<<<< this call hangs
osRtxThreadSwitch(thread);
osRtxInfo.thread.robin.thread = thread;
osRtxInfo.thread.robin.tick = osRtxInfo.thread.robin.timeout;
}
}
The immediately preceeding call is to remove the currently running thread from the ready list, and then replace it with the next thread on the round robin list. At least, I think that's what's going on (please correct me!).
It's possible for a thread to be the currently running thread (aka osRtxInfo.thread.ready.thread_list
) and also be at the head of the round robin queue (aka osRtxInfo.thread.robin.thread
). At least, that's the situation when my application hangs. Here's what the global osRtxInfo
struct looks like after the hang:
osRtxInfo.thread.ready.thread_list
has the same value as osRtxInfo.thread.robin.thread
: 0x20005AFC, which is also the thread passed to the hanging call to osRtxThreadListPut
. Is that expected?
Thanks for your additional investigation.
osRtxInfo.thread.ready.thread_list
is the list of threads ready to run. The currently running thread osRtxInfo.thread.run.curr
should not be part of the ready list.
We need some time to investigate in your observations.
Jonatan,
Yes, osRtxInfo.thread.run.curr
is part of the ready list when my application hangs:
Here's the call stack:
@andyturk,
Thanks. That looks wrong to me. We need to trace this back to where the lists are messed up.
During normal operation you should be able to observe the following during context switch:
thread.run.next
.thread.run.curr
to thread.run.next
.At some point in time this seems to go wrong.
@andyturk
The effect that you are seeing might be the result of a configuration issue.
Is it possible to get a bit information about your target system:
@ReinhardKeil
This code is running on nRF52.
The tick source is RTC1, and yes, we are using "tick-less" mode.
There is a regular peripheral interrupt which is setting an event flag, and a thread which processes the data. Compiled with debug enabled, that thread takes longer than the round-robin threshold to run and that's where things start to go sideways.
RTC1's interrupt priority is 7. The peripheral with the regular interrupt is also at 7, as are a number of other peripherals. A UART (for a command-line shell) is using priority 5.
Nordic's softdevice is present but not enabled when the kernel hangs.
In case it's relevant, here's the "guts" of the tickless implementation (in C++):
namespace {
inline void wait_for_event() {
// this sequence puts the core to sleep!
__SEV();
__WFE();
#if defined(SOFTDEVICE_PRESENT)
sd_app_evt_wait();
#else
__WFE();
#endif
}
bool consider_debugger_presence = 1;
inline bool is_debugger_attached() {
return consider_debugger_presence &&
(CoreDebug->DHCSR & CoreDebug_DHCSR_C_DEBUGEN_Msk);
}
} // namespace
// http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/Cihhjgdh.html
inline void send_event_on_irq_pend(bool value) {
if (value) {
SCB->SCR |= SCB_SCR_SEVONPEND_Msk;
} else {
SCB->SCR &= ~SCB_SCR_SEVONPEND_Msk;
}
}
void CustomerKernel::idle() {
constexpr uint32_t FPU_EXCEPTION_MASK = 0x9f;
while (1) {
volatile uint32_t ticks_to_sleep = suspend();
volatile uint32_t idle_entry = rtc_.get_counter();
if (ticks_to_sleep > 1) {
rtc_.set_deadline_ticks_from_now(ticks_to_sleep);
// 2017-09-23 AKT
//
// The system is about to go to sleep here. The desired
// behavior is that the system will wake from sleep the
// specified number of ticks later *OR* whenever an
// interrupt occurs.
//
// The SEVONPEND mechanism sets an "event" bit in the processor
// whenever an interrupt is pended within the NVIC. This happens
// regardless of whether the interrupt is enabled. I.e., the system
// will sleep as long a no interrupts are pending.
//
// The floating point interrupt is cleared here to ensure
// that a math exception won't prevent the system from
// entering sleep mode.
__set_FPSCR(__get_FPSCR() & ~(FPU_EXCEPTION_MASK));
(void)__get_FPSCR();
NVIC_ClearPendingIRQ(FPU_IRQn);
send_event_on_irq_pend(true);
uint32_t error = sd_app_evt_wait();
APP_ERROR_CHECK(error);
} else {
do {
// nothing
} while (rtc_.get_counter() == idle_entry);
}
volatile uint32_t idle_exit = rtc_.get_counter();
volatile uint32_t ticks_actually_slept =
(idle_exit - idle_entry) & RTC::RESOLUTION;
resume(ticks_actually_slept);
}
}
The problem could be caused by the ISR priority of RTC which is the RTX timer source in your case. This ISR needs to be lowest priority.
We clarified this lately under https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/DoxyGen/RTOS2/src/cmsis_os2.txt - Using Interrupts on Cortex-M.
Can you change priority and check if this solves the issue.
Sorry accidentally closed
The (new) documentation has the following table:
| Handler | Priority | Interrupt/Exception |
|---------+----------+------------------------------------------------------------------------------------------------|
| SysTick | lowest | Kernel system timer interrupt to generate periodic timer ticks |
| PendSV | lowest | PendSV (request for system-level service) when calling certain RTX functions from Handler mode |
| SCV | lowest+1 | Supervisor Call used to enter the RTOS kernel from Thread mode |
The meaning of "lowest" here is potentially ambiguous due to the way Cortex-M priorities work. I.e., an interrupt with a lower numerical value (as stored in the register) will preempt an interrupt of a higher priority. My interpretation of the table is that SysTick and PendSV should not preempt any other interrupt, and that SVCall should preempt those two.
Is that correct?
For my specific case, the nRF52 family implements only 3 bits of interrupt priority, so the hardware supports 0-7. The Nordic soft device further restricts priorities that are available to applications. SVCall is essentially hardwired to priority 4. Application interrupts are restricted to use priorities [2,3,5,6,7].
So having a Cortex-M priority of 4 for SVCall (as required by Nordic) is compatible with RTX5 if the PendSV and the tick interrupts are numerically higher values. This symptom reported here originally occurred with priorities: SVCall=4, PendSV=7 and tick (RTC1)=7.
I also tried configuring the system so that SVCall=4, PendSV=7 and tick (RTC1)=6--the idea being to ensure that PendSV will never preempt any other interrupt. The kernel still hangs in this case too.
The Interrupt Priority table interpretation should be: -SVC must not be preempted by SysTick or PendSV -SysTick and PendSV must not preempt each other
Basically SysTick and PendSV should have the same priority (typically the lowest in the system) where SVC should have at least one level higher priority. Note: lowest priority means the highest number.
The mentioned configuration “SVCall=4, PendSV=7 and tick (RTC1)=7 “ should be OK.
Would you be able to provide a reproducer example?
Would you be able to provide a reproducer example?
I can try, but it might take a few days since I'm traveling now. Is the nRF52840 DK acceptable as a hw platform?
Let's first try to see the Events generated by RTX. Maybe this will tell more.
Could you add the Event Recorder Visibility to your project? See: http://www.keil.com/pack/doc/CMSIS/RTOS2/html/cre_rtx_proj.html#cre_rtx_proj_er
and then capture the Events. see: http://www.keil.com/support/man/docs/uv4/uv4_db_dbg_evr_view.htm
Our build uses GCC and Makefiles, not Keil (apologies to @ReinhardKeil). Maybe I could get the event data out via Segger's RTT? But that would be a project too.
@andyturk maybe you can describe the threads that are running on your system a bit?
It seems that you use RTX 5.1.0 (about 2 years old). Since then several issues are fixed. See: http://arm-software.github.io/CMSIS_5/RTOS2/html/rtos_revisionHistory.html
Can you update the kernel and retest?
@andyturk Reproducer on nRF52840 DK would be fine.
@RobertRostohar @ReinhardKeil
An update...
I've updated the RTOS to 5.3.0 (master branch). The same symptom is repdroducible on 5.3. I've also plumbed in Segger's SystemView to help show what's happening as the system executes. To review, the symptom is that the kernel hangs (variously) in loops that insert a thread into a sorted list of other threads. Typically, these lists are the ready list or the delay list.
Creating a stand-alone "reproducer" looks like a lot of work, since I'd need to replicate the tickless implementation without our RTOS abstraction layer (which sit on stop of CMSIS RTOS, which sits on top of RTX, ...). I'm hoping the job will be easier when I can actually determine what's causing the problem. I suspect this is a misconfiguration/misuse of something outside of RTX, but can't verify that yet. For now, all I know is the kernel gets into an infinite loop easily.
Reinhard asked about thread priorities:
There are 4 application threads. "charger" is at priority 32, while "blinky", "shell", and "wc" are all at 24. ISR 45 produces data for the "wc" thread. ISR 33 is the OS tick timer, ISR 19 is I2C, and ISR 18 is the UART used by the "shell" thread.
I've added assertions in the kernel itself to catch these loops as they're about to happen. Here's an example:
/// Insert a Thread into the Delay list sorted by Delay (Lowest at Head).
/// \param[in] thread thread object.
/// \param[in] delay delay value.
static void osRtxThreadDelayInsert (os_thread_t *thread, uint32_t delay) {
os_thread_t *prev, *next;
if (delay == osWaitForever) {
prev = NULL;
next = osRtxInfo.thread.wait_list;
while (next != NULL) {
prev = next;
next = next->delay_next;
assert(next != prev); <<<<<<<<<<<<<<<<<<<< assertion added here
}
thread->delay = delay;
thread->delay_prev = prev;
thread->delay_next = NULL;
if (prev != NULL) {
prev->delay_next = thread;
} else {
osRtxInfo.thread.wait_list = thread;
}
} else {
prev = NULL;
next = osRtxInfo.thread.delay_list;
while ((next != NULL) && (next->delay <= delay)) {
delay -= next->delay;
prev = next;
next = next->delay_next;
assert(next != prev); <<<<<<<<<<<<<<<<<<<< assertion added here
}
thread->delay = delay;
thread->delay_prev = prev;
thread->delay_next = next;
if (prev != NULL) {
prev->delay_next = thread;
} else {
osRtxInfo.thread.delay_list = thread;
}
if (next != NULL) {
next->delay -= delay;
next->delay_prev = thread;
}
}
}
Maybe a picture is worth 1k words:
This particular run had round-robin scheduling disabled (OS_ROBIN_ENABLE
= 0). When round-robin is enabled, the system still hangs, but in a different place: osRtxThreadListPut.
Interrupt priorities:
> irq
Reset = -3
NonMaskableInt = -2
HardFault = -1
MemoryManagement = 0
BusFault = 0
UsageFault = 0
SVCall = 4
PendSV = 7
SysTick = 0
POWER/CLOCK = 4
UART0 = 5
SPI0/TWIM0 = 6
GPIOTE = 6
RTC0 = 0
RTC1 = 6
COMP/LPCOMP = 6
SWI2_EGU2 = 6
SWI5_EGU5 = 4
MWU = 1
In this case, the system runs fine for a while, with the "wc" thread processing data. The thread basically waits for an EventFlags object to signal that data is ready. The end of the trace shows the assertion failure in osRtxThreadDelayInsert. Note: "Idle" in the screen shot is intended to represent time between osKernelSuspend()/osKernelResume(). The "Scheduler" context is inferred somehow by SystemView and might not be correct. But the ISR activity is straightforward and the context switching in the trace comes directly from EvrRtxThreadSwitched
in RTX.
This is turning out to be a thorny problem, and I'd appreciate any advice or suggestions on how to determine the root cause.
@andyturk
Andy, thanks for all the details. We will analyze and get back to you.
The root cause of the issue could be the interrupt priority of SVC. All user interrupts (that interact with the RTOS) should have higher priority (= lower value in the register). This seems not the case. Can you double check?
@andyturk, another configuration makes me wondering. I guess RTC1 is used for the tick, not SysTick? RTC1 has priority 6 but PendSV has 7. Does RTC1 interrupt preempt PendSV? That must not happen.
@ReinhardKeil
I can still reproduce the hang if all the application interrupts that interract with the RTOS have a higher priority than SVC. In my case, that's UART0, SPI0/TWIM0, GPIOTE (which I forgot earlier) and COMP/LPCOMP. RTC1 is the tick interrupt. At runtime, the irqs are set as follows:
> irq
Reset = -3
NonMaskableInt = -2
HardFault = -1
MemoryManagement = 0
BusFault = 0
UsageFault = 0
SVCall = 4
PendSV = 7
SysTick = 0
POWER/CLOCK = 4
UART0 = 2
SPI0/TWIM0 = 3
GPIOTE = 3
RTC0 = 0
RTC1 = 7
COMP/LPCOMP = 3
SWI2_EGU2 = 6
SWI5_EGU5 = 4
MWU = 1
It hangs in osRtxThreadDelayInsert
:
The IRQ relationships are codified as follows:
// Scheduler IRQ requirements
static_assert(APP_HIGH_IRQ_PRIORITY < APP_LOW_IRQ_PRIORITY,
"Higher priorities have lower numbers on Cortex-M");
// from Robert Rostohar
static_assert(!(TICK_IRQ_PRIORITY < SVC_IRQ_PRIORITY) &&
!(PENDSV_IRQ_PRIORITY < SVC_IRQ_PRIORITY),
"SVC must not be preempted by SysTick or PendSV");
static_assert(TICK_IRQ_PRIORITY == PENDSV_IRQ_PRIORITY,
"SysTick and PendSV must not preempt each other");
static_assert(!(PENDSV_IRQ_PRIORITY < APP_LOW_IRQ_PRIORITY),
"PendSV must not preempt other interrupts");
// from Reinhard Keil
static_assert(APP_LOW_IRQ_PRIORITY < SVC_IRQ_PRIORITY,
"Application interrupts that use the RTOS must preempt SVC");
@JonatanAntoni Yes, I'd tried earlier to make PendSV lower than any of the other interrupts. This time, however, PendSV and the tick interrupt have the same priority to meet Rostohar #2.
@andyturk we did review, but we are unable to spot the root cause based on your reports.
Would it be possible to get a reproducer? You may send this via email to cmsis@arm.com.
Thanks Reinhard
All,
we discussed the effects in more details and came up with an interesting fact about the device in question.
The nRF is driven by a Soft Device that encapsulates certain proprietary communication stacks, like Bluetooth. This part of the final application software takes direct control of the interrupt and exception handlers (like SVC). Hence the actual SVC handler used by RTX5 is only called indirectly. This might have unforeseen impact on the call stack. RTX5 has strong assumptions about the stack frame when the scheduler performs thread switches.
Unfortunately we are not yet able to reproduce the issue in an isolated environment. Any further insight could be helpful.
Cheers, Jonatan
Closed as the issue is related to the Soft Device.
My application hangs occasionally in this kernel function. Examination of the arguments in a debugger shows that when the second argument (the thread) is already on the list represented by the first argument. In this situation the function never returns, and the RTX kernel becomes hung.
In this case, the loop never exits because the
next
pointer is always non-NULL
and the priority of the thread is always equal to the priority of the item in the list (since they're actually the same object). I've seen this occur while managing the round-robin thread list and also while having a thread wait for EventFlags.osRtxThreadListPut
already tries to be somewhat defensive by returning early when the passed thread isNULL
. It should gracefully handle the case where the thread is already on the list.Perhaps these application hangs are due to some other misconfiguration on my part, but whatever the root cause, tracking down the fix is more difficult because the kernel itself hangs.