osRtxThreadListPut hangs when the thread is already on the list

andyturk commented 6 years ago

My application hangs occasionally in this kernel function. Examination of the arguments in a debugger shows that when the second argument (the thread) is already on the list represented by the first argument. In this situation the function never returns, and the RTX kernel becomes hung.

/// Put a Thread into specified Object list sorted by Priority (Highest at Head).
/// \param[in]  object          generic object.
/// \param[in]  thread          thread object.
void osRtxThreadListPut (volatile os_object_t *object, os_thread_t *thread) {
  os_thread_t *prev, *next;
  int32_t      priority;

  if (thread == NULL) {
    return;
  }

  priority = thread->priority;

  prev = (os_thread_t *)(uint32_t)object;
  next = object->thread_list;
  while ((next != NULL) && (next->priority >= priority)) {
    prev = next;
    next = next->thread_next;
  }
  thread->thread_prev = prev;
  thread->thread_next = next;
  prev->thread_next = thread;
  if (next != NULL) {
    next->thread_prev = thread;
  }
}

In this case, the loop never exits because the next pointer is always non-NULL and the priority of the thread is always equal to the priority of the item in the list (since they're actually the same object). I've seen this occur while managing the round-robin thread list and also while having a thread wait for EventFlags.

osRtxThreadListPut already tries to be somewhat defensive by returning early when the passed thread is NULL. It should gracefully handle the case where the thread is already on the list.

Perhaps these application hangs are due to some other misconfiguration on my part, but whatever the root cause, tracking down the fix is more difficult because the kernel itself hangs.

JonatanAntoni commented 6 years ago

Hi @andyturk,

thanks for reporting this issue.

osRtxThreadListPut is an internal function and calling it with a thread already in the list should never happen. For sure one could catch that but not without additional costs.

We need to analyse this issue to track it down to the root cause. Is it feasible for you to provide a reproducer example?

Thanks again, Jonatan

andyturk commented 6 years ago

At the moment, I'm only able to reproduce it inside the application I'm working on, but if I can reduce it to a smaller example, I'll definitely share that.

andyturk commented 6 years ago

I instrumented osRtxThreadListPut to break instead of going into an infinite loop. In this particular example, the call came from osRtxTick_Handler where it deals with the round robin implementation:

        if (osRtxKernelGetState() == osRtxKernelRunning) {
          thread = osRtxInfo.thread.ready.thread_list;
          if ((thread != NULL) && (thread->priority == osRtxInfo.thread.robin.thread->priority)) {
            osRtxThreadListRemove(thread);
            osRtxThreadReadyPut(osRtxInfo.thread.robin.thread); <<<<< this call hangs
            osRtxThreadSwitch(thread);
            osRtxInfo.thread.robin.thread = thread;
            osRtxInfo.thread.robin.tick   = osRtxInfo.thread.robin.timeout;
          }
        }

The immediately preceeding call is to remove the currently running thread from the ready list, and then replace it with the next thread on the round robin list. At least, I think that's what's going on (please correct me!).

It's possible for a thread to be the currently running thread (aka osRtxInfo.thread.ready.thread_list) and also be at the head of the round robin queue (aka osRtxInfo.thread.robin.thread). At least, that's the situation when my application hangs. Here's what the global osRtxInfo struct looks like after the hang:

osRtxInfo.thread.ready.thread_list has the same value as osRtxInfo.thread.robin.thread: 0x20005AFC, which is also the thread passed to the hanging call to osRtxThreadListPut. Is that expected?

JonatanAntoni commented 6 years ago

Thanks for your additional investigation.

osRtxInfo.thread.ready.thread_list is the list of threads ready to run. The currently running thread osRtxInfo.thread.run.curr should not be part of the ready list.

We need some time to investigate in your observations.

andyturk commented 6 years ago

Jonatan,

Yes, osRtxInfo.thread.run.curr is part of the ready list when my application hangs:

Here's the call stack:

JonatanAntoni commented 6 years ago

@andyturk,

Thanks. That looks wrong to me. We need to trace this back to where the lists are messed up.

During normal operation you should be able to observe the following during context switch:

The first thread from the ready list is "moved" to be the thread.run.next.
The actual context switch is executed from thread.run.curr to thread.run.next.

At some point in time this seems to go wrong.

ReinhardKeil commented 6 years ago

@andyturk

The effect that you are seeing might be the result of a configuration issue.

Is it possible to get a bit information about your target system:

what processor/device are you using?
what is the RTX tick clock source? Are you using tick-less mode?
what are the ISR priority settings in your system?

andyturk commented 6 years ago

@ReinhardKeil

This code is running on nRF52.

The tick source is RTC1, and yes, we are using "tick-less" mode.

There is a regular peripheral interrupt which is setting an event flag, and a thread which processes the data. Compiled with debug enabled, that thread takes longer than the round-robin threshold to run and that's where things start to go sideways.

RTC1's interrupt priority is 7. The peripheral with the regular interrupt is also at 7, as are a number of other peripherals. A UART (for a command-line shell) is using priority 5.

Nordic's softdevice is present but not enabled when the kernel hangs.

In case it's relevant, here's the "guts" of the tickless implementation (in C++):

namespace {
inline void wait_for_event() {
    // this sequence puts the core to sleep!

    __SEV();
    __WFE();

#if defined(SOFTDEVICE_PRESENT)
    sd_app_evt_wait();
#else
    __WFE();
#endif
}

bool consider_debugger_presence = 1;
inline bool is_debugger_attached() {
    return consider_debugger_presence &&
           (CoreDebug->DHCSR & CoreDebug_DHCSR_C_DEBUGEN_Msk);
}

} // namespace

// http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0553a/Cihhjgdh.html
inline void send_event_on_irq_pend(bool value) {
  if (value) {
    SCB->SCR |=  SCB_SCR_SEVONPEND_Msk;
  } else {
    SCB->SCR &= ~SCB_SCR_SEVONPEND_Msk;
  }
}

void CustomerKernel::idle() {
    constexpr uint32_t FPU_EXCEPTION_MASK = 0x9f;

    while (1) {
        volatile uint32_t ticks_to_sleep = suspend();
        volatile uint32_t idle_entry     = rtc_.get_counter();

        if (ticks_to_sleep > 1) {
            rtc_.set_deadline_ticks_from_now(ticks_to_sleep);

            // 2017-09-23 AKT
            //
            // The system is about to go to sleep here. The desired
            // behavior is that the system will wake from sleep the
            // specified number of ticks later *OR* whenever an
            // interrupt occurs.
            //
            // The SEVONPEND mechanism sets an "event" bit in the processor
            // whenever an interrupt is pended within the NVIC. This happens
            // regardless of whether the interrupt is enabled. I.e., the system
            // will sleep as long a no interrupts are pending.
            //
            // The floating point interrupt is cleared here to ensure
            // that a math exception won't prevent the system from
            // entering sleep mode.

            __set_FPSCR(__get_FPSCR() & ~(FPU_EXCEPTION_MASK));
            (void)__get_FPSCR();
            NVIC_ClearPendingIRQ(FPU_IRQn);

            send_event_on_irq_pend(true);
            uint32_t error = sd_app_evt_wait();
            APP_ERROR_CHECK(error);

        } else {
          do {
            // nothing
          } while (rtc_.get_counter() == idle_entry);
        }

        volatile uint32_t idle_exit = rtc_.get_counter();
        volatile uint32_t ticks_actually_slept =
          (idle_exit - idle_entry) & RTC::RESOLUTION;

        resume(ticks_actually_slept);
    }
}

ReinhardKeil commented 6 years ago

The problem could be caused by the ISR priority of RTC which is the RTX timer source in your case. This ISR needs to be lowest priority.

We clarified this lately under https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/DoxyGen/RTOS2/src/cmsis_os2.txt - Using Interrupts on Cortex-M.

Can you change priority and check if this solves the issue.

ReinhardKeil commented 6 years ago

Sorry accidentally closed

andyturk commented 6 years ago

The (new) documentation has the following table:

| Handler | Priority | Interrupt/Exception                                                                            |
|---------+----------+------------------------------------------------------------------------------------------------|
| SysTick | lowest   | Kernel system timer interrupt to generate periodic timer ticks                                 |
| PendSV  | lowest   | PendSV (request for system-level service) when calling certain RTX functions from Handler mode |
| SCV     | lowest+1 | Supervisor Call used to enter the RTOS kernel from Thread mode                                 |

The meaning of "lowest" here is potentially ambiguous due to the way Cortex-M priorities work. I.e., an interrupt with a lower numerical value (as stored in the register) will preempt an interrupt of a higher priority. My interpretation of the table is that SysTick and PendSV should not preempt any other interrupt, and that SVCall should preempt those two.

Is that correct?

For my specific case, the nRF52 family implements only 3 bits of interrupt priority, so the hardware supports 0-7. The Nordic soft device further restricts priorities that are available to applications. SVCall is essentially hardwired to priority 4. Application interrupts are restricted to use priorities [2,3,5,6,7].

So having a Cortex-M priority of 4 for SVCall (as required by Nordic) is compatible with RTX5 if the PendSV and the tick interrupts are numerically higher values. This symptom reported here originally occurred with priorities: SVCall=4, PendSV=7 and tick (RTC1)=7.

I also tried configuring the system so that SVCall=4, PendSV=7 and tick (RTC1)=6--the idea being to ensure that PendSV will never preempt any other interrupt. The kernel still hangs in this case too.

RobertRostohar commented 6 years ago

The Interrupt Priority table interpretation should be: -SVC must not be preempted by SysTick or PendSV -SysTick and PendSV must not preempt each other

Basically SysTick and PendSV should have the same priority (typically the lowest in the system) where SVC should have at least one level higher priority. Note: lowest priority means the highest number.

The mentioned configuration “SVCall=4, PendSV=7 and tick (RTC1)=7 “ should be OK.

Would you be able to provide a reproducer example?

andyturk commented 6 years ago

Would you be able to provide a reproducer example?

I can try, but it might take a few days since I'm traveling now. Is the nRF52840 DK acceptable as a hw platform?

RobertRostohar commented 6 years ago

Let's first try to see the Events generated by RTX. Maybe this will tell more.

Could you add the Event Recorder Visibility to your project? See: http://www.keil.com/pack/doc/CMSIS/RTOS2/html/cre_rtx_proj.html#cre_rtx_proj_er

and then capture the Events. see: http://www.keil.com/support/man/docs/uv4/uv4_db_dbg_evr_view.htm

andyturk commented 6 years ago

Our build uses GCC and Makefiles, not Keil (apologies to @ReinhardKeil). Maybe I could get the event data out via Segger's RTT? But that would be a project too.

ReinhardKeil commented 6 years ago

@andyturk maybe you can describe the threads that are running on your system a bit?

How many do you have at which priority?
Which threads should timeout using Round Robin scheduling?

It seems that you use RTX 5.1.0 (about 2 years old). Since then several issues are fixed. See: http://arm-software.github.io/CMSIS_5/RTOS2/html/rtos_revisionHistory.html

Can you update the kernel and retest?

RobertRostohar commented 6 years ago

@andyturk Reproducer on nRF52840 DK would be fine.

andyturk commented 6 years ago

@RobertRostohar @ReinhardKeil

An update...

I've updated the RTOS to 5.3.0 (master branch). The same symptom is repdroducible on 5.3. I've also plumbed in Segger's SystemView to help show what's happening as the system executes. To review, the symptom is that the kernel hangs (variously) in loops that insert a thread into a sorted list of other threads. Typically, these lists are the ready list or the delay list.

Creating a stand-alone "reproducer" looks like a lot of work, since I'd need to replicate the tickless implementation without our RTOS abstraction layer (which sit on stop of CMSIS RTOS, which sits on top of RTX, ...). I'm hoping the job will be easier when I can actually determine what's causing the problem. I suspect this is a misconfiguration/misuse of something outside of RTX, but can't verify that yet. For now, all I know is the kernel gets into an infinite loop easily.

Reinhard asked about thread priorities:

There are 4 application threads. "charger" is at priority 32, while "blinky", "shell", and "wc" are all at 24. ISR 45 produces data for the "wc" thread. ISR 33 is the OS tick timer, ISR 19 is I2C, and ISR 18 is the UART used by the "shell" thread.

I've added assertions in the kernel itself to catch these loops as they're about to happen. Here's an example:

/// Insert a Thread into the Delay list sorted by Delay (Lowest at Head).
/// \param[in]  thread          thread object.
/// \param[in]  delay           delay value.
static void osRtxThreadDelayInsert (os_thread_t *thread, uint32_t delay) {
  os_thread_t *prev, *next;

  if (delay == osWaitForever) {
    prev = NULL;
    next = osRtxInfo.thread.wait_list;
    while (next != NULL)  {
      prev = next;
      next = next->delay_next;
      assert(next != prev); <<<<<<<<<<<<<<<<<<<< assertion added here
    }
    thread->delay = delay;
    thread->delay_prev = prev;
    thread->delay_next = NULL;
    if (prev != NULL) {
      prev->delay_next = thread;
    } else {
      osRtxInfo.thread.wait_list = thread;
    }
  } else {
    prev = NULL;
    next = osRtxInfo.thread.delay_list;
    while ((next != NULL) && (next->delay <= delay)) {
      delay -= next->delay;
      prev = next;
      next = next->delay_next;
      assert(next != prev); <<<<<<<<<<<<<<<<<<<< assertion added here
    }
    thread->delay = delay;
    thread->delay_prev = prev;
    thread->delay_next = next;
    if (prev != NULL) {
      prev->delay_next = thread;
    } else {
      osRtxInfo.thread.delay_list = thread;
    }
    if (next != NULL) {
      next->delay -= delay;
      next->delay_prev = thread;
    }
  }
}

Maybe a picture is worth 1k words:

This particular run had round-robin scheduling disabled (OS_ROBIN_ENABLE = 0). When round-robin is enabled, the system still hangs, but in a different place: osRtxThreadListPut.

Interrupt priorities:

> irq
           Reset = -3
  NonMaskableInt = -2
       HardFault = -1
MemoryManagement = 0
        BusFault = 0
      UsageFault = 0
          SVCall = 4
          PendSV = 7
         SysTick = 0
     POWER/CLOCK = 4
           UART0 = 5
      SPI0/TWIM0 = 6
          GPIOTE = 6
            RTC0 = 0
            RTC1 = 6
     COMP/LPCOMP = 6
       SWI2_EGU2 = 6
       SWI5_EGU5 = 4
             MWU = 1

In this case, the system runs fine for a while, with the "wc" thread processing data. The thread basically waits for an EventFlags object to signal that data is ready. The end of the trace shows the assertion failure in osRtxThreadDelayInsert. Note: "Idle" in the screen shot is intended to represent time between osKernelSuspend()/osKernelResume(). The "Scheduler" context is inferred somehow by SystemView and might not be correct. But the ISR activity is straightforward and the context switching in the trace comes directly from EvrRtxThreadSwitched in RTX.

This is turning out to be a thorny problem, and I'd appreciate any advice or suggestions on how to determine the root cause.

ReinhardKeil commented 6 years ago

@andyturk

Andy, thanks for all the details. We will analyze and get back to you.

The root cause of the issue could be the interrupt priority of SVC. All user interrupts (that interact with the RTOS) should have higher priority (= lower value in the register). This seems not the case. Can you double check?

JonatanAntoni commented 6 years ago

@andyturk, another configuration makes me wondering. I guess RTC1 is used for the tick, not SysTick? RTC1 has priority 6 but PendSV has 7. Does RTC1 interrupt preempt PendSV? That must not happen.

andyturk commented 6 years ago

@ReinhardKeil

I can still reproduce the hang if all the application interrupts that interract with the RTOS have a higher priority than SVC. In my case, that's UART0, SPI0/TWIM0, GPIOTE (which I forgot earlier) and COMP/LPCOMP. RTC1 is the tick interrupt. At runtime, the irqs are set as follows:

> irq
           Reset = -3
  NonMaskableInt = -2
       HardFault = -1
MemoryManagement = 0
        BusFault = 0
      UsageFault = 0
          SVCall = 4
          PendSV = 7
         SysTick = 0
     POWER/CLOCK = 4
           UART0 = 2
      SPI0/TWIM0 = 3
          GPIOTE = 3
            RTC0 = 0
            RTC1 = 7
     COMP/LPCOMP = 3
       SWI2_EGU2 = 6
       SWI5_EGU5 = 4
             MWU = 1

It hangs in osRtxThreadDelayInsert:

The IRQ relationships are codified as follows:

    // Scheduler IRQ requirements

    static_assert(APP_HIGH_IRQ_PRIORITY < APP_LOW_IRQ_PRIORITY,
                  "Higher priorities have lower numbers on Cortex-M");

    // from Robert Rostohar
    static_assert(!(TICK_IRQ_PRIORITY < SVC_IRQ_PRIORITY) &&
                  !(PENDSV_IRQ_PRIORITY < SVC_IRQ_PRIORITY),
                  "SVC must not be preempted by SysTick or PendSV");

    static_assert(TICK_IRQ_PRIORITY == PENDSV_IRQ_PRIORITY,
                  "SysTick and PendSV must not preempt each other");

    static_assert(!(PENDSV_IRQ_PRIORITY < APP_LOW_IRQ_PRIORITY),
                  "PendSV must not preempt other interrupts");

    // from Reinhard Keil
    static_assert(APP_LOW_IRQ_PRIORITY < SVC_IRQ_PRIORITY,
                  "Application interrupts that use the RTOS must preempt SVC");

@JonatanAntoni Yes, I'd tried earlier to make PendSV lower than any of the other interrupts. This time, however, PendSV and the tick interrupt have the same priority to meet Rostohar #2.

ReinhardKeil commented 6 years ago

@andyturk we did review, but we are unable to spot the root cause based on your reports.

Would it be possible to get a reproducer? You may send this via email to cmsis@arm.com.

Thanks Reinhard

JonatanAntoni commented 6 years ago

All,

we discussed the effects in more details and came up with an interesting fact about the device in question.

The nRF is driven by a Soft Device that encapsulates certain proprietary communication stacks, like Bluetooth. This part of the final application software takes direct control of the interrupt and exception handlers (like SVC). Hence the actual SVC handler used by RTX5 is only called indirectly. This might have unforeseen impact on the call stack. RTX5 has strong assumptions about the stack frame when the scheduler performs thread switches.

Unfortunately we are not yet able to reproduce the issue in an isolated environment. Any further insight could be helpful.

Cheers, Jonatan

ReinhardKeil commented 5 years ago

Closed as the issue is related to the Soft Device.

ARM-software / CMSIS_5

osRtxThreadListPut hangs when the thread is already on the list #363