Tickless behavior when suspending for longer than expected

c1728p9 commented 7 years ago

If the kernel is suspended for more ticks than osKernelSuspend returns then the subsequent call to osKernelResume may update the osRtxInfo.kernel.tick by less than the number of ticks that have elapsed. This can happen if your device is halted by a debugger and has a clock which keeps counting in debug mode.

In osKernelResume there are three separate cases that can happen, one of which does not fully update ticks:

There are no tasks in osRtxInfo.thread.delay_list - ticks updated by full amount
The tasks in osRtxInfo.thread.delay_list are delayed for more than sleep_ticks - ticks updated by full amount
Tasks in osRtxInfo.thread.delay_list which have a delay less than sleep_ticks - ticks updated until the time all tasks have been moved out of the delay list.

Code where osRtxInfo.kernel.tick may not be fully updated: https://github.com/ARM-software/CMSIS_5/blob/c45122c9e87e265f010d59f7198ae872b015ccc3/CMSIS/RTOS2/RTX/Source/rtx_kernel.c#L413-L424

As a test I added osRtxInfo.kernel.tick += delay; after the while loop (after line 424) and the tick count was fully updated. Should I make a PR for this, or is this behavior intentional?

JonatanAntoni commented 7 years ago

Hi Russ,

this issue sounds like we should handle it very careful.

Suspending the kernel for longer than all threads are blocked is always bad and should be avoided. One tick more or less does not matter. But if you suspend the kernel for much more ticks you get all those ticks running by after suspend in "zero time".

Imagine you have a thread that should run once every 10 ticks (10ms). Suspend the kernel for lets say 50 ticks. The thread will now be executed 5 times without any delay in between. I guess that's rarely what you want.

If you do not have any timed behavior where do you recognize this tick drift else? Please consider the tick value not like a real time (wall clock) but as execution time. If you loose execution time for whatever reason the time is gone.

Cheers, Jonatan

c1728p9 commented 7 years ago

Hi @JonatanAntoni, this was found by an mbed-os tickless mode assert, which checks to make sure the actual tick of the OS matches the tick expected by tickless. It only happened when the device was halted by a debugger.

It sounds like the mbed-os implementation needs to be updated to cap the number of elapsed ticks to that returned by osKernelSuspend. Thanks for the response.

JonatanAntoni commented 7 years ago

Hi Russ,

ah, yes. Halting the device for debugging is always a mess for any time related stuff. Either you lose some timeouts or you create some on external devices. In this case it might be good to check if you can halt the free running timer as well. Taking the time spent in debug halt into account on resume mess up the behavior of your system completely.

Cheers, Jonatan

kjbracey commented 5 years ago

I'm working on suspend code, and I've just observed this exact issue by inspection, and I concluded that the same fix is required (added osRtxInfo.kernel.tick += delay; after the while loop). Was going to raise a new issue, but this popped up.

In my case, I am going for a deliberate suspend - for a time longer than the OS is expecting, and I was intending the application-visible tick count to be fully updated to consider the time spent in suspend (this seems to be the preference for monotonic timers in newer APIs). We're currently getting that from the kernel tick count, and any external fudges would be fragile, I feel.

I would be inclined to make sure this clock stops in debug mode - but I think that's a separate config setting, and not a counter argument to doing this for a software-initiated long suspend.

kjbracey commented 5 years ago

So, I would like to re-open this to discuss how to handle system suspend, and it sounds like I may need to confer with @c1728p9 on debug clock stopping.

RobertRostohar commented 5 years ago

Please note that osKernelResume has recently been updated (bd088a90f5e14cd94b7c0385b4f3fad99894b54c) and it should fix the problem initially reported in this issue.

kjbracey commented 5 years ago

Ah, that looks like it resolves this nicely. I accept that calling osKernelResume with a big number is going to cause scheduling hiccups, as discussed above, but at least this makes it more predictable if I choose to do it.

ARM-software / CMSIS_5

Tickless behavior when suspending for longer than expected #244