ARMmbed / mbed-os

Arm Mbed OS is a platform operating system designed for the internet of things
https://mbed.com
Other
4.67k stars 2.98k forks source link

nRF52 timer problem #3857

Closed tbondar closed 7 years ago

tbondar commented 7 years ago

Description

I'm trying to get the BLE_Button example working with low power enabled. Based on the answer to this question I compiled the example code with RTOS disabled (by adding .mbedignore as suggested). Power consumption went down as expected but the LED stopped blinking after a couple of hours of running. This makes the BLE unusable on nRF52 for any practical purpose because either it has too high power consumption or stops working after a few hours.


Bug

Target NRF52_DK

Toolchain: GCC_ARM

Toolchain version: gcc-arm-none-eabi-4_9-2015q3

mbed-cli version: 0.9.10

mbed-os sha: 5faf4b2 Merge pull request #3837 from ARMmbed/release-candidate

Expected behavior BLE_Button example should keep working with low power enabled.

Actual behavior After some time the LED stops blinking. The BLE interface still works, i.e. I can connect to the device and read characteristics correctly. When there is some BLE activity the LED blinks once or twice. It seems that the EventQueue stops calling the blinkCallback() function.

This happens after a couple of hours of running. It seems that the exact time required depends on the on the event dispatch frequency:

eventQueue.call_every(n, blinkCallback);

If n is 50, LED stops blinking after 8200 seconds, if n is 20, it stops after about 6670 seconds.

Steps to reproduce Download the official BLE_Button example, disable RTOS by adding the following .mbedignore to the project and compile the code.

mbed-os/rtos/*
mbed-os/features/FEATURE_CLIENT/*
mbed-os/features/FEATURE_COMMON_PAL/*
mbed-os/features/FEATURE_UVISOR/*
mbed-os/features/frameworks/*
mbed-os/features/net/*
mbed-os/features/netsocket/*
mbed-os/features/storage/*

Additional Comments I have been trying to make BLE examples work with low power for a few days now using various methods. I tried using waitForEvent() and Tickers with a very similar result. It seems that some low level code interfacing the nRF52 timer might be buggy in some way.

pan- commented 7 years ago

@tbondar Thanks for the detailed bug report.

After some time the LED stops blinking. The BLE interface still works, i.e. I can connect to the device and read characteristics correctly. When there is some BLE activity the LED blinks once or twice. It seems that the EventQueue stops calling the blinkCallback() function.

Did you observe the same behavior if the RTOS is enabled ?

tbondar commented 7 years ago

I couldn't reproduce the problem with RTOS enabled (i.e. the default .mbedignore file). I keep my test running just in case, but it seems this problem is related to when low power mode is enabled one way or another.

pan- commented 7 years ago

@nvlsianpu could you look into this ?

tbondar commented 7 years ago

@pan- @nvlsianpu Just wondering if there's a chance to get a fix for this soon or I'd better look for some workaround. Thanks.

nvlsianpu commented 7 years ago

I started with looking into. I encountered that this if was not fullfiled. After code had gone through "that if" it weren't generated periodical event requested by eventQueue.call_every method.

@tbondar could you confirm that it behaves similar on your desk? You can for example use printf for alternate condition of "this if".

tbondar commented 7 years ago

@nvlsianpu (Previous comment updated after further tests)

You are probably right. I made the following changes to equeue_mbed.cpp and let the application running. When it breaks, LED3 stops blinking and LED4 does not start blinking. LED 4 toggles occasionally when certain BLE events happen or the button is pressed. Apparently, variable ms is negative or zero.

--- a/events/equeue/equeue_mbed.cpp
+++ b/events/equeue/equeue_mbed.cpp
@@ -128,12 +128,17 @@ static void equeue_sema_timeout(equeue_sema_t *s) {
     *s = -1;
 }

+DigitalOut  led3(LED3);
+DigitalOut  led4(LED4);
+
 bool equeue_sema_wait(equeue_sema_t *s, int ms) {
     int signal = 0;
     Timeout timeout;
     if (ms > 0) {
         timeout.attach_us(callback(equeue_sema_timeout, s), ms*1000);
-    }
+        led3 = !led3;
+        led4 = 1;
+    } else { led4 = !led4; }

     core_util_critical_section_enter();
     while (!*s) {
nvlsianpu commented 7 years ago

We encountered the same. Great! It looks like some code had been executing to long and therefore a event time to setup had been overdue and then expected new event didn't been requested anymore.

tbondar commented 7 years ago

It's interesting that it happens after exactly the same time (~ 8200 seconds) while the code is advertising only. I don't need to stress it in any way to reproduce it. Another comment is that a very similar thing happens with RTOS enabled, please see my comments at #3509 and here.

nvlsianpu commented 7 years ago

@pan- Why the creating evet-queue requests tick timer to start? here

pan- commented 7 years ago

Why the creating evet-queue requests tick timer to start? here

@nvlsianpu Do you think it should be started only when the tick timer is actually in use; when their is at least one event in the event queue to execute in N ticks ?

I think it is a commodity solution, the equeue_tick function is used by all event queues the system can create and is not tied to any instance so it is easier to think about it if the tick is always running.

@geky Am I correct ?

nvlsianpu commented 7 years ago

Do you think it should be started only when the tick timer is actually in use; when their is at least one event in the event queue to execute in N ticks ?

I think it could be kept as is for now - this is very rare tick and RTC consume a few uA. But it could be a case for ultra low power devices (like a flowmeter).


If the time is close to the period which was requested by creating of the evnt-queue it is possible the following code execution:

ev_dispatch -> next_evt_set

It looks like sometimes after that sequence, the expected periodic event (e.g. of 50 ms) is never requested again.

I discovered that RTC CC event were always being requested periodically: 1st by the ticker form RTC handler and very soon 2nd by the event dispatcher - until described sequence.

CC set twice                           CC set twice              CC set once
  1  2                                      1  2                           1
__|__|______________________________________|__|_______......______________|_.___________

I will back to investigation on monday.

geky commented 7 years ago

Ah, yeah, the event queue assumes the equeue_tick is always running.

The queue keeps track of the last tick a dispatch occured on. This is used to decide if a cancelled event is still in the queue (here). We also don't assume when the tick starts, so choosing 0 may happen in the near future.

We could add something that decides this tick lazily as @pan- suggests, but currently the queue simply loads from the timer during initialization.


Could it be that equeue_sema_wait is getting called with a wait of zero? This can happen if it takes enough time between dispatching events and going back to wait (here). It looks like the non-rtos equeue_sema_wait may not handle zero correctly (here).

 bool equeue_sema_wait(equeue_sema_t *s, int ms) {
     int signal = 0;
     Timeout timeout;
-    if (ms > 0) {
+    if (ms >= 0) {
         timeout.attach_us(callback(equeue_sema_timeout, s), ms*1000);
     }

     core_util_critical_section_enter();
     while (!*s) {
         sleep();
         core_util_critical_section_exit();
         core_util_critical_section_enter();
     }
pan- commented 7 years ago

I discovered that RTC CC event were always being requested periodically: 1st by the ticker form RTC handler and very soon 2nd by the event dispatcher - until described sequence.

One thing I have noticed in the debugger is that every time the LED stop blinking, events are still scheduled in the queue, the low power timer is still running but the value of the CC register is set to 0. I doubt this is the issue but the documentation is not very clear about this situation:

COMPARE occurs when a CC register is N and the COUNTER value transitions from N-1 to N.


I discovered that RTC CC event were always being requested periodically: 1st by the ticker form RTC handler and very soon 2nd by the event dispatcher - until described sequence.

There is one things to notice in the RTC documentation:

• If the COUNTER is N, writing N or N+1 to a CC register may not trigger a COMPARE event

It seems that the implementation is protected against this case: (here). But in reality, the code makes two wrong assumptions:

It would be interesting to protect this code by a critical section and use uint32_t closest_safe_compare = common_rtc_32bit_ticks_get() + 3; as the minimal value the CC register can accept (stopping the counter is too expensive).

pan- commented 7 years ago

@tbondar Could you try with this patch:

--- a/targets/TARGET_NORDIC/TARGET_NRF5/us_ticker.c
+++ b/targets/TARGET_NORDIC/TARGET_NRF5/us_ticker.c
@@ -190,13 +190,16 @@ void common_rtc_set_interrupt(uint32_t us_timestamp, uint32_t cc_channel,
     // difference between the compare value to be set and the current counter
     // value is 2 ticks. This guarantees that the compare trigger is properly
     // setup before the compare condition occurs.
-    uint32_t closest_safe_compare = common_rtc_32bit_ticks_get() + 2;
+
+    __disable_irq();
+    uint32_t closest_safe_compare = common_rtc_32bit_ticks_get() + 3;
     if ((int)(compare_value - closest_safe_compare) <= 0) {
         compare_value = closest_safe_compare;
     }

     nrf_rtc_cc_set(COMMON_RTC_INSTANCE, cc_channel, RTC_WRAP(compare_value));
     nrf_rtc_event_enable(COMMON_RTC_INSTANCE, int_mask);
+    __enable_irq();
 }
 //------------------------------------------------------------------------------

@nvlsianpu Can it be the only source of error ?

nvlsianpu commented 7 years ago

@pan- It doesn't look as bad CC set. I have a picture with signals: Imgur

My diagnose is; Event queue service didn't recognize event as its-own if event timeout was set in ticker handler.

tbondar commented 7 years ago

@pan- I'm afraid the patch above didn't fix the problem.

nvlsianpu commented 7 years ago

It looks like change proposed by @geky (https://github.com/ARMmbed/mbed-os/issues/3857#issuecomment-287166802) patches the bug. @tbondar could you confirm or contradict. My board didn't hang since more than 12 hours.

tbondar commented 7 years ago

@nvlsianpu OK, I'm starting a test right now. @pan- @geky @nvlsianpu Guys, do you all agree that this is probably a reasonable solution to the problem? We plan to use this code in a device that's going out to our customer and we don't have much time left to do long term tests. Thanks for all your work so far!

geky commented 7 years ago

I created a pr for review (#3991), hopefully this is the fix

This patch makes the non-rtos and rtos versions of equeue_sema behave consistently. The rtos version has the main support of mbed OS, so there should be fewer issues outside of the equeue_sema layer.

tbondar commented 7 years ago

@nvlsianpu Unfortunately my board still stopped blinking after 8200 seconds. What is the blinking period that you use? Please note, that I changed the original 500 ms to 50 ms, as stated in the original bug report.

One more interesting addition is that sometimes the blinking restarts spontaneously after some time. I will try to collect more information about this.

nvlsianpu commented 7 years ago

I contradict as well - https://github.com/ARMmbed/mbed-os/pull/3991 didn't fix this issue (but it still fixes other issue :D). I reproduced issue witch changes as well.

nvlsianpu commented 7 years ago

I prepared patch https://github.com/ARMmbed/mbed-os/pull/4016. I'm waiting now for verification using board on my desk. cc @tbondar , @pan-

nvlsianpu commented 7 years ago

I confirm that PR 4016 mentioned above fixes this issue.

tbondar commented 7 years ago

@nvlsianpu I can confirm too: it works. Many thanks for getting this fixed!

yogeshk19 commented 7 years ago

Has this fix made into the main branch, if not what branch of the mbed-os is this fix available?

Thanks, Yogesh

nvlsianpu commented 7 years ago

@yogeshk19 PR is still pending. It is possible to pull this from its source repo.

yogeshk19 commented 7 years ago

@nvlsianpu Thanks for taking the time to respond. Do I have to pull the entire the mbed-os or can I just get the changed files that effect the above issue. I am currently using the 5.2.3 version of mbed-os.

Also Is there a timeline as to when this PR would be completed and added to specific mbed-os branch? We are trying to figure out our beta testing dates and just was hoping to get an idea as to when this fix would be made into an official mbed-os 5 branch

Thanks, Yogesh

nvlsianpu commented 7 years ago

Do I have to pull the entire the mbed-os or can I just get the changed files that effect the above issue.

It is one file fix - you can just copy us_ticker.c

I can't to respond to second question. It's fix so it should be merged to the master ASAP.

yogeshk19 commented 7 years ago

@nvlsianpu - Thanks for getting back to me. I will just use that file for now. How would I know if this has been committed to one of the official branches. Just check the PR status?

Thanks, Yogesh

pan- commented 7 years ago

@yogeshk19 you can check the PR status. I've just accepted the PR content it should be a matter of few days now.

I must add that there is also a bug affecting every time related primitive (Timer, Ticker, Timeout, ...) if the timeout requested exceeds 2147 seconds. I've made a PR for it: (#4094), it is still in progress but I have good hopes that it will be integrated soon.

nvlsianpu commented 7 years ago

Shall we close this issue?

tbondar commented 7 years ago

@nvlsianpu I suggest to keep it open till v5.4.4 is released. If it's not reproducible with that then I think we can close it. (For our production code we moved from mbed to Nordic SDK, so this problem doesn't have direct effect on our product any more.)

yogeshk19 commented 7 years ago

So this change hasn't been merged into 5.4.3 that was released a week ago? If not when is 5.4.4 going to be out, so these change is merged into an official release?

Thanks, Yogesh

nvlsianpu commented 7 years ago

Patch release are published every 2 weeks. So It will happened in a week.

ghost commented 7 years ago

The fix should be available in a release now right?

nvlsianpu commented 7 years ago

Yes: https://github.com/ARMmbed/mbed-os/releases/tag/mbed-os-5.4.4

nvlsianpu commented 7 years ago

This should be closed. @tbondar @0xc0170

lpercifield commented 7 years ago

Hey all, I've run into an problem that is possibility related to this issue. I'm using an mbed implementation from RedBear on the nRF52 module. Code here Using their BLE peripheral example combined with a ticker example, I'm seeing the ticker fail almost immediately after startup. I've updated the us_ticker.c file to reflect the changes in the 5.4.4 release, with no luck. Without the softdevice running, the ticker works perfectly. Any guidance or suggestions are welcome.

Keep up the great work!

adbridge commented 7 years ago

@lpercifield Can you raise a new issue against mbed-os detailing your set up , tool versions etc , mbed-os versions etc and we can take another look. Thanks