Writing to KVStore on DISCO-F769NI sets timers that are used on Mbed OS event queue out of sync

mikaleppanen commented 4 years ago

Description of defect

Mbed OS event queue equeue_tick() function calls one of two time functions based on whether call is made from interrupt or thread:

unsigned equeue_tick()
{
    using unsigned_ms_t = std::chrono::duration<unsigned, std::milli>;

    unsigned_ms_t d;
#if defined MBED_TICKLESS || !MBED_CONF_RTOS_PRESENT
    // It is not safe to call get_ms_count from ISRs, both
    // because documentation says so, and because it will give
    // a stale value from the RTOS if the interrupt has woken
    // us out of sleep - the RTOS will not have updated its
    // ticks yet.
    if (core_util_is_isr_active()) {
        // And the documentation further says that this
        // should not be called from critical sections, for
        // performance reasons, but I don't have a good
        // current alternative!
        d = std::chrono::duration_cast<unsigned_ms_t>(mbed::internal::os_timer->get_time().time_since_epoch());
    } else {
        d = rtos::Kernel::Clock::now().time_since_epoch();
    }
#else
    // And this is the legacy behaviour - if running in
    // non-tickless mode, this works fine, despite Mbed OS
    // documentation saying no. (Most recent CMSIS-RTOS
    // permits `ososKernelGetTickCount` from IRQ, and our
    // `rtos::Kernel` wrapper copes too).
    d = rtos::Kernel::Clock::now().time_since_epoch();
#endif
    return d.count();
}

On the error case, after several KVstore writes, the mbed::internal::os_timer->get_time() returns a time value ahead of the rtos::Kernel::Clock::now().time_since_epoch() call.

This results that when equeue->call() is made from an interrupt, the event gets time value ahead of the time value used when processing the event. Then when processing the event, the event queue waits for the time difference to elapse before triggering the event, although the event should have been triggered right away.

Here is an error log from included application that can be used to repeat the error:

kv_set
cnt: 11100 cb delay: 12 us, os t: 111254 ms, rtos t: 111254 ms <-------- OK, delay before event queue call is 12 microseconds and timers are in sync (11254 ms)
kv_set
cnt: 11400 cb delay: 13 us, os t: 114260 ms, rtos t: 114260 ms
cnt: 11700 cb delay: 12 us, os t: 117267 ms, rtos t: 117267 ms
kv_set
cnt: 12000 cb delay: 13 us, os t: 120273 ms, rtos t: 120273 ms
kv_set
cnt: 12300 cb delay: 13 us, os t: 123279 ms, rtos t: 123279 ms
kv_set
cnt: 12600 cb delay: 12 us, os t: 126287 ms, rtos t: 126287 ms
kv_set
cnt: 12900 cb delay: 12 us, os t: 129293 ms, rtos t: 129293 ms
cnt: 13200 cb delay: 12 us, os t: 132299 ms, rtos t: 132299 ms
kv_set
cnt: 13500 cb delay: 13 us, os t: 135473 ms, rtos t: 135473 ms
cnt: 13507 cb delay: 511096 us, os t: 136580 ms, rtos t: 136068 ms <----- error happens, difference between timers is 512ms and delay before event queue is that same value 511096 us)
ERROR
cnt: 13800 cb delay: 511958 us, os t: 289526 ms, rtos t: 289014 ms
cnt: 14100 cb delay: 511987 us, os t: 446126 ms, rtos t: 445614 ms

The time differences seem to be multiples of 128 (256, 512, 768) etc.

Timer used on interrupt is on:

https://github.com/ARMmbed/mbed-os/blob/master/hal/source/mbed_ticker_api.c

This error was encountered on Wi-SUN Border Router, with Nanostack, Ethernet and Wi-SUN IEEE 802.15.4 radio.

Using below application, without network stack and Ethernet, the error is sometimes hard to repeat. If the network stack and Ethernet are added to the application, error happens in few KVstore writes. This happens with both LwIP and Nanostack. The application is as a default configured to enabled the LwIP and Ethernet.

Target(s) affected by this defect ?

DISCO-F769NI

Board is on dual bank mode (https://os.mbed.com/teams/ST/wiki/How-to-enable-flash-dual-bank)

Toolchain(s) (name and version) displaying this defect ?

GNU Arm Embedded version 9 (9-2019-q4-major)

What version of Mbed-os are you using (tag or sha) ?

e1d3037c17a546ff81527a27d6159f7a2e149327

What version(s) of tools are you using. List all that apply (E.g. mbed-cli)

None

How is this defect reproduced ?

Below application runs a timer which adds a callback to event queue and from the callback re-activates the timer. On the callback time difference between the timers is calculated and a trace is printed every 3 seconds. On parallel KVstore writes are made.

mbed_app.json to enable KVstore

{
    "target_overrides": {
        "*": {
            "mbed-trace.enable": true,
            "platform.stdio-baud-rate": 115200,
            "platform.stdio-convert-newlines": true,
            "platform.stdio-flush-at-exit": true,
            "platform.stdio-buffered-serial": true,
            "drivers.uart-serial-rxbuf-size": 1024
        },
        "DISCO_F769NI": {
            "target.features_add": ["STORAGE"],
            "target.components_add": ["FLASHIAP"],
            "target.flash_dual_bank" : 1,
            "storage_tdb_internal.internal_base_address": "(MBED_ROM_START+0x100000)",
            "storage_tdb_internal.internal_size": "(2*128*1024)",
            "storage.storage_type": "TDB_INTERNAL"
        }
    }
}

Application:

#include "mbed.h"
#include "kvstore_global_api.h"
#include "mbed_os_timer.h"

using namespace mbed;

void event_callback(void);
void timer_callback(void);
void write_to_kv_storage(void);

EventQueue *equeue;
Timeout timeout;
Timer timer;
int timer_cb_time = 0;
bool timer_unsync = false;

int main()
{
    printf("start\n");

    equeue = mbed_highprio_event_queue();
    timer.start();
    // 10ms timeout
    timeout.attach_us(timer_callback, 10000);

    ThisThread::sleep_for(5000);

    ///// FOR ETHERNET ENABLE ////////////
#if 1
    printf("connecting ethernet\n");
    NetworkInterface *net = NetworkInterface::get_default_instance();
    if (!net) {
        printf("no network interface\n");
        return -1;
    }
    int result = net->connect();
    if (result != 0) {
        printf("cannot connect ethernet\n", result);
        return -1;
    }
    printf("ethernet connected\n");
    ThisThread::sleep_for(5000);
#endif
    /////////////////////////////////

    // Do kv_sets until timer sync is lost
    while (true) {
        write_to_kv_storage();
        if (timer_unsync) {
            break;
        }
    }

    printf("ERROR\n");

    while (true) {
        ThisThread::sleep_for(5000);
    }
}

// Timer callback, requests event queue to call event callback
void timer_callback(void)
{
    timer_cb_time = timer.read_us();
    equeue->call(event_callback);
}

// Event callback, checks timers, traces, and starts timer again
void event_callback(void)
{
    static int counter = 0;
    int event_cb_time = timer.read_us();

    //////////// FOR Mbed OS master
    using unsigned_ms_t = std::chrono::duration<unsigned, std::milli>;
    unsigned_ms_t os_timer_ms = std::chrono::duration_cast<unsigned_ms_t>(mbed::internal::os_timer->get_time().time_since_epoch());
    unsigned_ms_t ms_count_ms = rtos::Kernel::Clock::now().time_since_epoch();
    int os_timer = os_timer_ms.count();
    int ms_count = ms_count_ms.count();
    //////////// OR FOR Mbed OS 5.15
    //int os_timer = mbed::internal::os_timer->get_time() / 1000;
    //int ms_count = rtos::Kernel::get_ms_count();
    /////////////

    bool trace = false;
    // Check if timers are out of sync more than 100ms
    if (os_timer - ms_count > 100) {
        if (!timer_unsync) {
            trace = true;
        }
        timer_unsync = true;
    }

    // Trace every 3 seconds: counter, delay between timer callback and event callback, and timers
    if (counter % 300 == 0 || trace) {
        printf("cnt: %i cb delay: %i us, os t: %i ms, rtos t: %i ms\n", counter, event_cb_time - timer_cb_time, os_timer, ms_count);
    }
    counter++;

    timeout.attach_us(timer_callback, 10000);
}

// Write 5 files with 20000 bytes of data
void write_to_kv_storage(void)
{
    printf("kv_set\n");

    int res;

    res = kv_reset("/kv/");
    if (res != MBED_SUCCESS) {
        printf("kv_reset() fail\n");
    }

    char *write_mem_block = (char *) malloc(20000);
    if (!write_mem_block) {
        printf("alloc failed!\n");
        return;
    }

    for (int index = 0; index < 5; index++) {
        char filename[30];
        sprintf(filename, "/kv/file%i", index);
        res = kv_set(filename, write_mem_block, 20000, 0);
        if (res != MBED_SUCCESS) {
            printf("kv_set() fail\n");
        }
    }

    free(write_mem_block);
}

ciarmcom commented 4 years ago

@mikaleppanen thank you for raising this issue.Please take a look at the following comments:

We cannot automatically identify a release based on the version of Mbed OS that you have provided. Please provide either a single valid sha of the form #abcde12 or #3b8265d70af32261311a06e423ca33434d8d80de or a single valid release tag of the form mbed-os-x.y.z . E.g. 'https://github.com/ARMmbed/mbed-os/commits/3a86360cc57cbb6a35c4595b01e361ae919f2' has not been matched as a valid tag or sha. NOTE: If there are fields which are not applicable then please just add 'n/a' or 'None'.This indicates to us that at least all the fields have been considered. Please update the issue header with the missing information, the issue will not be mirroredto our internal defect tracking system or investigated until this has been fully resolved.

ciarmcom commented 4 years ago

Thank you for raising this detailed GitHub issue. I am now notifying our internal issue triagers. Internal Jira reference: https://jira.arm.com/browse/IOTOSM-2516

mikaleppanen commented 3 years ago

Looks like this is caused by overflow of "uint8_t _unacknowledged_ticks;" on: https://github.com/ARMmbed/mbed-os/blob/master/platform/include/platform/internal/SysTimer.h when doing large KV store writes. If the the 8 bit value is changed to 16 bit, does not occur anymore.

0xc0170 commented 3 years ago

cc @mbed-os-core

mikaleppanen commented 3 years ago

On the above application, when KV store is initialized it takes about 800ms. The low power timer advances that time, but RTOS SysTick is not increased full amount, since when synchronizing the SysTick, the maximum for unacknowledged ticks is 255.

@0xc0170 @mbed-os-core Could correction for this be prioritized? This prevent Wi-SUN stack testing with our Wi-SUN Border Router on DISCO-F769NI reference hardware. Long KV store writes cause the event queue on Nanostack to halt. This happens e.g. on Wi-SUN Pelion Border Router when KV store is erased during power on.

@TuomoHautamaki @mikter

0xc0170 commented 3 years ago

I'll share this with the team

0xc0170 commented 3 years ago

@kjbracey-arm https://github.com/mikaleppanen/mbed-os/commit/532f8ca3c9438de6ce0755c1bb289210e763030e#diff-d04d921a0cbb394a2d6acb60d1ea52e4be0d96b540018174e294769618238b88 - this is the fix. I've noticed the comment above about being it 8bit is a feature. Is the use case wisun is facing worth considering, or rather than increasing the timeout, it should be fixed elsewhere?

kjbracey commented 3 years ago

This is similar to #13801, but occurring in an exceptional runtime condition rather than due to debugger interference.

As per discussion there, I think the simplest thing to do is to just increase the counter to 32-bit.

(I would recommend 32-bit over 16-bit just for code size and speed - 2 bytes of RAM isn't huge).

Letting the counter get big does mean a potentially big catch-up spin, but it's proportionate to the time you've already spent jammed. So it's not causing much more of a problem than you already had - it'll be increasing your already-occurred lock-up time by a few percent.

kjbracey commented 3 years ago

I would suggest that aside from the timer issue, surely there must be some sort of flash driver or config problem here? The dual-bank flash should mean you are able to write and erase the kvstore area without disrupting the execution from the other bank at all.

https://www.st.com/resource/en/application_note/dm00266999-stm32f7-series-flash-memory-dual-bank-mode-stmicroelectronics.pdf

"The dual bank Flash memory allows a code to be executed in one bank, while another bank is being erased or programmed. It avoids a CPU stalling during programming operations and protects the system from power failures or other errors."

Where's this going wrong?

teetak01 commented 3 years ago

In this case application is in Bank1, and kvstore in Bank2? Usually the kvstore operations are fast, and infrequent enough, that it should not cause noticeably interruptions.

kjbracey commented 3 years ago

That's the theory. This operation is taking 800ms, which is expected, but apparently blocking interrupts for that period, which isn't expected, if in the other bank, as it looks like it should be.

ARMmbed / mbed-os