Closed mikaleppanen closed 3 years ago
@mikaleppanen thank you for raising this issue. Please take a look at the following comments:
We cannot automatically identify a release based on the version of Mbed OS that you have provided. Please provide either a single valid sha of the form #abcde12 or #3b8265d70af32261311a06e423ca33434d8d80de or a single valid release tag of the form mbed-os-x.y.z . E.g. 'https://github.com/ARMmbed/mbed-os/commits/3a86360cc57cbb6a35c4595b01e361ae919f2' has not been matched as a valid tag or sha. NOTE: If there are fields which are not applicable then please just add 'n/a' or 'None'. This indicates to us that at least all the fields have been considered. Please update the issue header with the missing information; the issue will not be mirrored to our internal defect tracking system or investigated until this has been fully resolved.
Thank you for raising this detailed GitHub issue. I am now notifying our internal issue triagers. Internal Jira reference: https://jira.arm.com/browse/IOTOSM-2516
Looks like this is caused by overflow of "uint8_t _unacknowledged_ticks;" in https://github.com/ARMmbed/mbed-os/blob/master/platform/include/platform/internal/SysTimer.h when doing large KV store writes. If the 8-bit value is changed to 16-bit, the problem no longer occurs.
cc @mbed-os-core
In the above application, initializing the KV store takes about 800 ms. The low power timer advances by that time, but the RTOS SysTick is not increased by the full amount, because when the SysTick is synchronized, the maximum for unacknowledged ticks is 255.
@0xc0170 @mbed-os-core Could a correction for this be prioritized? This prevents Wi-SUN stack testing with our Wi-SUN Border Router on DISCO-F769NI reference hardware. Long KV store writes cause the event queue on Nanostack to halt. This happens e.g. on the Wi-SUN Pelion Border Router when the KV store is erased during power on.
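To illustrate the overflow, here is a minimal sketch, not the actual SysTimer code. It assumes a 1 ms tick and a pending-tick counter that simply wraps when it overflows; `ticks_delivered` and `ticks_lost` are hypothetical helper names made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model: count `elapsed_ticks` tick interrupts into a pending
// counter of type Counter while the RTOS is unable to acknowledge them.
// Unsigned counters wrap on overflow, so a narrow counter forgets ticks.
template <typename Counter>
uint32_t ticks_delivered(uint32_t elapsed_ticks)
{
    Counter pending = 0;
    for (uint32_t i = 0; i < elapsed_ticks; ++i) {
        ++pending; // wraps back to 0 after the counter's maximum value
    }
    return static_cast<uint32_t>(pending);
}

// Ticks that elapsed on the hardware timer but never reach the RTOS tick count.
template <typename Counter>
uint32_t ticks_lost(uint32_t elapsed_ticks)
{
    const uint32_t delivered = ticks_delivered<Counter>(elapsed_ticks);
    assert(delivered <= elapsed_ticks);
    return elapsed_ticks - delivered;
}
```

Under these assumptions, `ticks_lost<uint8_t>(800)` is 768, while `ticks_lost<uint16_t>(800)` and `ticks_lost<uint32_t>(800)` are both 0. A counter that saturates at 255 instead of wrapping would lose a different amount, but the RTOS tick would still fall behind the low power timer.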
@TuomoHautamaki @mikter
I'll share this with the team
@kjbracey-arm https://github.com/mikaleppanen/mbed-os/commit/532f8ca3c9438de6ce0755c1bb289210e763030e#diff-d04d921a0cbb394a2d6acb60d1ea52e4be0d96b540018174e294769618238b88 - this is the fix. I've noticed the comment above it saying that being 8-bit is a feature. Is the use case Wi-SUN is facing worth considering, or, rather than increasing the timeout, should it be fixed elsewhere?
This is similar to #13801, but occurring in an exceptional runtime condition rather than due to debugger interference.
As per discussion there, I think the simplest thing to do is to just increase the counter to 32-bit.
(I would recommend 32-bit over 16-bit just for code size and speed - 2 bytes of RAM isn't huge).
Letting the counter get big does mean a potentially big catch-up spin, but it's proportionate to the time you've already spent jammed. So it's not causing much more of a problem than you already had - it'll be increasing your already-occurred lock-up time by a few percent.
I would suggest that aside from the timer issue, surely there must be some sort of flash driver or config problem here? The dual-bank flash should mean you are able to write and erase the kvstore area without disrupting the execution from the other bank at all.
"The dual bank Flash memory allows a code to be executed in one bank, while another bank is being erased or programmed. It avoids a CPU stalling during programming operations and protects the system from power failures or other errors."
Where's this going wrong?
In this case the application is in Bank1 and the kvstore in Bank2? Usually the kvstore operations are fast and infrequent enough that they should not cause noticeable interruptions.
That's the theory. This operation is taking 800ms, which is expected, but it is apparently blocking interrupts for that period, which isn't expected if the kvstore is in the other bank, as it looks like it should be.
Description of defect
The Mbed OS event queue equeue_tick() function calls one of two time functions based on whether the call is made from an interrupt or a thread:
In the error case, after several KVstore writes, mbed::internal::os_timer->get_time() returns a time value ahead of the rtos::Kernel::Clock::now().time_since_epoch() call.
As a result, when equeue->call() is made from an interrupt, the event gets a time value ahead of the time value used when processing the event. When the event is then processed, the event queue waits for the time difference to elapse before triggering the event, although the event should have been triggered right away.
Here is an error log from the included application that can be used to reproduce the error:
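The effect of the two time sources disagreeing can be sketched as follows. This is a simplified model under stated assumptions, not the equeue implementation: the `Clocks` struct and both helper functions are hypothetical names introduced only for illustration.

```cpp
#include <cstdint>

// Hypothetical snapshot of the two time sources, in milliseconds.
struct Clocks {
    uint32_t irq_ms;    // value seen when the call is made from an interrupt
    uint32_t thread_ms; // value seen when the call is made from a thread
};

// An event posted with zero delay is stamped with "now" as seen by the caller.
inline uint32_t stamp_event(const Clocks &c, bool from_irq)
{
    return from_irq ? c.irq_ms : c.thread_ms;
}

// The dispatcher runs in thread context, so it waits until the thread-side
// clock reaches the event's target time. A target in the past means no wait.
inline uint32_t dispatch_wait_ms(const Clocks &c, uint32_t target_ms)
{
    const int32_t diff = static_cast<int32_t>(target_ms - c.thread_ms);
    return diff > 0 ? static_cast<uint32_t>(diff) : 0;
}
```

With the interrupt-side clock 768 ms ahead, for example `Clocks c{10768, 10000}`, an event stamped from an interrupt gives `dispatch_wait_ms(c, stamp_event(c, true)) == 768`, whereas the same event posted from a thread gives a wait of 0.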
The time differences seem to be multiples of 128 (256, 512, 768) etc.
The timer used in interrupt context is in:
https://github.com/ARMmbed/mbed-os/blob/master/hal/source/mbed_ticker_api.c
This error was encountered on Wi-SUN Border Router, with Nanostack, Ethernet and Wi-SUN IEEE 802.15.4 radio.
With the application below, without a network stack and Ethernet, the error is sometimes hard to reproduce. If a network stack and Ethernet are added to the application, the error happens within a few KVstore writes. This happens with both LwIP and Nanostack. By default, the application is configured to enable LwIP and Ethernet.
Target(s) affected by this defect ?
DISCO-F769NI
Board is in dual bank mode (https://os.mbed.com/teams/ST/wiki/How-to-enable-flash-dual-bank)
Toolchain(s) (name and version) displaying this defect ?
GNU Arm Embedded version 9 (9-2019-q4-major)
What version of Mbed-os are you using (tag or sha) ?
e1d3037c17a546ff81527a27d6159f7a2e149327
What version(s) of tools are you using. List all that apply (E.g. mbed-cli)
None
How is this defect reproduced ?
The application below runs a timer which adds a callback to the event queue; the callback then re-activates the timer. In the callback, the time difference between the timers is calculated, and a trace is printed every 3 seconds. KVstore writes are made in parallel.
mbed_app.json to enable KVstore
Application: