InfiniTimeOrg / InfiniTime

Firmware for Pinetime smartwatch written in C++ and based on FreeRTOS
GNU General Public License v3.0

FLASH and RAM memory analysis #313

Closed JF002 closed 3 years ago

JF002 commented 3 years ago

Now that we've just released version 1.0 of this project, I think it would be useful to take some time to analyze the memory (RAM and FLASH) usage. Indeed, both memories are nearly full right now: the firmware is ~410KB out of the 460KB available, and it's becoming harder and harder to find RAM to allocate buffers when needed.

FLASH memory usage should be fairly easy to map using the .map file.

RAM usage is a bit more difficult to figure out. I've already done a bit of research:

Mapping memory usage will allow us to see what needs to be optimized and which solutions we can implement so that we can continue to add new features and applications.

ObiKeahloa commented 3 years ago

Can we try reducing the FreeRTOS heap size, or is it already at its smallest?

We could also try to remake NimBLE but that would require a lot of handwork and isn't guaranteed to succeed.

kieranc commented 3 years ago

This shows the reduction in the DFU file size that I saw when disabling each of the following components:

image

DavidVentura commented 3 years ago

I did some quick analysis using a fork of linkermapviz

  105596 liblvgl.a
  101668 libnimble.a
   19443 lv_font_navi_80.c.o
   18872 libnrf-sdk.a
   14428 bg_clock.c.o
   11901 libc_nano.a
    9682 Navigation.cpp.o
    7422 SystemInfo.cpp.o
    7080 libgcc.a
    6780 bma423.c.o

I uploaded the results of running linkermapviz against develop: full, and without nimble and lvgl (together they take 50% of the space and skew the perspective)

ObiKeahloa commented 3 years ago

We could try to remake the heavy hitters (The Navigation and AnalogWatchface) from scratch and see if it helps.

ObiKeahloa commented 3 years ago

About NimBLE , we could again try replacing parts of it with our own custom made variant or we could try to completely ditch it and make our own stack based on it.(It does seem to use a WHOLE lot of memory)

DavidVentura commented 3 years ago

it is possible that we are linking the entire static libs while we only use part of them.

i spent a while playing with LTO to try and shake the unused bits off, but i did not succeed (managed to get linking working but the resulting file was larger, so probably it both mangled LTO and disabled linker GC pass)

ObiKeahloa commented 3 years ago

it is possible that we are linking the entire static libs while we only use part of them -- i spent a long while playing with LTO to try and shake the unused bits off, but it did not work (linking worked but the resulting bin file was larger)

True, is there any way to find out which static libs are being used?

DavidVentura commented 3 years ago

Check the links i posted, the ones ending in .a are static


Avamander commented 3 years ago

About NimBLE , we could again try replacing parts of it with our own custom made variant or we could try to completely ditch it and make our own stack based on it.(It does seem to use a WHOLE lot of memory)

Absolutely not, I'm very against this. This is not a maintenance burden that's reasonable to take.

Avamander commented 3 years ago

it is possible that we are linking the entire static libs while we only use part of them -- i spent a long while playing with LTO to try and shake the unused bits off, but it did not work (linking worked but the resulting bin file was larger)

As discussed, incorrect linker/compiler flags or their combinations, but yes it is difficult to achieve correct results. Correct LTO doesn't add size.

Avamander commented 3 years ago

We could try to remake the heavy hitters (The Navigation and AnalogWatchface) from scratch and see if it helps.

No, that's unlikely to help in any significant way. Just store the used resources on external storage. Code is very very dense and ultra rarely the actual issue (that automated tools can't optimize).

DavidVentura commented 3 years ago

it is possible that we are linking the entire static libs while we only use part of them -- i spent a long while playing with LTO to try and shake the unused bits off, but it did not work (linking worked but the resulting bin file was larger)

As discussed, incorrect linker/compiler flags or their combinations, but yes it is difficult to achieve correct results. Correct LTO doesn't add size.

yea that was implied in my message - now i see it was unclear; sorry for the noise. will update the comment

ObiKeahloa commented 3 years ago

We could try to remake the heavy hitters (The Navigation and AnalogWatchface) from scratch and see if it helps.

No, that's unlikely to help in any significant way. Just store the used resources on external storage. Code is very very dense and ultra rarely the actual issue (that automated tools can't optimize).

About NimBLE , we could again try replacing parts of it with our own custom made variant or we could try to completely ditch it and make our own stack based on it.(It does seem to use a WHOLE lot of memory)

Absolutely not, I'm very against this. This is not a maintenance burden that's reasonable to take.

Well, is there any way we can reduce the memory usage of NimBLE? I do understand why you're against an entirely new Bluetooth stack, but removing parts of it might work.

DavidVentura commented 3 years ago

Compiling with size optimization (-Os) brings ~80KB savings (see this comment for details), down to ~320KB. I can't notice any performance difference when using the watch.

Another 35-40KB can be moved to the external SPI flash (fonts + analog clock bg), although adding LittleFS support would add ~20KB, making the net gain ~15-20KB

ObiKeahloa commented 3 years ago

Compiling with size optimization (-Os) brings ~80KB savings (see this comment for details), down to ~320KB. I can't notice any performance difference when using the watch.

Another 35-40KB can be moved to flash (fonts + analog clock bg), although adding LittleFS support would bring ~20KB making net gains ~20KB

So if we move it to the SPI flash we save 30-40KB (let's say 35KB), and then LittleFS takes another 20KB, so the net gain is about 15KB.

JF002 commented 3 years ago

Thanks everyone for your work so far! :+1:

@MysteriousLog6 I don't think that rewriting part or the whole of the BLE stack would be a good idea. The BLE stack is a very complex piece of code and I doubt we would be able to do better than NimBLE or NRF SoftDevice (the 2 BLE stacks I've worked with so far, and they both need a significant amount of flash and RAM space). In my opinion, a better way to handle that would be to contribute to NimBLE directly if there are some parts of the code you can improve, so that we'll benefit from these optimizations when we upgrade to a more recent version of NimBLE.

You've also identified 2 apps (navigation and analog watchface) that use a lot of flash space. Rewriting them probably won't help, as most of the space is taken by graphical assets. As @Avamander said, the most straightforward solution would be to store them in the external flash memory (4MB).

Regarding the use of static libraries, I thought that the linker would automatically remove parts of the code that were never used but... we might want to check that!

@DavidVentura Thanks for your tools and measurements, I'll definitely have a better look at them as soon as I have some time! It looks like compiling/linking with -Os might be a first easy step to free a significant amount of flash space!

Avamander commented 3 years ago

Regarding the use of static libraries, I thought that the linker would automatically remove parts of the code that were never used but... we might want to check that!

Some, but not all. LTO takes it to a next level, but is not trivial to do correctly. I've mentioned that, somewhere, been a long week.

It looks like compiling/linking with -Os might be a first easy step to free a significant amount of flash space!

It just has to be monitored that necessary fast paths are not sacrificed (e.g. display driver speed).

JF002 commented 3 years ago

Linkermapviz is a really great tool, thanks, @DavidVentura ! I modified the script to generate a .csv file of all the symbols from BSS, DATA, RODATA and TEXT sections of the code (on commit 9ab298c09e273479822d10aad9f7bfe1d287ce75). This is just quick'n'dirty, so here's the code : init.py.txt

and the resulting csv file: output.csv

Sections:

-> FLASH usage = TEXT + RODATA + DATA
-> RAM usage = DATA + BSS + heap + Stack
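
As a rough illustration of the two formulas above, here is a minimal C++ sketch. The struct, the function names and the example numbers are placeholders invented for this sketch; they are not measurements from InfiniTime. DATA appears in both totals because its initial values are stored in flash and copied to RAM at boot.

```cpp
#include <cstdint>

// Sizes in bytes of the linker output sections, plus the heap and stack regions.
struct MemorySections {
  std::uint32_t text;    // executable code
  std::uint32_t rodata;  // constants (strings, fonts, images)
  std::uint32_t data;    // initialized variables
  std::uint32_t bss;     // zero-initialized variables
  std::uint32_t heap;
  std::uint32_t stack;
};

// FLASH usage = TEXT + RODATA + DATA (DATA's initial values live in flash).
constexpr std::uint32_t FlashUsage(const MemorySections& s) {
  return s.text + s.rodata + s.data;
}

// RAM usage = DATA + BSS + heap + stack (DATA is copied to RAM at boot).
constexpr std::uint32_t RamUsage(const MemorySections& s) {
  return s.data + s.bss + s.heap + s.stack;
}

// Placeholder numbers for illustration only, not InfiniTime measurements.
constexpr MemorySections kExample{300000, 100000, 1000, 30000, 8192, 8192};
static_assert(FlashUsage(kExample) == 401000, "TEXT + RODATA + DATA");
static_assert(RamUsage(kExample) == 47384, "DATA + BSS + heap + stack");
```

Feeding the per-section totals from the .csv file into such a helper would give the two figures to compare against the available flash and RAM.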

Here are my observations regarding RAM usage:

To be continued:

JF002 commented 3 years ago

Here's another article about Tools for Firmware Code Size Optimization, which links to puncover, which looks reaaaaaally interesting!

Puncover is really easy to install:

image

ObiKeahloa commented 3 years ago

Maybe this will help:

https://github.com/apache/mynewt-nimble/issues/950

JF002 commented 3 years ago

I did a deep analysis of the usage of the buffer dedicated to lvgl (managed by lv_mem). This buffer is used by lvgl to allocate memory for drivers (display/touch), screens, themes, and all widgets created by the apps.

The usage of this buffer can be monitored using this code :

lv_mem_monitor_t mon;
lv_mem_monitor(&mon);
NRF_LOG_INFO("\t Free %d / %d -- max %d", mon.free_size, mon.total_size, mon.max_used);

The most interesting metric is mon.max_used which specifies the maximum number of bytes that were used from this buffer since the initialization of lvgl. According to my measurements, initializing the theme, display/touch driver and screens cost 4752 bytes! Then, initializing the digital clock face costs 1541 bytes. For example a simple lv_label needs ~140 bytes of memory.

I tried to monitor this max value while going through all the apps of InfiniTime 1.1 : the max value I've seen is 5660 bytes. It means that we could probably reduce the size of the buffer from 14KB to 6 - 10 KB (we have to take the fragmentation of the memory into account).

JF002 commented 3 years ago

I also noticed that LVGL allows specifying a custom memory manager instead of the default one implemented in lv_mem. That would allow us to use the FreeRTOS memory manager for LVGL. This way, we would have to allocate only 1 memory buffer for the RTOS and LVGL, thus reducing the overhead (both buffers are a bit bigger than actually needed).

https://github.com/JF002/InfiniTime/blob/develop/src/libs/lv_conf.h#L74
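
A rough sketch of what this could look like in lv_conf.h, assuming the LV_MEM_CUSTOM options of LVGL 7 and the pvPortMalloc()/vPortFree() functions of FreeRTOS. This is only an illustration of the idea and has not been tried on InfiniTime here; the FreeRTOS heap would also need to grow to absorb LVGL's allocations.

```cpp
/* lv_conf.h fragment (sketch): route LVGL allocations to the FreeRTOS heap */
#define LV_MEM_CUSTOM 1                     /* 1: do not use the built-in lv_mem buffer */
#define LV_MEM_CUSTOM_INCLUDE <FreeRTOS.h>  /* header that declares the two functions below */
#define LV_MEM_CUSTOM_ALLOC pvPortMalloc    /* FreeRTOS allocation function */
#define LV_MEM_CUSTOM_FREE vPortFree        /* FreeRTOS deallocation function */
```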

ObiKeahloa commented 3 years ago

I also noticed that LVGL allows to specify a custom memory manager instead of the default one implemented in lv_mem. That would allow us to use the FreeRTOS memory manager for LVGL. This way, we would have to allocate only 1 memory buffer for the RTOS and LVGL, thus reducing the overhead (both buffers are a bit bigger than actually needed).

https://github.com/JF002/InfiniTime/blob/develop/src/libs/lv_conf.h#L74

But would we have to increase the buffer size, or would it remain the same?

JF002 commented 3 years ago

But would we have to increase the buffer size or remain the same.

We would probably have to increase the size of the buffer allocated to FreeRTOS, but also remove the buffer allocated to lvgl. For now, both buffers are allocated with a size that is probably a bit larger than necessary (= overhead). If we use only one buffer, maybe we can reduce that overhead. To be analyzed :)

ObiKeahloa commented 3 years ago

But would we have to increase the buffer size or remain the same.

We would probably have to increase the size of the buffer allocated to freertos, but also remove the buffer allocated to lvgl. For now, both buffer are allocated with a size probably a bit larger than necessary (= overhead). If we use only one buffer, maybe we can reduce that overhead. To be analyzed :)

Will the overhead be necessary in the future, say when updating to LVGL 8?

If so, it would be nice to still keep a small overhead to be safe.

JF002 commented 3 years ago

Yes, of course... but a (one) small overhead is better than 2 :) Anyway, this is just an idea, I don't know if that'll work.

And we will probably have to review the memory allocation as we add features and update components and libraries, this is not written in stone.

The issue we are facing right now is that it's really difficult to add anything in the firmware because there's almost no RAM/Flash memory available.

JF002 commented 3 years ago

"Global" stack analysis:

This stack is used for everything except tasks, which have their own stacks allocated by FreeRTOS. The stack is 8192 bytes and is allocated in the linker script. An easy way to monitor its usage is to fill the section with a known pattern at boot time, then use the firmware and dump the memory. You can then determine the maximum stack usage by finding the lowest address in the stack section where the pattern was overwritten.

Fill the stack section by a known pattern:

Edit /modules/nrfx/mdk/gcc_startup_nrf52.S and add the following code after the copy of the data from read only memory to RAM at around line 243:

/* Loop to copy data from read only memory to RAM.
 * The ranges of copy from/to are specified by following symbols:
 *      __etext: LMA of start of the section to copy from. Usually end of text
 *      __data_start__: VMA of start of the section to copy to.
 *      __bss_start__: VMA of end of the section to copy to. Normally __data_end__ is used, but by using __bss_start__
 *                    the user can add their own initialized data section before BSS section with the INSERT AFTER command.
 *
 * All addresses must be aligned to 4 bytes boundary.
 */
    ldr r1, =__etext
    ldr r2, =__data_start__
    ldr r3, =__bss_start__

    subs r3, r3, r2
    ble .L_loop1_done

.L_loop1:
    subs r3, r3, #4
    ldr r0, [r1,r3]
    str r0, [r2,r3]
    bgt .L_loop1

.L_loop1_done:

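/* Add this code to fill the stack section with 0xFFEEDDBB */
    ldr     r0, =__StackLimit   /* r0 = lowest address of the stack section */
    ldr     r1, =8192           /* r1 = number of bytes left to fill (stack size) */
    ldr     r2, =0xFFEEDDBB     /* r2 = fill pattern */
.L_fill:
    str     r2, [r0]            /* write one 32-bit word of pattern */
    adds    r0, #4              /* move to the next word */
    subs    r1, #4              /* 4 fewer bytes to fill */
    bne     .L_fill             /* loop until the whole section is filled */
/* -- */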

Dump RAM memory and check usage

Dumping the content of the ram is easy using JLink debugger and nrfjprog:

nrfjprog --readram ram.bin

You can then display the file using hexdump:

hexdump ram.bin -v  |less

The stack is positioned at the end of the RAM, so its last byte is at offset 0xFFFF in the dump. Its size is 8192 bytes, so the lowest address of the stack is 0xE000. In the following dump, the first overwritten word is at 0xFDF8, so the maximum stack usage is 520 bytes (0x10000 - 0xFDF8):

000fdb0 ddbb ffee ddbb ffee ddbb ffee ddbb ffee
000fdc0 ddbb ffee ddbb ffee ddbb ffee ddbb ffee
000fdd0 ddbb ffee ddbb ffee ddbb ffee ddbb ffee
000fde0 ddbb ffee ddbb ffee ddbb ffee ddbb ffee
000fdf0 ddbb ffee ddbb ffee ffff ffff c24b 0003
000fe00 ffff ffff ffff ffff ffff ffff 0000 0000
000fe10 0018 0000 0000 0000 0000 0000 fe58 2000
000fe20 0000 0000 0000 00ff ddbb 00ff 0018 0000
000fe30 929c 2000 0000 0000 0018 0000 0000 0000
000fe40 92c4 2000 0458 2000 0000 0000 80e7 0003
000fe50 0000 0000 8cd9 0003 ddbb ffee ddbb ffee
000fe60 00dc 2000 92c4 2000 0005 0000 929c 2000
000fe70 007f 0000 feb0 2000 92c4 2000 feb8 2000
000fe80 ddbb ffee 0005 0000 929c 2000 0000 0000
000fe90 aca0 2000 0000 0000 0028 0000 418b 0005
000fea0 02f4 2000 001f 0000 0000 0000 0013 0000
000feb0 b5a8 2000 2199 0005 b5a8 2000 2201 0005
000fec0 b5a8 2000 001e 0000 0000 0000 0013 0000
000fed0 b5b0 2000 0fe0 0006 b5a8 2000 0000 0000
000fee0 0013 0000 2319 0005 0013 0000 0000 0000
000fef0 0000 0000 3b1c 2000 3b1c 2000 d0e3 0000
000ff00 4b70 2000 54ac 2000 4b70 2000 ffff ffff
000ff10 0000 0000 1379 0003 6578 2000 0d75 0003
000ff20 6578 2000 ffff ffff 0000 0000 1379 0003
000ff30 000c 0000 cfeb 0002 39a1 2000 a824 2000
000ff40 0015 0000 cfeb 0002 39a1 2000 a824 2000
000ff50 39a1 2000 0015 0000 001b 0000 b4b9 0002
000ff60 0000 0000 a9f4 2000 4b70 2000 0d75 0003
000ff70 4b70 2000 ffff ffff 0000 0000 1379 0003
000ff80 ed00 e000 a820 2000 1000 4001 7fc0 2000
000ff90 7f64 2000 75a7 0001 a884 2000 7b04 2000
000ffa0 a8c0 2000 0000 0000 0000 0000 0000 0000
000ffb0 7fc0 2000 7f64 2000 8024 2000 a5a5 a5a5
000ffc0 ed00 e000 3fd5 0001 0000 0000 72c0 2000
000ffd0 0000 0000 72e4 2000 3f65 0001 7f64 2000
000ffe0 0000 2001 0000 0000 ef30 e000 0010 0000
000fff0 7fc0 2000 4217 0001 3f0a 0001 0000 6100

According to my experiments, we don't use the stack that much, and 8192 bytes is probably way too big for InfiniTime!
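
The manual check of the dump can be automated. Below is a minimal C++ sketch under the following assumptions: the stack section of ram.bin has already been loaded as little-endian 32-bit words, ordered from the lowest address (__StackLimit) to the top of the stack, and the names kStackFillPattern and StackUsageBytes are made up for this example. It scans from the lowest address, because stray words inside the used area may still hold the pattern (as at 0xFE58 in the dump).

```cpp
#include <cstddef>
#include <cstdint>

// Fill pattern written by the startup code (0xFFEEDDBB in the assembly above).
constexpr std::uint32_t kStackFillPattern = 0xFFEEDDBB;

// Returns the maximum stack usage in bytes. `words` is the stack section as
// 32-bit words ordered from the lowest address to the top of the stack.
// The stack grows downwards, so the scan starts at the lowest address and
// stops at the first word that no longer holds the pattern: everything from
// that word up to the top of the stack is counted as used.
constexpr std::size_t StackUsageBytes(const std::uint32_t* words, std::size_t wordCount) {
  std::size_t untouched = 0;
  while (untouched < wordCount && words[untouched] == kStackFillPattern) {
    ++untouched;
  }
  return (wordCount - untouched) * sizeof(std::uint32_t);
}
```

Applied to the 2048 words of the stack section, and assuming the words below 0xFDB0 (not shown in the dump) still hold the pattern, this would report the same 520 bytes. The result is only an estimate: a word that was used but happens to contain the pattern value, or that was reserved but never written, is not detected.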

JF002 commented 3 years ago

"Global" heap usage

The heap is declared in the linker script and its current size is 8192 bytes. The heap is used for dynamic memory allocation (malloc(), new, ...).

Heap monitoring is not easy, but it seems that we can use the following code to know the current usage of the heap:

auto m = mallinfo();
NRF_LOG_INFO("heap : %d", m.uordblks);

According to my experiments, InfiniTime uses ~6000 bytes of heap most of the time, except when the Navigation app is launched, where the heap usage increases to... more than 9500 bytes (meaning that the heap overflows and could potentially corrupt the stack!!!). This is a bug that should be fixed in #362.

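To know exactly what's consuming heap memory, you can wrap functions like malloc() into your own functions. In this wrapper, you can add logging code or put breakpoints:

  • Add -Wl,-wrap,malloc to the cmake variable LINK_FLAGS of the target you want to debug (pinetime-app, most probably)
  • Add the following code in main.cpp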

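extern "C" {
// The original malloc(), made available under this name by the linker wrap option
void* __real_malloc(size_t);
// Called in place of malloc(): add logging code or a breakpoint here
void* __wrap_malloc(size_t size) {
  return __real_malloc(size);
}
}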

Now, your function __wrap_malloc() will be called instead of malloc(). You can call the actual malloc from the stdlib by calling __real_malloc().
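
As an illustration of what the wrapper could record, here is a minimal C++ sketch. AllocStats, allocStats and the ENABLE_MALLOC_WRAP macro are hypothetical names introduced for this example; they are not part of InfiniTime. The wrapper itself is only compiled when that macro is defined, because __real_malloc only exists when the target is linked with the wrap option.

```cpp
#include <cstddef>

// Hypothetical helper: accumulates statistics about the allocation requests
// seen by the wrapper.
struct AllocStats {
  std::size_t calls = 0;       // number of malloc() calls observed
  std::size_t totalBytes = 0;  // sum of all requested sizes
  std::size_t largest = 0;     // largest single request

  void Record(std::size_t size) {
    ++calls;
    totalBytes += size;
    if (size > largest) {
      largest = size;
    }
  }
};

AllocStats allocStats;

// Define ENABLE_MALLOC_WRAP (made-up macro) only for builds that are linked
// with -Wl,-wrap,malloc; otherwise __real_malloc is an undefined symbol.
#ifdef ENABLE_MALLOC_WRAP
extern "C" {
void* __real_malloc(std::size_t);
void* __wrap_malloc(std::size_t size) {
  allocStats.Record(size);  // convenient place for a breakpoint or a log call
  return __real_malloc(size);
}
}
#endif
```

Note that totalBytes counts requested bytes, not live bytes: free() is not wrapped in this sketch, so it cannot tell how much of the heap is in use at a given moment.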

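Using this technique, I was able to trace all malloc calls at boot (boot -> digital watchface):

  • system task = 3464 bytes (SystemTask could potentially be declared as a global variable to avoid heap allocation here)
  • string music = 31 (maybe we should not use std::string when not needed, as it does heap allocation)
  • ble_att_svr_start = 1720
  • ble gatts start = 40 + 88
  • ble ll task = 24
  • display app = 104
  • digital clock = 96 + 152
  • hr task = 304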

Interesting articles : https://www.embedded.com/mastering-stack-and-heap-for-system-reliability-part-1-calculating-stack-size/ https://www.embedded.com/mastering-stack-and-heap-for-system-reliability-part-2-properly-allocating-stacks/ https://www.embedded.com/mastering-stack-and-heap-for-system-reliability-part-3-avoiding-heap-errors/

ObiKeahloa commented 3 years ago

"Global" heap usage

The heap is declared in the linker script and its current size is 8192 bytes. The heap is used for dynamic memory allocation(malloc(), new,...).

Heap monitoring is not easy, but it seems that we can use the following code to know the current usage of the heap:

auto m = mallinfo();
NRF_LOG_INFO("heap : %d", m.uordblks);

According to my experimentation, InfiniTime uses ~6000bytes of heap most of the time. Except when the Navigation app is launched, where the heap usage increases to... more than 9500 bytes (meaning that the heap overflows and could potentially corrupt the stack!!!). This is a bug that should be fixed in #362.

To know exactly what's consuming heap memory, you can wrap functions like malloc() into your own functions. In this wrapper, you can add logging code or put breakpoints:

  • Add -Wl,-wrap,malloc to the cmake variable LINK_FLAGS of the target you want to debug (pinetime-app, most probably)
  • Add the following code in main.cpp
extern "C" {
void *__real_malloc (size_t);
void* __wrap_malloc(size_t size) {
  return __real_malloc(size);
}
}

Now, your function __wrap_malloc() will be called instead of malloc(). You can call the actual malloc from the stdlib by calling __real_malloc().

Using this technique, I was able to trace all malloc calls at boot (boot -> digital watchface):

  • system task = 3464 bytes (SystemTask could potentially be declared as a global variable to avoid heap allocation here)
  • string music = 31 (maybe we should not use std::string when not needed, as it does heap allocation)
  • ble_att_svr_start = 1720
  • ble gatts start = 40 + 88
  • ble ll task = 24
  • display app = 104
  • digital clock = 96 + 152
  • hr task = 304

Interesting articles : https://www.embedded.com/mastering-stack-and-heap-for-system-reliability-part-1-calculating-stack-size/ https://www.embedded.com/mastering-stack-and-heap-for-system-reliability-part-2-properly-allocating-stacks/ https://www.embedded.com/mastering-stack-and-heap-for-system-reliability-part-3-avoiding-heap-errors/

About the stack overflow: is this somehow related to #327? Fixing it could probably solve two problems at once.

JF002 commented 3 years ago

@MysteriousLog6

About the overflow of stack is this somehow related to #327 , this could probably solve two problems at once.

Yes, it could be the cause of the issue, indeed. Any other memory issue could also cause that kind of error, though... I've just tried to open/close SystemInfo 20-30x on current develop without any crash.

JF002 commented 3 years ago

I summarized this analysis in the following PR : https://github.com/JF002/InfiniTime/pull/411

kartikCypherock commented 1 year ago

I also noticed that LVGL allows to specify a custom memory manager instead of the default one implemented in lv_mem. That would allow us to use the FreeRTOS memory manager for LVGL. This way, we would have to allocate only 1 memory buffer for the RTOS and LVGL, thus reducing the overhead (both buffers are a bit bigger than actually needed).

https://github.com/JF002/InfiniTime/blob/develop/src/libs/lv_conf.h#L74


I did a deep analysis of the usage of the buffer dedicated for lvgl (managed by lv_mem). This buffer is used by lvgl to allocated memory for drivers (display/touch), screens, themes, and all widgets created by the apps.

The usage of this buffer can be monitored using this code :

lv_mem_monitor_t mon;
lv_mem_monitor(&mon);
NRF_LOG_INFO("\t Free %d / %d -- max %d", mon.free_size, mon.total_size, mon.max_used);

The most interesting metric is mon.max_used which specifies the maximum number of bytes that were used from this buffer since the initialization of lvgl. According to my measurements, initializing the theme, display/touch driver and screens cost 4752 bytes! Then, initializing the digital clock face costs 1541 bytes. For example a simple lv_label needs ~140 bytes of memory.

I tried to monitor this max value while going through all the apps of InfiniTime 1.1 : the max value I've seen is 5660 bytes. It means that we could probably reduce the size of the buffer from 14KB to 6 - 10 KB (we have to take the fragmentation of the memory into account).

Hey, I am sorry to respond on a closed thread, but we are working on an embedded system project which utilizes the LVGL library for UI rendering and input handling. Similar to your analysis, I was able to perform memory profiling for objects using the LVGL memory monitors. For our project, it came out around 5KiB during general usage. Therefore, we allocated a static buffer to the library and are happy with the configuration.

However, we are unsure about stack depth caused by the lv_task_handler. Do you have any estimate as to how much stack each call to this function grows? LVGL has plenty of callbacks (signal_cb, event_cb), and therefore we are not able to accurately measure the stack usage. We were wondering if you have had some analysis in this direction before? It would be great to learn about the same.

We are looking at revamping our code architecture at this stage, therefore are looking to do some ground work around memory profiling. I stumbled upon your project while I was digging issues on the LVGL repository.

PS: We currently don't use any RTOS framework and don't have any plans to integrate it soon.

Thanks in advance! Looking forward to your response.

JF002 commented 1 year ago

@kartikCypherock Don't worry, that's fine to respond to this closed thread! Memory management is a complex topic, and that's exactly why I take the time to document all my findings! I'm glad to see that this can be useful to other developers :)

FYI I continue this analysis in a new post here and opened a new PR to unify all heaps here.

Regarding your question about the stack usage, I wrote this comment about the "global" stack analysis. In this context, I call the Global stack the main stack of the application (as opposed to the stack of the various FreeRTOS tasks that are running in InfiniTime). I can't remember any way to monitor the stack usage at runtime, but you can use those techniques to ensure that your application does not overflow when running specific use-cases.

LVGL is integrated in a FreeRTOS task so we can use tools from the OS to get info about the stack usage at runtime : current stack usage and the minimum amount of memory available in the stack since the beginning of the execution. This is very useful to find the best size for all the stacks.