btashton commented 3 years ago

Why is the Build Larger

Intro

There was some discussion of the growth in code size between 9.1.0 and 10.0.0-RC0. We have some tools to help us look into this, so I did a little digging.

To be clear I am not making any statement here about if we should do something about it or not, but I wanted us to at least be aware of what caused these increases.

Discovery

For reference I am using the stm32f4discovery:nsh as the reference target because it seems to be a common target for people that has been around for a long time.

So first off here is the top level size information without with the standard configuration out of the box build which includes optimisation.

❯ arm-none-eabi-size ../testdata/opt/nuttx-*
   text    data     bss     dec     hex filename
  69947     104    2104   72155   119db ../testdata/opt/nuttx-10.0.0-RC0
  69447     104    2476   72027   1195b ../testdata/opt/nuttx-9.1.0

.text sees an increase of 500 bytes and .data for -372. These are not huge, but are also not insignificant and but we cannot keep increasing by this much every release. The decrease in .data is nice, but likely from the watchdog change so there will likely be a little increase in the dynamic allocation.

There are a lot of changes between these two releases, but we can first analyse the debug information to get a better idea what is going on. I added with kconfig-tweak the debug symbols and the used Bloaty to generate these reports

Here is the quick script I used to generate the builds I wanted to look at:

set -e +x
NUTTX_TAG=$1
APPS=apps
NUTTX=incubator-nuttx
OUTDIR=testdata
for REPO in $NUTTX $APPS
do
  pushd $REPO
  git clean -xdf
  git checkout $NUTTX_TAG
  popd
done
pushd $NUTTX
./tools/configure.sh stm32f4discovery:nsh
kconfig-tweak --enable DEBUG_SYMBOLS
make oldconfig
make -j8
popd
mkdir -p $OUTDIR
cp $NUTTX/nuttx $OUTDIR/$NUTTX_TAG

This is the comparison between 10.0.0-RC0 and 9.1.0.

❯ bloaty -n 20 -d sections,symbols ../testdata/nuttx-10.0.0-RC0 -- ../testdata/nuttx-9.1.0
    FILE SIZE        VM SIZE    
 --------------  -------------- 
  +1.8% +16.2Ki  [ = ]       0    .debug_info
  +0.7% +1.64Ki  [ = ]       0    .debug_line
  +3.8% +1.13Ki  [ = ]       0    .debug_str
  +0.6%    +636  +0.6%    +636    .text
    +1.9%    +400  +1.9%    +400    [103 Others]
    [NEW]    +284  [NEW]    +284    mm_memalign
    [NEW]    +232  [NEW]    +232    nxtask_init
    [NEW]    +212  [NEW]    +212    nxsem_clockwait
     +83%    +158   +83%    +158    fclose
     +61%    +132   +61%    +132    nsh_session
    [NEW]     +96  [NEW]     +96    cxx_initialize
     +44%     +76   +44%     +76    nx_vioctl
     +95%     +72   +95%     +72    proc_findnode
     +19%     +64   +19%     +64    nx_vopen
    +238%     +62  +238%     +62    inode_search
    [NEW]     +62  [NEW]     +62    sem_clockwait
   -72.0%     -72 -72.0%     -72    wd_initialize
    [DEL]     -76  [DEL]     -76    up_cxxinitialize
   -46.2%     -96 -46.2%     -96    nxthread_create
   -66.7%    -120 -66.7%    -120    isspace
   -76.7%    -138 -76.7%    -138    nsh_system
    [DEL]    -148  [DEL]    -148    wd_create
    [DEL]    -152  [DEL]    -152    wd_delete
   -40.4%    -194 -40.4%    -194    opendir
   -87.9%    -218 -87.9%    -218    nxsem_timedwait
  +0.9%    +432  [ = ]       0    .symtab
    +1.9%    +384  [ = ]       0    [section .symtab]
    [NEW]     +48  [ = ]       0    _inode_getcwd
    [NEW]     +48  [ = ]       0    cxx_initialize
    [NEW]     +48  [ = ]       0    inode_root_reserve
    [NEW]     +48  [ = ]       0    memalign
    [NEW]     +48  [ = ]       0    nxsem_clockwait
    [NEW]     +48  [ = ]       0    nxtask_init
    [NEW]     +48  [ = ]       0    nxtask_uninit
    [NEW]     +32  [ = ]       0    _assert
    [NEW]     +32  [ = ]       0    inited.5681
    +200%     +32  [ = ]       0    inode_search
    [NEW]     +32  [ = ]       0    mm_memalign
    [DEL]     -32  [ = ]       0    g_wdpool
   -66.7%     -32  [ = ]       0    inode_insert
   -66.7%     -32  [ = ]       0    inode_unlink
   -40.0%     -32  [ = ]       0    nx_start
   -66.7%     -32  [ = ]       0    nx_stat
    [DEL]     -48  [ = ]       0    up_cxxinitialize
    [DEL]     -48  [ = ]       0    wd_create
    [DEL]     -48  [ = ]       0    wd_delete
    -9.5%    -112  [ = ]       0    [56 Others]
  +0.3%    +374  [ = ]       0    .debug_abbrev
  +1.9%    +224  [ = ]       0    .debug_aranges
  +0.5%    +188  [ = ]       0    .debug_frame
  +0.9%    +170  [ = ]       0    .strtab
    +1.3%     +80  [ = ]       0    [section .strtab]
    [NEW]     +32  [ = ]       0    nxsem_clockwait_uninterruptible
    [NEW]     +19  [ = ]       0    inode_root_reserve
    [NEW]     +18  [ = ]       0    __FUNCTION__.5716
    [NEW]     +18  [ = ]       0    __FUNCTION__.6102
    [NEW]     +18  [ = ]       0    __FUNCTION__.6110
    [NEW]     +18  [ = ]       0    __FUNCTION__.6118
    [NEW]     +18  [ = ]       0    __FUNCTION__.6128
    [NEW]     +16  [ = ]       0    nxsem_clockwait
    [NEW]     +15  [ = ]       0    cxx_initialize
    [NEW]     +15  [ = ]       0    nxtask_startup
    [NEW]     +14  [ = ]       0    _inode_getcwd
    [NEW]     +14  [ = ]       0    nxtask_uninit
    -7.2%      -5  [ = ]       0    [13 Others]
    [DEL]     -13  [ = ]       0    g_wdfreelist
    [DEL]     -17  [ = ]       0    up_cxxinitialize
    [DEL]     -18  [ = ]       0    __FUNCTION__.5606
    [DEL]     -18  [ = ]       0    __FUNCTION__.5991
    [DEL]     -18  [ = ]       0    __FUNCTION__.5999
    [DEL]     -18  [ = ]       0    __FUNCTION__.6007
    [DEL]     -18  [ = ]       0    __FUNCTION__.6019
  +0.4%     +36  +0.4%     +36    .ARM.extab
  +0.4%     +24  +0.4%     +24    .ARM.exidx
 -29.4%      -5  [ = ]       0    [Unmapped]
  -5.6%     -24  [ = ]       0    .debug_ranges
  [ = ]       0 -14.8%    -372    .bss
    [ = ]       0  +8.9%     +20    g_idletcb
    [ = ]       0  [NEW]      +4    inited.5681
    [ = ]       0 -13.3%      -2    [section .bss]
    [ = ]       0  [DEL]      -2    g_wdnfree
    [ = ]       0  [DEL]      -8    g_wdfreelist
    [ = ]       0  [DEL]    -384    g_wdpool
  +1.3% +20.9Ki  +0.3%    +324    TOTAL

Analysis

From this we can see some wins around the watchdog, although that is likely at the cost of some dynamic allocations.

The new symbols are good places for us to look for the growth. Some of these are places were logic has been re-factored an example may be where cxx_initialized is new and up_cxxinitialize is old.

Other places we see large features being brought in such as mm_memalign. Lets take a look at this Here are some that jumped out to me:

mm_memalign (new)
nxtask_init (new)
nxsem_clockwait (new)
fclose
nsh_session
proc_findnode

To dig into what these builds actually look like I am using Cutter https://cutter.re/ which is mostly just a frontend for Radare2.

`mm_memalign` (+284)

This is a new symbol included in the build, but a function that has existed for a long time. We can look at the global call graph to see what calls it: It is only called by up_create_stack

We can then look at the local call graph between the two releases for up_create_stack 10.0.0-RC0 9.1.0

We have made a change between the releases that is not calling memalign instead of malloc. Taking a look a the diff between the releases we see:

 #else /* CONFIG_TLS_ALIGNED */
@@ -163,14 +172,16 @@ int up_create_stack(FAR struct tcb_s *tcb, size_t stack_size, uint8_t ttype)

       if (ttype == TCB_FLAG_TTYPE_KERNEL)
         {
-          tcb->stack_alloc_ptr = (uint32_t *)kmm_malloc(stack_size);
+          tcb->stack_alloc_ptr =
+              (uint32_t *)kmm_memalign(CONFIG_STACK_ALIGNMENT, alloc_size);
         }
       else
 #endif
         {
           /* Use the user-space allocator if this is a task or pthread */

-          tcb->stack_alloc_ptr = (uint32_t *)kumm_malloc(stack_size);
+          tcb->stack_alloc_ptr =
+              (uint32_t *)kumm_memalign(CONFIG_STACK_ALIGNMENT, alloc_size);
         }
 #endif /* CONFIG_TLS_ALIGNED */

This change came in as part of this PR https://github.com/apache/incubator-nuttx/pull/1562

`proc_findnode` (+72)

This increase seems to have come in as part of this https://github.com/apache/incubator-nuttx/pull/1490 This is a minor code change to fix a bug, but the generated code is surprisingly large.

`fclose` (+158)

This largely comes from code complexity change that came in as https://github.com/apache/incubator-nuttx/pull/1611 Once again the change looks small, but ends up creating a fair bit of code.

`nsh_session` (+132)

Used the same technique as before looking at the call graph. We have a quite a bit of complexity increase here that came in with https://github.com/apache/incubator-nuttx-apps/pull/367 which adds the ability to pass in arguments to nsh. I suspect this functionality might have been best held behind a config flag. The file stream change added some code size here, but very little.

Also to be fair we got some saving in the change on nsh_system (-138) so this may have come out as a net gain in code savings.

`nxtask_init` (+232)

(TODO)

`nxsem_clockwait` (+212)

(TODO)

btashton commented 3 years ago

@patacongo @xiaoxiang781216 @acassis @davids5 for awareness. None of these seem like "bugs" and are well intentioned, but it is something we should be aware of. The memalign one is the big one, I am not sure if there is something we can do here, I was not very involved in the TLS changes.

patacongo commented 3 years ago

@patacongo @xiaoxiang781216 @acassis @davids5 for awareness. None of these seem like "bugs" and are well intentioned, but it is something we should be aware of. The memalign one is the big one, I am not sure if there is something we can do here, I was not very involved in the TLS changes.

The build for this particular configuration increases because mm_memalign() was not included in the build before and is now drawn into the build because new code in up_create_stack() now references it.

The only way to avoid that would be to avoid calling mm_memalign() (or any of its variants in the different namespaces).

Other configurations that already used mm_memalign() would, therefore, not see any code growth but most simple configurations will certainly be impacted. So I don't think the problem is really so significant, but does increase the base size of most minimal configurations. That is unfortunate primarily for marketing reasons since it makes the OS seem much large in most size investigations.

patacongo commented 3 years ago

The only way to avoid that would be to avoid calling mm_memalign() (or any of its variants in the different namespaces).

The previous code avoided this code growth by using a simple trick: It just allocated memory using mm_malloc() family functions and adjusted the base address and size to accomplish the alignment. This does not add code size because mm_malloc() is always present. mm_memalign(), on the other hand, is a larger, complex function that is not normally included in the link.

btashton commented 3 years ago

Yeah I agree. I mostly just wanted to put this together so there was some level of due diligence on the release and we were not just saying looks like code size is larger ¯\_(ツ)_/¯ I run my script against some other configurations including some network related ones, and found the increase to be similar. Most of them did not use memalign before so this is likely a cost in most configurations, and most of the file related functions that deal with streams gained a little bit of code size each which added up to ~200-300 bytes new code and was counteracted by some other clean-up in common code.

patacongo commented 3 years ago

It would be nice to know about code size issues at the time the PR is generated vs. he 11th hour before a release.

patacongo commented 3 years ago

Are you considering a "Code Size" section in the release note?

btashton commented 3 years ago

It would be nice to know about code size issues at the time the PR is generated vs. he 11th hour before a release.

I agree. This is part of what I am working on, we now store all the binaries from the CI runs so we can do analysis on them, but I'm trying to figure out what the right thing to do to show the results.

Showing the code size change for 200 configurations in a github comment is not helpful, but I might be able to highlight configurations where there has been a high percentage change or new/deleted symbols across all the builds.

For core OS features I imagine we can get a lot of benefit just looking at something like ostest configurations on a couple architectures. It may not be super representative of real world configurations, but I suspect it will bring things out.

I'll see what I can come up with and then we can iterate on it.

Are you considering a "Code Size" section in the release note?

Probably not a bad idea going forward. It also shows that it is something we are prioritizing which I think is important.

patacongo commented 3 years ago

Hmm.. for the case of ARM with CONFIG_MM_SMALL=n, the memory returned by mm_malloc() will already be aligned correctly to 8-byte address boundaries. That is required because the allocated memory must be capable of holding aligned long long and aligned double types in allocated structures. So for this particular case, switching to mm_memalign() is an unfortunate waste and completely unnecessary.

For the case of ARM EABI, the alignment requirements of the stack and and all memory allocated via the mm_malloc() family functions are identical.

davids5 commented 3 years ago

@btashton Thank you this is great engineering and analysis!

It adds a tremendous level of understanding to the cause and effect of the changes in the release. It would make sense to run bloaty on some representative config's in the PR.

Here is a thought on how to present it: We could have some config permutations just for size comparison. We would enable everything (we can) and have a bloaty.ld linker scripts with multi Megs of memory, so we can always link. Of course these configs would need a disclaimer not to use them.

btashton commented 3 years ago

It adds a tremendous level of understanding to the cause and effect of the changes in the release. It would make sense to run bloaty on some representative config's in the PR.

Here is a thought on how to present it: We could have some config permutations just for size comparison. We would enable everything (we can) and have a bloaty.ld linker scripts with multi Megs of memory, so we can always link. Of course these configs would need a disclaimer not to use them.

That is actually an interesting thought. I was already looking at possibly adding some configurations that are very feature heavy to make it easier to get some more build coverage as well. I have a configuration that I use locally on one of the STM32 with a lot of ram/flash that has net/usb/adc/sensors/display/etc... all enabled which also makes it easier for me to do functional testing without having to spend so much time flashing everything. There are other boards like the imxrt where we could get create some functional large configurations.

I spent some time putting some more tooling together and I think I have an idea on how to go about adding some useful visualization to this so I'll share something early next week. We actually already have over 250 runs against commits merged in the master branch each with around 590 builds collected and that is only since the end of October, so I suspect I can show some interesting graphs from that.

patacongo commented 3 years ago

We would enable everything (we can) ...

For cases like the mm_memalign issue, minimal configurations should code growth better than large configurations. When new, previously unused code is drawn into a minimal configuration, the code growth can be substantial.

Are we trying to protect the minimal configurations like nsh or detect size changes in large complex configurations.

xiaoxiang781216 commented 3 years ago

Brennan, It's a great analysis! Why not to create a tips section in document and put this analysis there?

On Sat, Nov 28, 2020 at 5:54 AM patacongo notifications@github.com wrote:

We would enable everything (we can) ...

For cases like the mm_memalign issue, minimal configurations should code growth better than large configurations. When new, previously unused code is drawn into a minimal configuration, the code growth can be substantial.

Are we trying to protect the minimal configurations like nsh or detect size changes in large complex configurations.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/apache/incubator-nuttx/issues/2425#issuecomment-735233847, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEJ24FDALJ2SRYQESJ6PGILSSD6J5ANCNFSM4UFL6G6A .

btashton commented 3 years ago

I am going to close this out since there is nothing directly actionable here. I have been capturing the artefacts for the last few months and am going to be publishing a tool to visualize them a little. Maybe then we will be able to get that hooked into the CI more.

acassis commented 3 years ago

@btashton about nsh_session (+132) Should we open an Issue to add a config flag to enable disable nsh parameter?

acassis commented 3 years ago

I am going to close this out since there is nothing directly actionable here. I have been capturing the artefacts for the last few months and am going to be publishing a tool to visualize them a little. Maybe then we will be able to get that hooked into the CI more.

I agree! Maybe we need to setup an automatic way to "pull the plug" when the binary increases more than 5%. Did you have some idea about how we could do it?

btashton commented 3 years ago

@btashton about nsh_session (+132) Should we open an Issue to add a config flag to enable disable nsh parameter?

It's more complicated then that. One symbol grew the other got smaller because of the change. This net increase is really just how memalign got pulled in which as mentioned in real configs is likely added anyway.

Anyway we can discuss what to do with the tooling when I have it ready, right now I have just been capturing data and trying to automate some of what I did here.

apache / nuttx

Code size growth between 9.1.0 and 10.0.0-RC0 #2425

Why is the Build Larger

Intro

Discovery

Analysis

`mm_memalign` (+284)

`proc_findnode` (+72)

`fclose` (+158)

`nsh_session` (+132)

`nxtask_init` (+232)

`nxsem_clockwait` (+212)

apache / nuttx

Code size growth between 9.1.0 and 10.0.0-RC0 #2425

Why is the Build Larger

Intro

Discovery

Analysis

mm_memalign (+284)

proc_findnode (+72)

fclose (+158)

nsh_session (+132)

nxtask_init (+232)

nxsem_clockwait (+212)

`mm_memalign` (+284)

`proc_findnode` (+72)

`fclose` (+158)

`nsh_session` (+132)

`nxtask_init` (+232)

`nxsem_clockwait` (+212)