[Code size] Emit __mspabi_func_epilog instead of a sequence of 'pop' instructions

access-softek / llvm-project

Other

0 stars 0 forks source link

[Code size] Emit __mspabi_func_epilog instead of a sequence of 'pop' instructions #1

Closed chbessonova closed 11 months ago

chbessonova commented 5 years ago

MSP430 EABI (see p. 3.8) provides __mspabi_func_epilog_X functions set to replace a sequence of pop instructions at the end of a function in order to reduce code size. GCC uses the approach, __mspabi_func_epilog_X is implemented in libgcc.

For example,

30 40 00 00     br  #0x0000     ;
        R_MSP430_16 __mspabi_func_epilog_7

instead of

34 41           pop r4      ;
35 41           pop r5      ;
36 41           pop r6      ;
37 41           pop r7      ;
38 41           pop r8      ;
39 41           pop r9      ;
3a 41           pop r10     ;

cr1901 commented 4 years ago

@chbessonova Has there been any movement to implement this optimization that hasn't been pushed here yet? Especially in LLVM, I would find this optimization valuable for optimizing msp430 sizes in Rust. This is the last big optimization I can think of that gcc does that LLVM does not.

asl commented 4 years ago

@cr1901 Not that I'm aware of. Though it is useful, yes.

asl commented 4 years ago

Note that we could further improve this via implementing machine outliner. This is access-softek/llvm-project#7 for the record

cr1901 commented 4 years ago

@asl Good to know. This was on my radar as part of optimizing msp430 Rust. But if you know how to make it more general, then it's probably better that I follow your progress there and see if I can test changes (or, if I can understand msp430-llvm, make some myself :))...

atrosinenko commented 4 years ago

@cr1901 In case someone else could be interested in reviewing this msp430-specific patch: https://reviews.llvm.org/D84397

cr1901 commented 4 years ago

@atrosinenko My attempt at testing is blocked pending cargo fixes. I will get to it when possible.

cr1901 commented 4 years ago

@atrosinenko

Before Patching

Rust commit 09f4c9f5082f78, LLVM commit 86b120e6f30, which corresponds to LLVM 10.

`-Os`

msp430-elf-size target/msp430-none-elf/release/at2xt
   text    data     bss     dec     hex filename
   1998       4      46    2048     800 target/msp430-none-elf/release/at2xt

Patching

patch -Np1 < D84397.diff

patching file llvm/lib/Target/MSP430/CMakeLists.txt
patching file llvm/lib/Target/MSP430/MSP430.h
Hunk #1 succeeded at 43 (offset 1 line).
patching file llvm/lib/Target/MSP430/MSP430CommonEpilogueOptimizer.cpp
patching file llvm/lib/Target/MSP430/MSP430InstrInfo.cpp
Hunk #2 succeeded at 194 (offset 12 lines).
Hunk #3 succeeded at 212 (offset 12 lines).
patching file llvm/lib/Target/MSP430/MSP430TargetMachine.cpp
patching file llvm/test/CodeGen/MSP430/asm-clobbers.ll
Hunk #1 FAILED at 13.
Hunk #2 FAILED at 42.
2 out of 2 hunks FAILED -- saving rejects to file llvm/test/CodeGen/MSP430/asm-clobbers.ll.rej
patching file llvm/test/CodeGen/MSP430/mspabi-func-epilog-multi-terminators.mir
patching file llvm/test/CodeGen/MSP430/mspabi-func-epilog-special-cases.ll
patching file llvm/test/CodeGen/MSP430/mspabi-func-epilog.ll

I also changed MSP430::R4 back to MSP430::FP in MSP430CommonEpilogueOptimizer.cpp.

After Patching

`-Os`

msp430-elf-size target/msp430-none-elf/release/at2xt
 text    data     bss     dec     hex filename
 2008       4      46    2058     80a target/msp430-none-elf/release/at2xt

`-Oz`

In general -Oz is unreliable for me to get smaller code.

msp430-elf-size target/msp430-none-elf/release/at2xt
   text    data     bss     dec     hex filename
   2040       4      46    2090     82a target/msp430-none-elf/release/at2xt

Hmmm... binary size increased! Would it help if I supplied the resulting llvm and emitted asm? The epilog functions are being emitted and used. But I don't really understand why there's a size increase.

asl commented 4 years ago

I guess the reason is that only few epilog functions are emitted, but the whole set is included in the assembly file. Anyway, can you attach assemblies?

cr1901 commented 4 years ago

Anyway, can you attach assemblies?

I have attached for different assemblies in this file: at2xt-builds.zip.

The below info is mainly for me just in case I forget, but others might find it useful.

File Layout

Each build directory- Os, Oz, epilog-Os, epilog-Oz- has the same files:

at2xt- Final ELF file.
at2xt.o- Object file before linking.
at2xt.ll- LLVM source.
at2xt.lst- Listing of the ELF file from objdump.
at2xt.sym- Symbol table.
at2xt.o.lst- Listing of the object file from objdump. (Sometimes it's more useful to me because label names are still intact.)
at2xt.reloc- Relocation table.

Build Command

Built with AT2XT commit de9b431 and a Rust toolchain compiled as per my previous comment.

Each build has the same command-line invocation, with the following differences during the actual build:

The linked line was changed from opt-level = "s" to opt-level = "z" and vice versa to test both optimization levels.
llvm was recompiled with and without the patch described in my previous comment.

This makes 4 combinations, hence 4 directories of builds.

cargo rustc -Zbuild-std=core --release --target=msp430-none-elf -- --emit=obj=target/msp430-none-elf/release/at2xt.o,llvm-ir=target/msp430-none-elf/release/at2xt.ll
msp430-elf-objdump -Cd target/msp430-none-elf/release/at2xt > target/msp430-none-elf/release/at2xt.lst
msp430-elf-readelf -a --wide target/msp430-none-elf/release/at2xt > target/msp430-none-elf/release/at2xt.sym
msp430-elf-objdump -Cd target/msp430-none-elf/release/at2xt.o > target/msp430-none-elf/release/at2xt.o.lst
msp430-elf-readelf -r --wide target/msp430-none-elf/release/at2xt.o > target/msp430-none-elf/release/at2xt.reloc
msp430-elf-size target/msp430-none-elf/release/at2xt

atrosinenko commented 4 years ago

I cannot say anything specific on -Os vs. -Oz right now: it is out of scope for this patch but is definitely worth further investigations. When looking at *.o files only generated at -Os optimization level, I agree with @asl: from what I see, there are 4 two-byte branches that just changed their destinations (due to change of size) as well as three epilogues replaced. Two of them actually do not changed in size: 2+2 bytes (pop + ret) were replaced with 4 bytes (br). One other was actually shortened from 5*2 = 10 bytes to 4 only (so, 6 bytes less).

How the code size is actually changed:

an epilogue sequence with N pop instructions followed by a single ret consumes (N + 1) * 2 bytes
any shortened epilogue is exactly 4 bytes long
thus, the epilogue is never made larger than it was, but the shortest original epilogue sequence remains 4 bytes long
the helper function, on the other hand, takes 16 bytes (technically, it is the largest epilogue sequence, for N = 7, with branches pointing at some instruction inside it)
jmp instruction could be emitted instead of br for near-enough jumps, but we cannot say for sure whether it is applicable until linking

$ llvm-objdump -d Os/at2xt.o > Os-dump.txt
$ llvm-objdump -d epilog-Os/at2xt.o > epilog-Os-dump.txt
$ sed --in-place -r 's/[ \t]*[0-9a-z]+:/\tXYZ:/' *.txt
$ diff --side-by-side --suppress-common-lines *.txt
epilog-Os/at2xt.        XYZ:    file format ELF32-msp430      | Os/at2xt.       XYZ:    file format ELF32-msp430
        XYZ: 30 40 00 00                        br      #0    |         XYZ: 3a 41                              pop     r10
                                                              >         XYZ: 30 41                              ret
        XYZ: 03 24                              jeq     $+8   |         XYZ: 06 24                              jeq     $+14
        XYZ: 30 40 00 00                        br      #0    |         XYZ: 37 41                              pop     r7
                                                              >         XYZ: 38 41                              pop     r8
                                                              >         XYZ: 39 41                              pop     r9
                                                              >         XYZ: 3a 41                              pop     r10
                                                              >         XYZ: 30 41                              ret
        XYZ: f5 23                              jne     $-20  |         XYZ: f2 23                              jne     $-26
        XYZ: d0 27                              jeq     $-94  |         XYZ: cd 27                              jeq     $-100
        XYZ: cc 3f                              jmp     $-102 |         XYZ: c9 3f                              jmp     $-108
        XYZ: 30 40 00 00                        br      #0    |         XYZ: 3a 41                              pop     r10
                                                              >         XYZ: 30 41                              ret

cr1901 commented 4 years ago

sed --in-place -r 's/[ \t]*[0-9a-z]+:/\tXYZ:/' *.txt

I must at least mention- this is an excellent regex that I'm stealing for future comparisons :).

jmp instruction could be emitted instead of br for near-enough jumps, but we cannot say for sure whether it is applicable until linking.

I have LTO enabled for these builds. Is it still impossible to know whether the jmps are close enough (either through LTO or linker relaxation)? Does the MSP430 ABI doc say anything about whether br must be used? Of course, that leads to another point you mention:

the helper function, on the other hand, takes 16 bytes (technically, it is the largest epilogue sequence, for N = 7, with branches pointing at some instruction inside it)

Okay, so in my code as-is, the overhead of __mspabi_func_epilog will always exceed the savings. Is it possible in LTO mode to dead-code-eliminate the branches which aren't used? I'm guessing it's not possible right now, since AFAIU we still require gcc for the link step (lgcc provides __mspabi_func_epilog), and even if TI's gcc had libraries compiled with LTO IR, cross-toolchain LTO doesn't work.

On the same token, I imagine it's not possible to know a priori how big the helper function is in the gcc toolchain, even though the "obvious" way would be- and is- 16 bytes.

But pretending we had a fully LLVM toolchain for a moment:

Is there anything in principle that forbids dead code elimination of the __mspabi_func_epilog symbols according to the MSP430 ABI?
Could there be an option to automatically "revert" (if this is possible in LLVM) __mspabi_func_epilog emission if it's determined that there isn't a net savings?

Since these are longer term questions, is it possible to add an option to disable the epilog generation for those who know from benchmarking that it doesn't save space?

atrosinenko commented 4 years ago

I have LTO enabled for these builds. Is it still impossible to know whether the jmps are close enough (either through LTO or linker relaxation)?

As I understand, LTO (at least the LLVM one) operates on the IR level while the __mspabi_func_epilog_N are most probably implemented in assembly. Still the fact that linker relaxation did not handle this out of the box was not so expected for me when I tried it yesterday...

Does the MSP430 ABI doc say anything about whether br must be used?

I hope this is irrelevant unless the code jumps to the correct place.

Okay, so in my code as-is, the overhead of __mspabi_func_epilog will always exceed the savings.

Looks like you had not gained anything just because you got a whole 16-bytes-long common epilogue sequence while actually shortened one epilogue only. Since you have only a constant 16 bytes increase, the more epilogues are shortened, the more bytes you save - but in your case you saved just 6 bytes so the code was actually enlarged by 10 bytes...

On other questions, the MSP430 EABI document says the expected implementation is:

__mspabi_func_epilog:
__mspabi_func_epilog_7: POP R4
__mspabi_func_epilog_6: POP R5
__mspabi_func_epilog_5: POP R6
__mspabi_func_epilog_4: POP R7
__mspabi_func_epilog_3: POP R8
__mspabi_func_epilog_2: POP R9
__mspabi_func_epilog_1: POP R10
                        RET

So, this implementation looks absolutely straightforward, but right now I have no idea on what common machinery could eliminate the unneeded "uppermost" part of this function. Better understanding of linker abilities could probably help here.

Could there be an option to automatically "revert" (if this is possible in LLVM) __mspabi_func_epilog emission if it's determined that there isn't a net savings?

A conservative way would be to somehow refrain from this optimization altogether unless we could definitely save at least 16 bytes for this particular LLVM module. This could prevent some relevant optimization possibilities in a traditional setup (compiling many object files then linking) but should fit well-enough for LTO case.

cr1901 commented 4 years ago

A conservative way would be to somehow refrain from this optimization altogether unless we could definitely save at least 16 bytes for this particular LLVM module. This could prevent some relevant optimization possibilities in a traditional setup (compiling many object files then linking) but should fit well-enough for LTO case.

This seems reasonable for my use case.

Next Steps

That being said, I should probably compare the Rust implementation to the savings in the semi-equivalent C code implementation (I have the C toolchain from TI installed too). I think I will do that this weekend. Totally escaped my mind to try this and compare/contrast. :)

I can divide testing into three categories:

C code

No bounds checks
No disabling interrupts to access globals.
No ref counting on globals.
Nothing equivalent to Rust's panic.
Almost certainly will take the least space. GCC does take advantage of __mspabi_func_epilog emission in my testing.
Test GCC and LLVM toolchain at -Os with LTO, with and without __mspabi_func_epilog emission. You cannot "opt" out of __mspabi_func_epilog emission for GCC.

Unsafe Rust

This is Rust written in a style that allows UB to occur (using unsafe). Some UB in Rust isn't necessarily UB in C, like data races. I can turn the Idiomatic Rust code below into Unsafe style in a few minutes.

No bounds checks
No disabling interrupts to access globals.
No ref counting on globals.
panic can still happen.
Return types mostly equivalent to C code.
Approx 5-10% inherent overhead compared to C code.
Test LLVM toolchain at -Os with LTO, with and without __mspabi_func_epilog emission.

Idiomatic Rust

Bounds checks
Interrupts disabled to access globals.
Ref counting on globals when appropriate.
Almost all return types are tagged unions.
From experience, 20-25% overhead.
Test LLVM toolchain at -Os with LTO, with and without __mspabi_func_epilog emission. This is already in my .zip file :)

atrosinenko commented 4 years ago

the helper function, on the other hand, takes 16 bytes (technically, it is the largest epilogue sequence, for N = 7, with branches pointing at some instruction inside it)

Okay, so in my code as-is, the overhead of __mspabi_func_epilog will always exceed the savings.

@cr1901 Could you please explain why overhead is always greater than savings for your code? IIUC this pass always adds 16 bytes of code (I cannot say for sure now whether it has any alignment requirements) that can be used from anywhere. Aside from that, it should never increase the code size. Meanwhile, is there something that prevents emitting the traditional epilogue sequences (that could be replaced by this optimization) except for not using too many registers?

@cr1901 @asl I would prefer not to implement heuristics for LTO case. On the other hand, adding an -fsomething switch should be rather simple, provided its value is more significant than cluttering the LLVM code base and documentation. Meanwhile, the aforementioned heuristics probably needs some switch as well to handle the case of many small translation units properly.

cr1901 commented 4 years ago

Aside from that, it should never increase the code size.

The problem is that my code gets inlined so much thanks to LTO that in 2kB there are only 4 or so epilogs. So the "aside from that" (the up front 16 byte cost of including __mspabi_func_epilog) is, in fact, what's increasing the total code side.

Meanwhile, is there something that prevents emitting the traditional epilogue sequences (that could be replaced by this optimization) except for not using too many registers?

Can you elaborate on what you mean by "except for not using too many registers"? Would not using too many registers cause a function to become a prime candidate for inlining? Other than a lot of inline expansion thanks to LTO opportunities (and without LTO, this code won't fit :)...), I don't know why my code has so few epilog sequences.

atrosinenko commented 4 years ago

The problem is that my code gets inlined so much thanks to LTO that in 2kB there are only 4 or so epilogs.

Now got it, thanks :)

Can you elaborate on what you mean by "except for not using too many registers"?

I mean "if a function clobbers many callee-saved registers, then it usually saves them in prologue and restores them in epilogue". The second part can usually be replaced by a single 4-byte BR while the original sequence takes (1 + <restored registers count>) x 2 bytes. If function restores only a few registers, then savings are small as well. I thought maybe it generates some less traditional, custom, sequences to handle callee-saved registers but it would be rather surprising as it should be handled completely in the LLVM backend.

Would not using too many registers cause a function to become a prime candidate for inlining?

I'm not familiar with inlining heuristics but anyway I'm interested in functions after inlining took place. It looks somewhat surprising to me that (large?) functions produced after inlining need to restore so few registers. It probably needs some further investigation on my side.

cr1901 commented 4 years ago

It looks somewhat surprising to me that (large?) functions produced after inlining need to restore so few registers.

This is why I wanted to compare the old C version of my AT2XT firmware to the Rust version- so you can have a point of comparison in a deployed application. As you can see, it's been... taking me a while to add compare functionality :P. Also, to reiterate, the C source is not quite semantically equivalent to the Rust code, and Rust code has extra features. This may make the C code savings look better than it should be.

It probably needs some further investigation on my side.

If I provide you a script/insns, would you be willing to build Rust on your side for comparison purposes? Although I opt to patch the in-tree LLVM with your patches, you can target an external LLVM.

Personally, I am fine using C still. But a lot of ppl in my immediate circle aren't anymore. And I carefully monitor msp430 Rust for size regressions so it can remain a viable target anywhere C can :)!

atrosinenko commented 4 years ago

If I provide you a script/insns, would you be willing to build Rust on your side for comparison purposes?

I am not experienced with Rust, so it may be faster to apply this patch on your side. On the other hand, if you already have such a script "ready to publish", then it may be useful for me or someone else...

cr1901 commented 4 years ago

@atrosinenko ~Okay I'm fed up trying to make this work... it seems that gcc and clang are rather different for msp430. how does one create C code for MSP430 that works on both GCC and Clang with the minimal amount of changes required between them?~

EDIT: I figured it out. But the epilogs are still not present. They are, however, present in my Rust code. I'm using the msp430-epilog branch of my LLVM repo (which is tracking LLVM 10, currently).

I have a self-contained example here. All relevant header files and linker script are included, and the code compiles on both clang and gcc. But ~interrupt vectors are not being linked in, and~ interestingly enough, none of the epilog routines are being linked in on the clang version. For the purposes of keeping things simple, I disabled LTO for both gcc and clang for now. Compared to Rust (no external deps in the C code, for instance), LTO doesn't make a significant difference.

cr1901 commented 4 years ago

I noticed that the __crt0_init_bss and __crt0_movedata functions weren't being included either in the binary emitted when running clang. This most likely has to do with the .refsym dance I describe in #16.

By including the following top-level ASM statements:

__asm(".refsym __crt0_init_bss");
__asm(".refsym __crt0_movedata");

I force the startup code to be linked in. Not only that, but now all the __mspabi_func_epilog code gets linked in as well, despite it not being used by the application proper. ~I'm not sure how to interpret this...~

EDIT: Appears to not be unique to clang. Compiling the code with gcc also has the same effect of including dead code. I'm clearly doing something I'm not supposed to as a workaround.

asl commented 4 years ago

@cr1901

EDIT: I figured it out. But the epilogs are still not present. They are, however, present in my Rust code. I'm using the msp430-epilog branch of my LLVM repo (which is tracking LLVM 10, currently).

All the work done was in LLVM mainline. It does not make any sense to use LLVM 10, 5, 3.4 or any other version

cr1901 commented 4 years ago

All the work done was in LLVM mainline. It does not make any sense to use LLVM 10, 5, 3.4 or any other version

(Fortunately, Rust started tracking release/11.x last week, so it's not too far behind mainline/12.x...)

Using Rust commit 08deb863bde, and LLVM/Clang commit be8127459ec, aka PR #8, I compiled a new Rust and Clang that are both tracking LLVM mainline with __mspabi_func_epilog emission. I needed to use the below patches to get Rust and Clang to compile, but I assume those have negligible effects on MSP430 code emission.

I then ran some size benchmarks using my AT2XT repo to see the effect on binary size.

Benchmarks

TL;DR version: Rust code is pessimized by 12 bytes using this patch because the 16-bytes of epilog code are linked in when this patch is present. Clang code remains the same size; the epilog code is linked in to both binaries with and without the patch. This is because memmove from newlib libc uses mspabi_func_epilog_2. Even with the patch, clang itself isn't emitting any epilog code for my firmware!

Clang

Build with: make TOOLCHAIN=clang in legacy-src directory of the repo, with include.mk vars set appropriately to point to the custom toolchain. make clean in between builds.

Without epilog patch, mspabi_func_epilog_2 is still used as part of memmove, which is used as part of __crt0_movedata, and thus the epilog code still gets linked in. Of course, memmove was compiled using gcc :).

msp430-elf-size at2xt.elf
   text    data     bss     dec     hex filename
   1680       4      36    1720     6b8 at2xt.elf

With epilog patch, nothing changes.:

msp430-elf-size at2xt.elf
   text    data     bss     dec     hex filename
   1680       4      36    1720     6b8 at2xt.elf

Rust

Build with: cargo build -Zbuild-std=core --release --target=msp430-none-elf at the root of the repo, with rustup override set to point to the custom toolchain. cargo clean in between builds.

Without epilog patch, looks like we got some savings here and there since LLVM 10/11 :D! I think I should do some cherry-picking later.

msp430-elf-size target/msp430-none-elf/release/at2xt
   text    data     bss     dec     hex filename
   1960       4      46    2010     7da target/msp430-none-elf/release/at2xt

With epilog patch, mspabi_func_epilog_1 and mspabi_func_epilog_3 are both used once. The jumps use 4 and 4 bytes, respectively. Space savings is therefore (8 - 4) + (4 - 4) - 16 = -12 bytes saved. The up-front cost of adding the 16-byte epilog code block dominates.

msp430-elf-size target/msp430-none-elf/release/at2xt
   text    data     bss     dec     hex filename
   1972       4      46    2022     7e6 target/msp430-none-elf/release/at2xt

GCC

mspabi_func_epilog_2, mspabi_func_epilog_3, and mspabi_func_epilog_4 are all jumped to once, respectively. The jumps use 2, 2, and 4 bytes, respectively. Space savings is therefore (10 - 4) + (8 - 2) + (6 - 2) - 16 = 0 bytes saved. So, we break even. memmove isn't linked in using gcc because the data section has no data.

msp430-elf-size at2xt.elf
   text    data     bss     dec     hex filename
   1572       0      36    1608     648 at2xt.elf

Patches

Rust

Between LLVM 11 and mainline (b13b85818), StandardInstrumentations constructor started requiring a parameter:

diff --git a/src/rustllvm/PassWrapper.cpp b/src/rustllvm/PassWrapper.cpp
index 76fe5e7f769..c84193c8737 100644
--- a/src/rustllvm/PassWrapper.cpp
+++ b/src/rustllvm/PassWrapper.cpp
@@ -754,7 +754,7 @@ LLVMRustOptimizeWithNewPassManager(
   PTO.SLPVectorization = SLPVectorize;

   PassInstrumentationCallbacks PIC;
-  StandardInstrumentations SI;
+  StandardInstrumentations SI(false);
   SI.registerCallbacks(PIC);

   if (LlvmSelfProfiler){

LLVM

Standalone Clang build was broke for about a month or so; the epilog patch was created while Clang builds were still broken:

diff --git a/clang/tools/libclang/CMakeLists.txt b/clang/tools/libclang/CMakeLists.txt
index 9b34682cc49..a4077140ace 100644
--- a/clang/tools/libclang/CMakeLists.txt
+++ b/clang/tools/libclang/CMakeLists.txt
@@ -68,7 +68,12 @@ endif ()

 if (HAVE_LIBDL)
   list(APPEND LIBS ${CMAKE_DL_LIBS})
-endif()
+elseif (CLANG_BUILT_STANDALONE)
+  find_library(DL_LIBRARY_PATH dl)
+  if (DL_LIBRARY_PATH)
+    list(APPEND LIBS dl)
+  endif ()
+endif ()

 option(LIBCLANG_BUILD_STATIC
   "Build libclang as a static library (in addition to a shared one)" OFF)