abcminiuser / lufa

LUFA - the Lightweight USB Framework for AVRs.
http://www.lufa-lib.org

DFU bootloader vs GCC compiler (9.1.0 and 8.3.0) #149

Closed zygmunt closed 4 years ago

zygmunt commented 5 years ago

Hi,

After three months of stagnation I'm back on my little hobby project. I tried to compile the DFU bootloader for the atmega32u4, and it fails with this error:

/usr/bin/avr-ld: section .apitable_trampolines LMA [0000000000007fa0,0000000000007fb7] overlaps section .data LMA [0000000000007f9c,0000000000008003]
collect2: error: ld returned 1 exit status

I'm building it from lufa/Bootloaders/DFU. In the makefile I changed: MCU = atmega32u4, FLASH_SIZE_KB = 32, BOOT_SECTION_SIZE_KB = 4.

I also tried atmega16u2 (flash size changed to 16) and, wow, no problem. Fortunately I figured out that when I use avr-gcc-8.3.0 everything is OK.

And now the question is: any idea what has changed in compiler 9.1.0?

abcminiuser commented 5 years ago

Sounds like the newer compiler produces slightly larger code than the older one, and it no longer fits into 4KB. I'll have to see if there is anything else obvious I can do to try to squeeze down the bootloader sizes with the newer version of GCC.

zygmunt commented 5 years ago

If compiling for the atmega16u2 failed too, then shrinking the code would IMHO be required. But it works; only the atmega32u4 fails. I also saw in the GCC Bugzilla that it fails for some other chip. I didn't see anything new in the GCC changelog. Maybe this is a bug.

abcminiuser commented 5 years ago

The U4 series parts require larger binaries due to their internal register maps and USB controller feature sets, so the compilation fitting in 4KB for the U2 parts but not the U4 parts is not unexpected.

I've just grabbed the latest AVR-GCC 9.1.0 binaries and compared them against my existing 8.1.0 install. It looks like the newer version of the compiler is a bit more pessimistic, generating 146 bytes more in FLASH in the DFU bootloader than the older revision under the same conditions.

Looking at the assembly one obvious difference is around some of the function prologs and epilogs, e.g.:

GCC 8.1.0:

void BootloaderAPI_FillWord(const uint32_t Address, const uint16_t Word)
{
    boot_page_fill_safe(Address, Word);
    6964:   07 b6           in  r0, 0x37    ; 55
    6966:   00 fc           sbrc    r0, 0
    6968:   fd cf           rjmp    .-6         ; 0x6964 <BootloaderAPI_FillWord>
    696a:   f9 99           sbic    0x1f, 1 ; 31
    696c:   fe cf           rjmp    .-4         ; 0x696a <BootloaderAPI_FillWord+0x6>
    696e:   21 e0           ldi r18, 0x01   ; 1
    6970:   fb 01           movw    r30, r22
    6972:   0a 01           movw    r0, r20
    6974:   20 93 57 00     sts 0x0057, r18 ; 0x800057 <__TEXT_REGION_LENGTH__+0x7e0057>
    6978:   e8 95           spm
    697a:   11 24           eor r1, r1
}

GCC 9.1.0:

void BootloaderAPI_FillWord(const uint32_t Address, const uint16_t Word)
{
    69f2:   0f 93           push    r16
    69f4:   1f 93           push    r17
    69f6:   8b 01           movw    r16, r22
    69f8:   ca 01           movw    r24, r20
    boot_page_fill_safe(Address, Word);
    69fa:   07 b6           in  r0, 0x37    ; 55
    69fc:   00 fc           sbrc    r0, 0
    69fe:   fd cf           rjmp    .-6         ; 0x69fa <BootloaderAPI_FillWord+0x8>
    6a00:   f9 99           sbic    0x1f, 1 ; 31
    6a02:   fe cf           rjmp    .-4         ; 0x6a00 <BootloaderAPI_FillWord+0xe>
    6a04:   41 e0           ldi r20, 0x01   ; 1
    6a06:   f8 01           movw    r30, r16
    6a08:   0c 01           movw    r0, r24
    6a0a:   40 93 57 00     sts 0x0057, r20 ; 0x800057 <__TEXT_REGION_LENGTH__+0x7e0057>
    6a0e:   e8 95           spm
    6a10:   11 24           eor r1, r1
}
    6a12:   1f 91           pop r17
    6a14:   0f 91           pop r16
    6a16:   08 95           ret

The newer version is doing some odd things compared to the old, pushing/popping some temporary registers just so it can copy the 16-bit argument words from r22/r20 into r16/r24 instead of just using them directly. That certainly looks like a regression from the older compiler version, but I've got no idea why it would choose this.

abcminiuser commented 5 years ago

More testing (it's been a while since I poked at AVRs). It would appear that GCC 8 onwards finally has working Link Time Optimization which can be a huge win for size. Can you try adding LTO=Y to the makefile and see if it works for you? It compiles quite a bit smaller for me, which is an improvement from previous versions where it would either increase the overall size or crash the linker.

zygmunt commented 5 years ago

Wow, yeah that works:)

avr-size --mcu=atmega32u4 --format=avr BootloaderDFU.elf

AVR Memory Usage
Device: atmega32u4
Program: 3958 bytes (12.1% Full) (.text + .data + .bootloader)
Data: 179 bytes (7.0% Full) (.data + .bss + .noinit)

Thank you very much.

PS: I haven't tested the resulting binary. Maybe I will have a chance to do it today.