InBetweenNames / gentooLTO

A Gentoo Portage configuration for building with -O3, Graphite, and LTO optimizations
GNU General Public License v2.0

Building the Linux kernel using LTO #90

Open InBetweenNames opened 6 years ago

InBetweenNames commented 6 years ago

I find it interesting that there hasn't been more push to build the kernel using LTO. I've found a couple of mailing list threads about it, including a patchset to let it happen, but there wasn't a lot of interest upstream. I've created this issue as a way to track what the current LTO progress in the kernel is, and possibly even add some patchsets to let it happen. I know I'd for sure use it on my router if I could with OpenWRT.

sjnewbury commented 5 years ago

@InBetweenNames I would use it on my router too. I think at the time the gcc LTO toolchain wasn't very mature and few were able to make much use of it, particularly in embedded Linux, where there would be most interest. Without that buy-in the kernel devs weren't going to let the patches in.

Perhaps resurrecting the patch set and getting it working again could be successful now that lto support is pretty ubiquitous in distros and most embedded devs must be using it by now for their user space.

ionenwks commented 5 years ago

Seems some remnants of those patches are still in the kernel (notably DISABLE_LTO so it doesn't use it for vdso), so I tried with 4.19.1. Formerly used scripts/gcc-ld but didn't work for me so I used gold. I doubt it's accomplishing anything built this way (size barely changed with other defaults). Despite using gcc-ar, was also complaining about the lto plugin unless -ffat-lto. Patchset used to use -fwhole-program too but that didn't work. Nonetheless, thought I'd do the crazy thing and build the kernel with:

make -j8 AR=gcc-ar NM=gcc-nm LD=ld.gold KCFLAGS="-march=native -O3 -falign-functions=32 -fipa-pta -fno-semantic-interposition -fgraphite-identity -floop-nest-optimize -flto=8 -ffat-lto-objects" DISABLE_LTO=-fno-lto

Which.. worked.. and booted fine. I am now the proud owner of a kernel that is 30% bigger than before, probably not faster, and set out to kill my dog, but thankfully running in QEMU away from my dog. Edit: well, removing LTO with the same options does make it like 10% bigger still.

gcs-github commented 5 years ago

It might be interesting to compare the speed of some syscall- / kernel-bound workloads when successfully built with LTO. Anyone with an idea on how to start benchmarking our gains or losses?
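One rough way to start, offered as a sketch rather than a rigorous benchmark: time a tight loop of a cheap syscall under each kernel and compare the numbers. The helper name `time_syscalls` and the choice of `os.getppid` are assumptions made for illustration. Interpreter overhead is included in every call, so only the relative difference between two kernels on the same machine means anything.

```python
import os
import time


def time_syscalls(iterations=200_000, repeats=5):
    """Return the best observed average cost, in nanoseconds, of one os.getppid() call.

    The figure includes Python call overhead on top of the kernel entry/exit cost.
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter_ns()
        for _ in range(iterations):
            os.getppid()  # enters the kernel on every call
        elapsed = time.perf_counter_ns() - start
        # Keep the fastest run to reduce noise from other processes.
        best = min(best, elapsed / iterations)
    return best


if __name__ == "__main__":
    print(f"{os.uname().release}: {time_syscalls():.1f} ns per getppid() call")
```

Running the same script after booting the LTO and non-LTO kernels gives two numbers to compare. A real comparison would also want fixed CPU frequency and heavier workloads than a single syscall.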

darkbasic commented 5 years ago

Not sure, but if you check the kernel mailing list plenty of those benchmarks have been done in the past. I remember seeing pretty big gains with LTO, but not sure if those reflected into any gain for daily usage. Some more info about how to benchmark the kernel: https://github.com/graysky2/kernel_gcc_patch

cb88 commented 5 years ago

One thing about LTO is you have to build as many of your modules into the kernel as possible... so it knows what it can eliminate when linking... so you get the biggest gains on a completely static kernel (this of course breaks some things that load firmware etc... some of that you can work around by building in the blobs though).

ms178 commented 5 years ago

Andi Kleen rebased his LTO patches for the Kernel on 4.20 recently. I've tried it out but had no luck and several module errors along the way. Nevertheless, you can find these patches here: https://github.com/andikleen/linux-misc/tree/lto-420-1

ionenwks commented 5 years ago

^ Didn't experiment much but gave it a quick try and it built fine for me with my configuration and CONFIG_LTO=y, which auto-adds -flto -fno-fat-lto-objects. Didn't try a generic one, and I use almost no modules which, as stated in the post above, is better suited to an LTO kernel anyway.

Looks like it's using the gcc-ld script and working properly. I do have gold as my default linker (been using it even for kernel).

I imagine it may make more of a difference on a less-lean kernel, but my resulting 4.20 kernel is about 1% smaller than my old one. Didn't try to boot it, and no idea about any performance gains.

Promaethius commented 5 years ago

@ionenwks I'm trying to replicate the steps on a gentoo system to build an LTO'd kernel. However, I always error out on the linking portion:

/usr/lib/gcc/x86_64-pc-linux-gnu/9.1.0/../../../../x86_64-pc-linux-gnu/bin/ld: error: arch/x86/kernel/head_64.o: requires unsupported dynamic reloc 11; recompile with -fPIC

I have added this flag to the base KBUILD_CFLAGS but to no effect. I also have ld.gold enabled by default. What version of GCC and binutils are you using? Did you make any configurations to the Makefile from Andi Kleen's repo?

ionenwks commented 5 years ago

Hmm... I tried again both with the lto-420-1 branch from back then along with same configuration and the newer lto-5.1-3, and I'm getting the same errors as you now (using gcc 8.3.0 and ld.bfd 2.32).

Not sure what I was using back then but looking at the date I assume I was on gcc 8.2 and binutils 2.30 I think? It's only something I tried real quick, I had no intention to stick with that for now (or boot it).

Edit: Retried with gold as default (switched back to bfd a while ago), doesn't work either, not with current toolchain anyway. Edit2: And no, I hadn't made any changes, used as-is.

Promaethius commented 5 years ago

@ionenwks thank you for taking the time to check through the issue! I was afraid it was a toolchain version issue, so I wonder if this is a reportable bug? I'm going to take some time today and check if it's a gcc or binutils issue. Edit: I'm throwing some more configuration testing into this mess. Found this article over on the patch list: https://patchwork.kernel.org/patch/10000627/

jiblime commented 5 years ago

I was able to build 5.0-1 successfully, however I did not test it and the system it was on is now gone.

-fPIC would cause reloc .text errors if it was built with visibility=hidden or ssp (but the Makefile already filters that). Maybe -flinker-output=rel would make sense here, but I couldn't get the syntax correct. ~because parts of the kernel build are still static, and static objects aren't able to find PIC references~. If anyone knows his full patchset without a kernel tree that'd be really helpful.

Promaethius commented 5 years ago

@jiblime You can find his patchset on the kernel mailing list but it won't really help: https://lkml.org/lkml/2017/11/27/1052 THIN_ARCHIVES was a config option that was removed in 4.19+. It went around the supposed issue of ld -r. But, I've narrowed it down to a ld issue of some sort. There are kernel patches that let you fPIC the code but they aren't working for me yet.

jiblime commented 5 years ago

@Promaethius Thanks for the link. I'm currently trying to edit arch/x86/entry/vdso/Makefile to work. At the very bottom you can try appending flags after ${LD} but nothing has worked for me, even the options to specifically suppress the error.

I went and checked a regular kernel and I noticed that it's normal(?) for a hidden symbol to be there.

Both outputs below come from running readelf vclock_gettime.o -s

5.1-3 LTO:

Symbol table '.symtab' contains 25 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS vclock_gettime.c
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    3 
     4: 0000000000000000     0 SECTION LOCAL  DEFAULT    4 
     5: 0000000000000000   174 FUNC    LOCAL  DEFAULT    1 do_hres
     6: 0000000000000000     0 SECTION LOCAL  DEFAULT    5 
     7: 0000000000000000     0 SECTION LOCAL  DEFAULT    7 
     8: 0000000000000000     0 SECTION LOCAL  DEFAULT    8 
     9: 0000000000000000     0 SECTION LOCAL  DEFAULT   10 
    10: 0000000000000000     0 SECTION LOCAL  DEFAULT   11 
    11: 0000000000000000     0 SECTION LOCAL  DEFAULT   12 
    12: 0000000000000000     0 SECTION LOCAL  DEFAULT   14 
    13: 0000000000000000     0 SECTION LOCAL  DEFAULT   15 
    14: 0000000000000000     0 SECTION LOCAL  DEFAULT   17 
    15: 0000000000000000     0 SECTION LOCAL  DEFAULT   19 
    16: 0000000000000000     0 SECTION LOCAL  DEFAULT   20 
    17: 0000000000000000     0 SECTION LOCAL  DEFAULT   18 
    18: 0000000000000000     0 NOTYPE  GLOBAL HIDDEN   UND vvar_vsyscall_gtod_data
    19: 00000000000000b0   111 FUNC    GLOBAL DEFAULT    1 __vdso_clock_gettime
    20: 00000000000000b0   111 FUNC    WEAK   DEFAULT    1 clock_gettime
    21: 0000000000000120    98 FUNC    GLOBAL DEFAULT    1 __vdso_gettimeofday
    22: 0000000000000120    98 FUNC    WEAK   DEFAULT    1 gettimeofday
    23: 0000000000000190    16 FUNC    GLOBAL DEFAULT    1 __vdso_time
    24: 0000000000000190    16 FUNC    WEAK   DEFAULT    1 time
readelf: Warning: compressed section '.debug_str' is corrupted

5.2.8 kernel:

Symbol table '.symtab' contains 27 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FILE    LOCAL  DEFAULT  ABS vclock_gettime.c
     2: 0000000000000000     0 SECTION LOCAL  DEFAULT    1 
     3: 0000000000000000     0 SECTION LOCAL  DEFAULT    3 
     4: 0000000000000000     0 SECTION LOCAL  DEFAULT    4 
     5: 0000000000000000   392 FUNC    LOCAL  DEFAULT    1 do_hres
     6: 0000000000000000     0 SECTION LOCAL  DEFAULT    5 
     7: 0000000000000000     0 SECTION LOCAL  DEFAULT    7 
     8: 0000000000000000     0 SECTION LOCAL  DEFAULT    8 
     9: 0000000000000000     0 SECTION LOCAL  DEFAULT   10 
    10: 0000000000000000     0 SECTION LOCAL  DEFAULT   11 
    11: 0000000000000000     0 SECTION LOCAL  DEFAULT   12 
    12: 0000000000000000     0 SECTION LOCAL  DEFAULT   14 
    13: 0000000000000000     0 SECTION LOCAL  DEFAULT   15 
    14: 0000000000000000     0 SECTION LOCAL  DEFAULT   17 
    15: 0000000000000000     0 SECTION LOCAL  DEFAULT   19 
    16: 0000000000000000     0 SECTION LOCAL  DEFAULT   20 
    17: 0000000000000000     0 SECTION LOCAL  DEFAULT   18 
    18: 0000000000000000     0 NOTYPE  GLOBAL HIDDEN   UND vvar_vsyscall_gtod_data
    19: 0000000000000000     0 NOTYPE  GLOBAL HIDDEN   UND hvclock_page
    20: 0000000000000000     0 NOTYPE  GLOBAL HIDDEN   UND pvclock_page
    21: 0000000000000190   102 FUNC    GLOBAL DEFAULT    1 __vdso_clock_gettime
    22: 0000000000000190   102 FUNC    WEAK   DEFAULT    1 clock_gettime
    23: 0000000000000200    98 FUNC    GLOBAL DEFAULT    1 __vdso_gettimeofday
    24: 0000000000000200    98 FUNC    WEAK   DEFAULT    1 gettimeofday
    25: 0000000000000270    16 FUNC    GLOBAL DEFAULT    1 __vdso_time
    26: 0000000000000270    16 FUNC    WEAK   DEFAULT    1 time

readelf: Warning: compressed section '.debug_str' is corrupted looks to be of interest. Does this mean there needs to be more debug information built in?
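To compare the two symbol tables without eyeballing them, the HIDDEN entries can be pulled out mechanically. This is only a sketch: `hidden_symbols` is a hypothetical helper, and it assumes the whitespace-separated column layout shown in the listings above (Num, Value, Size, Type, Bind, Vis, Ndx, Name). The sample rows are copied from those listings.

```python
def hidden_symbols(readelf_output):
    """Return the names of symbols whose Vis column is HIDDEN in `readelf -s` text."""
    names = []
    for line in readelf_output.splitlines():
        fields = line.split()
        # Symbol rows start with "<number>:" and need a Name column (8 fields).
        if len(fields) < 8 or not fields[0].rstrip(":").isdigit():
            continue
        if fields[5] == "HIDDEN":
            names.append(fields[7])
    return names


# Rows copied from the 5.2.8 listing above.
PLAIN_5_2_8 = """\
    18: 0000000000000000     0 NOTYPE  GLOBAL HIDDEN   UND vvar_vsyscall_gtod_data
    19: 0000000000000000     0 NOTYPE  GLOBAL HIDDEN   UND hvclock_page
    20: 0000000000000000     0 NOTYPE  GLOBAL HIDDEN   UND pvclock_page
    21: 0000000000000190   102 FUNC    GLOBAL DEFAULT    1 __vdso_clock_gettime
"""

# Rows copied from the 5.1-3 LTO listing above.
LTO_5_1_3 = """\
    18: 0000000000000000     0 NOTYPE  GLOBAL HIDDEN   UND vvar_vsyscall_gtod_data
    19: 00000000000000b0   111 FUNC    GLOBAL DEFAULT    1 __vdso_clock_gettime
"""

only_in_plain = sorted(set(hidden_symbols(PLAIN_5_2_8)) - set(hidden_symbols(LTO_5_1_3)))
print(only_in_plain)  # ['hvclock_page', 'pvclock_page']
```

Both builds reference `vvar_vsyscall_gtod_data` as a hidden symbol. The two extra entries in the 5.2.8 object may simply reflect different paravirtualization options between the two configs rather than anything LTO did.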

Promaethius commented 5 years ago

@jiblime I found this on the gcc site today: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-fuse-linker-plugin-916

When a file is compiled with -flto without -fuse-linker-plugin, the generated object file is larger than a regular object file because it contains GIMPLE bytecodes and the usual final code (see -ffat-lto-objects). This means that object files with LTO information can be linked as normal object files; if -fno-lto is passed to the linker, no interprocedural optimizations are applied. Note that when -fno-fat-lto-objects is enabled the compile stage is faster but you cannot perform a regular, non-LTO link on them.

I've witnessed Andi Kleen's patchset passing -fno-fat-lto-objects without -fuse-linker-plugin. Will test this theory later today. This could explain why readelf is returning corruption, but pardon my ignorance if that's not the case.
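A quick way to tell which kind of object a build produced, sketched under some assumptions: GCC stores its LTO bytecode in sections whose names begin with `.gnu.lto_`, so a fat object should have both those sections and non-empty code, while a slim one has only the bytecode. The helper `classify_object` is hypothetical, it takes a section-name-to-size mapping that you would transcribe from `readelf -S`, and the sizes in the example are made up. It is a heuristic (an object with only data and no functions would be reported as slim).

```python
def classify_object(section_sizes):
    """Classify an object as "fat", "slim" or "regular" from {section name: size in bytes}."""
    has_lto = any(name.startswith(".gnu.lto_") for name in section_sizes)
    has_code = any(
        size > 0
        for name, size in section_sizes.items()
        if name == ".text" or name.startswith(".text.")
    )
    if has_lto and has_code:
        return "fat"      # bytecode plus final code: usable in a non-LTO link
    if has_lto:
        return "slim"     # bytecode only: needs an LTO-aware link
    return "regular"      # no LTO information at all


# Illustrative, made-up section sizes.
print(classify_object({".text": 512, ".gnu.lto_.decls": 2048}))  # fat
print(classify_object({".text": 0, ".gnu.lto_.decls": 2048}))    # slim
print(classify_object({".text": 512, ".data": 64}))              # regular
```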

jiblime commented 5 years ago

@Promaethius

I've witnessed Andi Kleen's patchset passing -fno-fat-lto-objects without -fuse-linker-plugin

That explains why he uses -fwhole-program and -fipa-cp-clone, since collect2 would be used instead of a linker. I'm assuming he's doing that for compatibility, as GCC documentation claims it's likely to increase code size vs. bfd/gold. I wonder if GentooLTO would be able to do something better...

I believe it's a glibc issue. I've upgraded to sys-libs/glibc-2.30::gentoo and have been able to get past it. Currently recompiling since some paravirtualization options, not sure which, cause it to error.

https://sourceware.org/ml/libc-alpha/2019-08/msg00029.html

* The dynamic linker no longer refuses to load objects which reference
  versioned symbols whose implementation has moved to a different soname
  since the object has been linked.  The old error message, symbol
  FUNCTION-NAME, version SYMBOL-VERSION not defined in file DSO-NAME with
  link time reference, is gone.

It emits a warning, I'm still not sure why since Andi Kleen filters LTO out of it from what I can tell.

Warnings emitted with V=2

  CC      arch/x86/entry/vdso/vdso32-setup.o - due to target missing
  LDS     arch/x86/entry/vdso/vdso.lds - due to target missing
  AS      arch/x86/entry/vdso/vdso-note.o - due to target missing
  CC      arch/x86/entry/vdso/vclock_gettime.o - due to target missing
In file included from ./arch/x86/include/asm/vgtod.h:5,
                 from arch/x86/entry/vdso/vclock_gettime.c:15:
arch/x86/entry/vdso/vclock_gettime.c: In function ‘do_hres’:
./include/linux/compiler.h:182:26: warning: array subscript 1 is outside array bounds of ‘u8[1]’ {aka ‘unsigned char[1]’} [-Warray-bounds]
  182 | case 8: *(__u64 *)res = *(volatile __u64 *)p; break; \
      | ^~~~~~~~~~~~~~~~~~~~
./include/linux/compiler.h:193:2: note: in expansion of macro ‘__READ_ONCE_SIZE’
  193 | __READ_ONCE_SIZE;
      | ^~~~~~~~~~~~~~~~
arch/x86/entry/vdso/vclock_gettime.c:37:11: note: while referencing ‘hvclock_page’
   37 | extern u8 hvclock_page
      | ^~~~~~~~~~~~
In file included from ./arch/x86/include/asm/vgtod.h:5,
                 from arch/x86/entry/vdso/vclock_gettime.c:15:
./include/linux/compiler.h:182:26: warning: array subscript 2 is outside array bounds of ‘u8[1]’ {aka ‘unsigned char[1]’} [-Warray-bounds]
  182 | case 8: *(__u64 *)res = *(volatile __u64 *)p; break; \
      | ^~~~~~~~~~~~~~~~~~~~
./include/linux/compiler.h:193:2: note: in expansion of macro ‘__READ_ONCE_SIZE’
  193 | __READ_ONCE_SIZE;
      | ^~~~~~~~~~~~~~~~
arch/x86/entry/vdso/vclock_gettime.c:37:11: note: while referencing ‘hvclock_page’
   37 | extern u8 hvclock_page
      | ^~~~~~~~~~~~
  CC      arch/x86/entry/vdso/vgetcpu.o - due to target missing
  VDSO    arch/x86/entry/vdso/vdso64.so.dbg - due to target missing
  OBJCOPY arch/x86/entry/vdso/vdso64.so - due to target missing
  HOSTCC  arch/x86/entry/vdso/vdso2c - due to target missing
  VDSO2C  arch/x86/entry/vdso/vdso-image-64.c - due to target missing
  CC      arch/x86/entry/vdso/vdso-image-64.o - due to target missing
  LDS     arch/x86/entry/vdso/vdso32/vdso32.lds - due to target missing
  CC      arch/x86/entry/vdso/vdso32/vclock_gettime.o - due to target missing
In file included from ./arch/x86/include/asm/vgtod.h:5,
                 from arch/x86/entry/vdso/vdso32/../vclock_gettime.c:15,
                 from arch/x86/entry/vdso/vdso32/vclock_gettime.c:31:
arch/x86/entry/vdso/vdso32/../vclock_gettime.c: In function ‘do_hres’:
./include/linux/compiler.h:182:26: warning: array subscript 1 is outside array bounds of ‘u8[1]’ {aka ‘unsigned char[1]’} [-Warray-bounds]
  182 | case 8: *(__u64 *)res = *(volatile __u64 *)p; break; \
      | ^~~~~~~~~~~~~~~~~~~~
./include/linux/compiler.h:193:2: note: in expansion of macro ‘__READ_ONCE_SIZE’
  193 | __READ_ONCE_SIZE;
      | ^~~~~~~~~~~~~~~~
In file included from arch/x86/entry/vdso/vdso32/vclock_gettime.c:31:
arch/x86/entry/vdso/vdso32/../vclock_gettime.c:37:11: note: while referencing ‘hvclock_page’
   37 | extern u8 hvclock_page
      | ^~~~~~~~~~~~
In file included from ./arch/x86/include/asm/vgtod.h:5,
                 from arch/x86/entry/vdso/vdso32/../vclock_gettime.c:15,
                 from arch/x86/entry/vdso/vdso32/vclock_gettime.c:31:
./include/linux/compiler.h:182:26: warning: array subscript 2 is outside array bounds of ‘u8[1]’ {aka ‘unsigned char[1]’} [-Warray-bounds]
  182 | case 8: *(__u64 *)res = *(volatile __u64 *)p; break; \
      | ^~~~~~~~~~~~~~~~~~~~
./include/linux/compiler.h:193:2: note: in expansion of macro ‘__READ_ONCE_SIZE’
  193 | __READ_ONCE_SIZE;
      | ^~~~~~~~~~~~~~~~
In file included from arch/x86/entry/vdso/vdso32/vclock_gettime.c:31:
arch/x86/entry/vdso/vdso32/../vclock_gettime.c:37:11: note: while referencing ‘hvclock_page’
   37 | extern u8 hvclock_page
      | ^~~~~~~~~~~~
  AS      arch/x86/entry/vdso/vdso32/note.o - due to target missing
  AS      arch/x86/entry/vdso/vdso32/system_call.o - due to target missing
  AS      arch/x86/entry/vdso/vdso32/sigreturn.o - due to target missing
  VDSO    arch/x86/entry/vdso/vdso32.so.dbg - due to target missing
  OBJCOPY arch/x86/entry/vdso/vdso32.so - due to target missing
  VDSO2C  arch/x86/entry/vdso/vdso-image-32.c - due to target missing
  CC      arch/x86/entry/vdso/vdso-image-32.o - due to target missing

So as I understand, it would be a huge issue to have a textrel in a/the vdso because it'd be a vulnerability in a security feature. Gentoo's wiki actually has a guide on finding and fixing textrels: https://wiki.gentoo.org/wiki/Hardened/Textrels_Guide

But hopefully there's no need to recreate anything. While the vdso*.so files have a textrel flag marked on them, scanelf -T shows that there isn't anything that would point to it.

Glibc 2.29, GCC 9.1.0

 TYPE    PAX   PERM ENDIAN STK/REL/PTL TEXTREL RPATH BIND TEXTRELS FILE 
scanelf: scanelf_file_textrels(): ELF is missing relocation information
scanelf: scanelf_file_textrels(): ELF vdso32.so has TEXTREL markings but doesnt appear to have any real TEXTREL's !?
ET_DYN PeMRxS 0755 LE --- --- R-X TEXTREL   -   LAZY  vdso32.so

It did also emit this, though:

arch/x86/kernel/dumpstack.o: warning: objtool: show_regs.cold()+0x16: sibling call from callable instruction with modified stack frame
arch/x86/kernel/dumpstack.o: warning: objtool: show_regs()+0x0: stack state mismatch: cfa1=7+24 cfa2=7+8

So it looks like it can be possible, but definitely experimental and not a daily driver for myself. I'm going to be grabbing GCC 9.2 now so I won't be getting to it anytime soon (btw, I added 20G of swap with -j5 and it still failed, dammit), but if Glibc 2.30 is the fix, I think it'd be worth a shot to try using this kernel for testing.

If you were to use a linker instead of collect2, you can replace -fwhole-program with -fuse-linker-plugin in scripts/Makefile.lto, as the GNU documentation states it's best not to use the former with the latter. Optimizations that would also help LTO specifically would be -fdevirtualize-at-ltrans and the -fgraphite-identity -floop-nest-optimize options. I've used these along with other flags to compile and run my kernel, but if the linking stage is too much the process will overflow and it'll end.

What's interesting is that his newest version (as far as I can tell) lacks explicit linker usage but his older versions use -fuse-linker-plugin. So I could be wrong in assuming that removing -fwhole-program is the right way to go.

jiblime commented 4 years ago

Andi Kleen's lto-5.7-2 branch builds and I am currently running it. I've applied the 5.7.14 patch, Gentoo distro patches, and a few other misc. patches with no rejects.

Notes:

The size of my LTO'd kernel is 22M, modules folder is 800K. Vs. my normal kernel at 11M and modules folder at 71M
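Taking those four figures at face value, the combined on-disk footprint works out as below. This is only arithmetic on the numbers quoted above (800K is treated as 0.8M), and the two builds presumably differ in how much is built in versus left as modules, so it does not isolate the effect of LTO on code size.

```python
# Sizes in M as reported above; 800K is taken as 0.8M.
lto = {"kernel": 22.0, "modules": 0.8}
normal = {"kernel": 11.0, "modules": 71.0}

lto_total = sum(lto.values())
normal_total = sum(normal.values())
ratio = lto_total / normal_total

print(f"LTO: {lto_total:.1f}M, normal: {normal_total:.1f}M")  # LTO: 22.8M, normal: 82.0M
print(f"LTO footprint is {ratio:.0%} of the normal one")      # 28%
```

So the kernel image itself doubles, while kernel plus modules together shrink to under a third.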


Semi-related:

GCC 10's -O2 might be slightly slower than GCC 9's -O2

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96337#c15

Inliner changes was not targetting to make compile time faster and
compiled code slower. It was intended to reflect more closely modern C++
codebases and get faster binaries (at -O2 and -O2 -flto) without
regressing in code sizes.  In fact more inlining happens and thus we
needed to optimize inliner code carefully to avoid regressions with LTO.

If you have a -march=znver1/znver2 processor and run x86_64 multilib, rebuilding the current GCC 10.2.0 would mean a nice performance boost with this patch:

patch 1, patch 2

Refer to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95435


Correction 1: I incorrectly assumed modules weren't supported with -flto. While building everything into the kernel alleviated the issues, namely framebuffer and Logitech USB support, kernel compilation time was too long and I prefer being able to reload modules. The likely culprit in module failure was TRIM_UNUSED_KSYMS and possibly dracut defaulting to --strip the generated initrd; can't say for certain yet. I didn't get around to testing it enough but now I am able to load amdgpu in my initrd as usual instead of compiling it in.

telans commented 4 years ago

* It feels fast, that counts

Can you describe in what way?

Cheers for the gcc links too

barolo commented 4 years ago

oooh, imma test

barolo commented 4 years ago

@jiblime Could you list the patches applied? All are from gentoo's ebuild?

telans commented 4 years ago

@jiblime Could you list the patches applied? All are from gentoo's ebuild?

I haven't built it yet, but this patch applies fine to gentoo-sources-5.8 (just a diff from the lto-5.8.0-1 branch)

https://gist.github.com/telans/728b63dd07c41c9ca6e2ca3d4431db8e


Doesn't build for me unfortunately, lots of:

/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: ./.tmp_vmlinux.kallsyms1.mJXteD.ltrans123.ltrans.o: relocation R_X86_64_32S against `.data' can not be used when making a PIE object; recompile with -fPIE
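When there are hundreds of these, it can help to see whether they come only from LTO partition objects (the `.ltrans` files) or from ordinary objects as well. A small sketch for that, with assumptions flagged: `summarize_reloc_errors` is a hypothetical helper, and the regular expression is fitted to the message shape shown above, not taken from any ld specification. The second sample line is an error of the same shape against a regular object file.

```python
import re
from collections import Counter

# Fitted to messages like:
#   <path>/ld: <object>: relocation R_X86_64_32S against `.data' can not be used ...
RELOC_ERROR = re.compile(
    r"ld: (?P<obj>\S+): relocation (?P<rtype>R_\w+) against "
    r"(?:hidden )?(?:symbol )?`(?P<target>[^']+)'"
)


def summarize_reloc_errors(log_text):
    """Count relocation errors per (origin, relocation type).

    origin is "ltrans" for LTO partition objects and "object" for everything else.
    """
    counts = Counter()
    for line in log_text.splitlines():
        match = RELOC_ERROR.search(line)
        if match is None:
            continue
        origin = "ltrans" if ".ltrans" in match["obj"] else "object"
        counts[(origin, match["rtype"])] += 1
    return counts


SAMPLE = """\
/usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/../../../../x86_64-pc-linux-gnu/bin/ld: ./.tmp_vmlinux.kallsyms1.mJXteD.ltrans123.ltrans.o: relocation R_X86_64_32S against `.data' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/kernel/head_64.o: relocation R_X86_64_32S against symbol `early_top_pgt' can not be used when making a PIE object; recompile with -fPIE
"""

for (origin, rtype), count in sorted(summarize_reloc_errors(SAMPLE).items()):
    print(origin, rtype, count)
```

If both origins show up, the problem is less likely to be specific to the LTO partitions and more likely to be in how the final link is invoked.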

telans commented 4 years ago

also, there's no 5.7.14 patch

https://cdn.kernel.org/pub/linux/kernel/v5.x/patch-5.7.14.xz

barolo commented 4 years ago

that patch is already applied... nvm, messed up something, ended with upstream master somehow... this will impact my ssd most def

jiblime commented 4 years ago

@telans No problem. The first thing I noticed was my dmesg timestamps were lower than usual :p ideally I'll set up a phoronix benchmark to have actual data.

relocation R_X86_64_32S

Are you using ld.gold as your default linker? The Linux kernel needs either GCC/ld.bfd or Clang/ld.lld. https://github.com/InBetweenNames/gentooLTO/issues/338

sys-devel/gcc-10.2.0::gentoo was built with the following:
USE="(cxx) fortran graphite lto (multilib) nls nptl objc openmp pch pgo sanitize ssp zstd (-ada) -d -debug -doc (-fixed-point) -go (-hardened) -jit (-libssp) -objc++ -objc-gc -pie -systemtap -test -vanilla -vtv" ABI_X86="(64)"
sys-devel/binutils-2.34-r2::gentoo was built with the following:
USE="gold multitarget nls plugins static-libs -default-gold -doc -test" ABI_X86="(64)"

@barolo https://github.com/jiblime/linux-misc/commits/lto-5.7-prjc-r3 You can pull the patches from here or clone the single branch and build off that. The CPPC patch doesn't work for me, so I leave it off just in case it would cause me to fail to boot. It's a bit messy, I'm still not the greatest at making clean commits. I chose the 5.7-2 branch instead of 5.8 because I wanted to try the Project C scheduler (previously named BMQ, now abbreviated prjc). I'll try the 5.8 branch sometime.

I generally download a vanilla tarball from kernel.org (v5.7, v5.8, etc) and apply the Gentoo patches and incremental patches afterwards. That way I don't have to worry about rejected patches as often

barolo commented 4 years ago

@jiblime thanks for the branch, made it much easier for me. Compiling

barolo commented 4 years ago

compiled almost cleanly for me, didn't take that long either, had a bunch of "-Wstringop-overflow" warnings for the Bluetooth module. Didn't boot for me, with an error related to scsi. With modules built in it is 20M, modules dir is 1M. I have nvme and amdgpu on that box, gonna try to strip it a bit more

barolo commented 4 years ago

Narrowed it down: hidpp/logitech's stuff makes it crash, and it doesn't switch to amdgpu output. @jiblime it seems like you've had similar issues, how did you solve them? Edit: Cleaned it a bit, built amdgpu, bluetooth, and logitech hidpp as modules; the remaining issue seems to be that the framebuffer isn't being switched during boot

telans commented 4 years ago

Are you using ld.gold as your default linker? The Linux kernel needs either GCC/ld.bfd or Clang/ld.lld.

Nope, using ld.bfd (or at least I haven't changed it).

sys-devel/gcc-10.2.0::gentoo was built with the following:
USE="(cxx) fortran graphite lto (multilib) nls nptl openmp pch pgo (pie) sanitize ssp vtv zstd (-ada) -d -debug -doc (-fixed-point) -go (-hardened) (-jit) (-libssp) -objc -objc++ -objc-gc -systemtap -test -vanilla" ABI_X86="(64)"
sys-devel/binutils-2.34-r2::gentoo was built with the following:
USE="gold nls plugins -default-gold -doc -multitarget -static-libs -test" ABI_X86="(64)"

Forcing LD=ld.bfd doesn't change anything either. I thought it might have been an issue with ripping a patch from the lto-5.8-1 branch; however, building the branch itself fails with the same relocation errors


Same issue with lto-5.7-2

barolo commented 4 years ago

Update, managed to run it and reach the desktop. The issue was with building all modules in. So I took my working config as base, used genkernel and made sure that it runs without LTO enabled first, then enabled LTO and booted into desktop successfully. Ended with a bunch of drivers disabled, most importantly for network and sata, luckily my main is a pcie one. Each failed module had disagrees about version of symbol module_layout in dmesg, gonna investigate it now.

Edit. It seems that all of those are modules that weren't built in, so it seems that initramfs isn't working for me Edit2. I'm typing from it, had to recompile it cleanly, cleaned it a bit and built some stuff in, module loading doesn't seem to work as I still got two of those disagrees... warnings
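To collect the affected modules in one go instead of scrolling through dmesg, something like the following could work. It is a sketch: `mismatched_modules` is a hypothetical helper, the pattern assumes the message looks like "<module>: disagrees about version of symbol <symbol>", and the module names and timestamps in the sample are made up for illustration.

```python
import re

# Assumed message shape: "<module>: disagrees about version of symbol <symbol>"
MISMATCH = re.compile(
    r"(?P<module>[\w-]+): disagrees about version of symbol (?P<symbol>\w+)"
)


def mismatched_modules(dmesg_text):
    """Return the sorted, de-duplicated module names that hit a symbol version mismatch."""
    return sorted({m["module"] for m in MISMATCH.finditer(dmesg_text)})


# Made-up sample; real input would be the saved output of dmesg.
SAMPLE = """\
[    3.141593] e1000e: disagrees about version of symbol module_layout
[    3.141601] ahci: disagrees about version of symbol module_layout
[    3.150000] e1000e: disagrees about version of symbol module_layout
[    4.000000] EXT4-fs (nvme0n1p2): mounted filesystem
"""

print(mismatched_modules(SAMPLE))  # ['ahci', 'e1000e']
```

The resulting list is a starting point for deciding which drivers to switch to built-in.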

Can't really compare it yet, since it seems to use diff schedulers than I had with zen kernel, and spends more time at lower frequencies, would have to bench it properly to test it seriously.

I can already tell though that building that kernel is significantly faster under it

Promaethius commented 4 years ago

My gut tells me it has something to do with the -fPIE flag

On Sun, Aug 9, 2020, 3:27 AM Greg Shuiske notifications@github.com wrote:

Update, managed to run it and reach the desktop. The issue was with building all modules in. So I took my working config as base, used genkernel and made sure that it runs without LTO enabled first, then enabled LTO. Ended with a bunch of drivers disabled, most importantly for network and sata, luckily my main is a pcie one. Each module had disagrees about version of symbol module_layout in dmesg, gonna investigate it now.


barolo commented 4 years ago

@Promaethius I've solved that by changing the modules with warnings to built-in. It's running fine so far, gonna bench it with something now. My whole kernel with built-in stuff is 10 MB, with a useless 4 MB initramfs, for a gaming desktop

gottaeat commented 4 years ago

when building linux 5.7.14 w/ the lto-5.7-2 branch, using ld.bfd from binutils 2.35 and gcc-10.1.0, i'm getting countless errors, all telling me to recompile with -fPIE, similar to what @telans has mentioned:

  DESCEND  objtool
  CALL    scripts/atomic/check-atomics.sh
  CALL    scripts/checksyscalls.sh
  CHK     include/generated/compile.h
  GEN     .version
  CHK     include/generated/compile.h
  UPD     include/generated/compile.h
  CC      init/version.o
  AR      init/built-in.a
  LDFINAL vmlinux.o
kernel/bpf/core.c: In function 'bpf_patch_insn_single':
kernel/bpf/core.c:442:3: warning: writing 8 bytes into a region of size 0 [-Wstringop-overflow=]
  442 |   memcpy(prog->insnsi + off, patch, sizeof(*patch));
      |   ^
./include/linux/filter.h:550:20: note: at offset 0 to object 'insnsi' with size 0 declared here
  550 |   struct bpf_insn  insnsi[0];
      |                    ^
  MODPOST vmlinux.o
  MODINFO modules.builtin.modinfo
  GEN     modules.builtin
  LDFINAL .tmp_vmlinux.kallsyms1
/usr/bin/ld: arch/x86/kernel/head_64.o: relocation R_X86_64_32S against symbol `early_top_pgt' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/entry/entry_64.o: relocation R_X86_64_32S against symbol `cpu_tss_rw' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/entry/vdso/vma.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/entry/vdso/vdso32-setup.o: relocation R_X86_64_32S against symbol `vdso_image_32' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/entry/entry_64_compat.o: relocation R_X86_64_32S against symbol `cpu_tss_rw' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/kvm/vmx/vmenter.o: relocation R_X86_64_32S against symbol `kvm_rebooting' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/realmode/init.o: relocation R_X86_64_32S against `.rodata.str1.8' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/kernel/acpi/wakeup_64.o: relocation R_X86_64_32S against symbol `saved_magic' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/kernel/ftrace_64.o: relocation R_X86_64_32S against symbol `ftrace_trace_function' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/crypto/aesni-intel_asm.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/crypto/aesni-intel_avx-x86_64.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/crypto/ghash-clmulni-intel_asm.o: relocation R_X86_64_32S against `.rodata.cst16.bswap_mask' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/power/hibernate_asm_64.o: relocation R_X86_64_32S against symbol `saved_context' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans0.ltrans.o: relocation R_X86_64_32S against hidden symbol `cpu_hw_events' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans1.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans2.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans3.ltrans.o: relocation R_X86_64_32S against symbol `em_bsf' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans4.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans5.ltrans.o: relocation R_X86_64_32S against hidden symbol `cpu_bit_bitmap' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans6.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans7.ltrans.o: relocation R_X86_64_32S against `.text' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans8.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans9.ltrans.o: relocation R_X86_64_32S against `.bss' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans10.ltrans.o: relocation R_X86_64_32S against `.data..read_mostly' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans11.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans12.ltrans.o: relocation R_X86_64_32S against hidden symbol `sig_sicodes.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans13.ltrans.o: relocation R_X86_64_32S against `.data' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans14.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans15.ltrans.o: relocation R_X86_64_32S against hidden symbol `rcu_data.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans16.ltrans.o: relocation R_X86_64_32S against hidden symbol `tk_fast_mono.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans17.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans18.ltrans.o: relocation R_X86_64_32S against hidden symbol `trace_types_lock' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans19.ltrans.o: relocation R_X86_64_32S against hidden symbol `ftrace_common_fields.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans20.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans21.ltrans.o: relocation R_X86_64_32S against hidden symbol `init_task' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans22.ltrans.o: relocation R_X86_64_32S against hidden symbol `contig_page_data' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans23.ltrans.o: relocation R_X86_64_32S against hidden symbol `vma_interval_tree_augment_rotate.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans24.ltrans.o: relocation R_X86_64_32S against hidden symbol `vmap_area_list' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans25.ltrans.o: relocation R_X86_64_32S against `.data' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans26.ltrans.o: relocation R_X86_64_32S against hidden symbol `__per_cpu_offset' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans27.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans28.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans29.ltrans.o: relocation R_X86_64_32S against `.data..percpu' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans30.ltrans.o: relocation R_X86_64_32S against hidden symbol `bit_wait_table.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans31.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans32.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans33.ltrans.o: relocation R_X86_64_32S against hidden symbol `tty_drivers' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans34.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans35.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans36.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans37.ltrans.o: relocation R_X86_64_32S against `.text' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans38.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans39.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans40.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans41.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans42.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans43.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans44.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans45.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans46.ltrans.o: relocation R_X86_64_32S against `.rodata..c_jump_table' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans47.ltrans.o: relocation R_X86_64_32S against hidden symbol `queue_io_timeout_entry.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans48.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans49.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans50.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans51.ltrans.o: relocation R_X86_64_32S against hidden symbol `dev_attr_boot_vga.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans52.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans53.ltrans.o: relocation R_X86_64_32S against hidden symbol `v86d_path.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans54.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans55.ltrans.o: relocation R_X86_64_32S against hidden symbol `_ctype' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans56.ltrans.o: relocation R_X86_64_32S against hidden symbol `_ctype' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans57.ltrans.o: relocation R_X86_64_32S against `.bss' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans58.ltrans.o: relocation R_X86_64_32S against hidden symbol `vc_cons' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans59.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans60.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans61.ltrans.o: relocation R_X86_64_32S against `.text' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans62.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans63.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.8' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans64.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans65.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans66.ltrans.o: relocation R_X86_64_32S against hidden symbol `execlists_submit_request.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans67.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans68.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans69.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans70.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.8' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans71.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans72.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans73.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans74.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans75.ltrans.o: relocation R_X86_64_32S against hidden symbol `device_ktype.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans76.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans77.ltrans.o: relocation R_X86_64_32S against hidden symbol `dma_buf_fops.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans78.ltrans.o: relocation R_X86_64_32S against hidden symbol `scsi_host_type.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans79.ltrans.o: relocation R_X86_64_32S against `.text' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans80.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans81.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans82.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans83.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans84.ltrans.o: relocation R_X86_64_32S against hidden symbol `ant_toggle_lookup.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans85.ltrans.o: relocation R_X86_64_32S against hidden symbol `dev_attr_manufacturer.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans86.ltrans.o: relocation R_X86_64_32S against `.data' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans87.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans88.ltrans.o: relocation R_X86_64_32S against hidden symbol `scsi_host_type.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans89.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans90.ltrans.o: relocation R_X86_64_32S against `.data' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans91.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans92.ltrans.o: relocation R_X86_64_32S against `.data' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans93.ltrans.o: relocation R_X86_64_32S against hidden symbol `tpacpi_all_drivers.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans94.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans95.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans96.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans97.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans98.ltrans.o: relocation R_X86_64_32S against `.bss' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans99.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans100.ltrans.o: relocation R_X86_64_32S against hidden symbol `rtnl_af_ops.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans101.ltrans.o: relocation R_X86_64_32S against hidden symbol `offload_base.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans102.ltrans.o: relocation R_X86_64_32S against hidden symbol `loggers.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans103.ltrans.o: relocation R_X86_64_32S against `.text' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans104.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans105.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans106.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans107.ltrans.o: relocation R_X86_64_32S against `.text' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans108.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans109.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans110.ltrans.o: relocation R_X86_64_32S against `.bss' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans111.ltrans.o: relocation R_X86_64_32S against hidden symbol `ptype_lock.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans112.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans113.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans114.ltrans.o: relocation R_X86_64_32S against hidden symbol `init_uts_ns' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans115.ltrans.o: relocation R_X86_64_32S against hidden symbol `swevent_htable.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans116.ltrans.o: relocation R_X86_64_32S against hidden symbol `ieee802_1d_to_ac' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans117.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans118.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans119.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans120.ltrans.o: relocation R_X86_64_32S against hidden symbol `tcf_action_policy.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans121.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.8' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans122.ltrans.o: relocation R_X86_64_32S against hidden symbol `con2fb_map.lto_priv.0' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans123.ltrans.o: relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans124.ltrans.o: relocation R_X86_64_32S against symbol `_stext' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans125.ltrans.o: relocation R_X86_64_32S against hidden symbol `init_mm' can not be used when making a PIE object
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans126.ltrans.o: relocation R_X86_64_32S against `.rodata.str1.1' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: ./.tmp_vmlinux.kallsyms1.KgBLDA.ltrans127.ltrans.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/lib/getuser.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: arch/x86/lib/putuser.o: relocation R_X86_64_32S against symbol `current_task' can not be used when making a PIE object; recompile with -fPIE
/usr/bin/ld: drivers/base/firmware_loader/builtin/iwlwifi-4965-2.ucode.gen.o: warning: relocation in read-only section `.builtin_fw'
collect2: error: ld returned 1 exit status
make: *** [Makefile:1111: vmlinux] Error 1
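
The log above is long, but nearly every line is the same R_X86_64_32S-versus-PIE complaint repeated for each LTO partition. A minimal sketch for confirming that from a saved log; the function name and the `build.log` file are hypothetical, not something from the kernel tree:

```shell
# Reads a linker log on stdin and prints how many lines report the
# R_X86_64_32S relocation being rejected for a PIE link.
# A high count with no other error kinds points at a single root cause.
count_pie_reloc_errors() {
    grep -c 'relocation R_X86_64_32S against .* can not be used when making a PIE object'
}
```

Assuming the build output was captured with something like `make ... 2>&1 | tee build.log`, running `count_pie_reloc_errors < build.log` gives the number of affected objects.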
jiblime commented 4 years ago

My gut tells me it has something to do with the -fPIE flag

I agree. It looks like PIE may be the main reason why there are build issues. I'm unsure of what implementation is best to remove auto PIE just for the kernel.

I tested each of gcc -m32 {-fpie ; -fPIE ; -Wl,-pie} file.c -o m32 in turn:

$ gcc -m32 -Wl,-pie file.c -o m32
$ file m32

ELF 32-bit LSB pie executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 3.2.0, not stripped

Only -Wl,-pie worked, so would -Wl,-no-pie or -Wl,-no-PIE be ideal? Except that it errored for me, and file tells me that -m64 builds aren't PIE even with those set. It could be a problem with my toolchain or my inexperience.
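
As a cross-check on what `file` reports, the ELF header type can be read with `readelf -h`. This is only a sketch, assuming binutils' `readelf` is installed; the function name is made up for illustration:

```shell
# Classify an ELF binary from `readelf -h` output read on stdin.
# DYN is what both PIE executables and shared objects report;
# EXEC is a traditional fixed-address (non-PIE) executable.
elf_link_kind() {
    awk '
        $1 == "Type:" {
            kind = "other"
            if ($2 == "DYN")  kind = "pie-or-shared"
            if ($2 == "EXEC") kind = "non-pie"
            print kind
            found = 1
            exit
        }
        END { if (!found) print "unknown" }
    '
}
```

Usage would be `readelf -h m32 | elf_link_kind`. Note that DYN alone does not distinguish a PIE from a shared library, so this only rules PIE in or out for something known to be an executable.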


I'd suggest just rebuilding GCC without pie, though you may need to rebuild multiple parts of your toolchain that have been emerged with pie, namely binutils/glibc/libtool, but some other suggestions you can try first:

make KCFLAGS="-fno-pie -fno-PIE -Wl,-no-pie" LDFLAGS_MODULE="Wl,-no-pie" -j"$(grep processor /proc/cpuinfo | wc -l)"

or

In ./scripts/Makefile.build, I believe these are inherited for all builds unless the Makefile has a specific override.

EXTRA_AFLAGS   :=
EXTRA_CFLAGS   :=
EXTRA_CPPFLAGS :=
EXTRA_LDFLAGS  :=

You can also append V=1 to make and get verbose output to see what flags are being used if that may help:

With the suggested KCFLAGS and LDFLAGS_MODULE
```
gcc -Wp,-MD,drivers/gpio/.gpiolib-sysfs.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Wno-format-security -std=gnu89 -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -march=native -mno-red-zone -mcmodel=kernel -Wno-sign-compare -fno-asynchronous-unwind-tables -mindirect-branch=thunk-extern -mindirect-branch-register -fno-jump-tables -fno-delete-null-pointer-checks -Wno-frame-address -Wno-format-truncation -Wno-format-overflow -Wno-address-of-packed-member -O2 -fno-allow-store-data-races -fstack-protector-strong -Wno-unused-but-set-variable -Wimplicit-fallthrough -Wno-unused-const-variable -fomit-frame-pointer -fno-var-tracking-assignments -fno-inline-functions-called-once -Wdeclaration-after-statement -Wvla -Wno-pointer-sign -Wno-stringop-truncation -Wno-zero-length-bounds -Wno-array-bounds -Wno-stringop-overflow -Wno-restrict -Wno-maybe-uninitialized -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fno-stack-check -fconserve-stack -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -fmacro-prefix-map=./= -fcf-protection=none -Wno-packed-not-aligned -flto -flto-compression-level=9 -fno-fat-lto-objects -fno-pie -fno-PIE -Wl,-no-pie -DKBUILD_MODFILE='"drivers/gpio/gpiolib-sysfs"' -DKBUILD_BASENAME='"gpiolib_sysfs"' -DKBUILD_MODNAME='"gpiolib_sysfs"' -c -o drivers/gpio/gpiolib-sysfs.o drivers/gpio/gpiolib-sysfs.c
```

And without:

```
gcc -Wp,-MD,kernel/power/.qos.o.d -nostdinc -isystem /usr/lib/gcc/x86_64-pc-linux-gnu/10.2.0/include -I./arch/x86/include -I./arch/x86/include/generated -I./include -I./arch/x86/include/uapi -I./arch/x86/include/generated/uapi -I./include/uapi -I./include/generated/uapi -include ./include/linux/kconfig.h -include ./include/linux/compiler_types.h -D__KERNEL__ -Wall -Wundef -Werror=strict-prototypes -Wno-trigraphs -fno-strict-aliasing -fno-common -fshort-wchar -fno-PIE -Werror=implicit-function-declaration -Werror=implicit-int -Wno-format-security -std=gnu89 -mno-sse -mno-mmx -mno-sse2 -mno-3dnow -mno-avx -m64 -falign-jumps=1 -falign-loops=1 -mno-80387 -mno-fp-ret-in-387 -mpreferred-stack-boundary=3 -mskip-rax-setup -march=native -mno-red-zone -mcmodel=kernel -Wno-sign-compare -fno-asynchronous-unwind-tables -mindirect-branch=thunk-extern -mindirect-branch-register -fno-jump-tables -fno-delete-null-pointer-checks -Wno-frame-address -Wno-format-truncation -Wno-format-overflow -Wno-address-of-packed-member -O2 -fno-allow-store-data-races -fstack-protector-strong -Wno-unused-but-set-variable -Wimplicit-fallthrough -Wno-unused-const-variable -fomit-frame-pointer -fno-var-tracking-assignments -fno-inline-functions-called-once -Wdeclaration-after-statement -Wvla -Wno-pointer-sign -Wno-stringop-truncation -Wno-zero-length-bounds -Wno-array-bounds -Wno-stringop-overflow -Wno-restrict -Wno-maybe-uninitialized -fno-strict-overflow -fno-merge-all-constants -fmerge-constants -fno-stack-check -fconserve-stack -Werror=date-time -Werror=incompatible-pointer-types -Werror=designated-init -fmacro-prefix-map=./= -fcf-protection=none -Wno-packed-not-aligned -flto -flto-compression-level=9 -fno-fat-lto-objects -DDEBUG -DKBUILD_MODFILE='"kernel/power/qos"' -DKBUILD_BASENAME='"qos"' -DKBUILD_MODNAME='"qos"' -c -o kernel/power/qos.o kernel/power/qos.c
```
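
Command lines this long are hard to compare by eye. A rough sketch for checking whether one specific flag made it into a V=1 compile line; the helper name is hypothetical:

```shell
# Succeeds (exit 0) if the exact flag given as $1 appears as a whole
# word in the compiler command line read from stdin.
# Splitting on spaces keeps -flto from matching -flto-compression-level=9.
cmdline_has_flag() {
    tr ' ' '\n' | grep -Fxq -- "$1"
}
```

For example, pasting one of the lines above into a file and running `cmdline_has_flag -fno-pie < line.txt && echo present` would show whether the extra KCFLAGS were picked up (`line.txt` is just a placeholder name).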

Some more info on PIE:

https://wiki.gentoo.org/wiki/Hardened/Toolchain#Position_Independent_Executables_.28PIEs.29

https://stackoverflow.com/questions/34519521/why-does-gcc-create-a-shared-object-instead-of-an-executable-binary-according-to

https://stackoverflow.com/questions/43367427/32-bit-absolute-addresses-no-longer-allowed-in-x86-64-linux

jiblime commented 4 years ago

Some issues I found with the v5.8-1 linux-misc LTO branch:

config SINGLE_LINK

Problem: Final stages of kernel compile failed

```
Inconsistent kallsyms data
Try make KALLSYMS_EXTRA_PASS=1 as a workaround
make: *** [Makefile:1142: vmlinux] Error 1
KALLSYMS_SINGLE
```

Appending KALLSYMS_EXTRA_PASS=1 did not work. Disabling the option and rebuilding the kernel fixed that issue.

KALLSYMS_SINGLE

Kernel failed to boot. Only tested this one a couple of times.

CRYPTO_DEV_CCP_DD

An AMD-specific thing; to my knowledge it has never worked for anyone. It may cause your kernel to not boot if you have it built in. I used to have it as a module until I realized nobody has it working on Linux.

```
> dmesg | grep ccp
ccp 0000:09:00.1: runtime IRQ mapping not provided by arch
ccp 0000:09:00.1: enabling device (0000 -> 0002)
ccp 0000:09:00.1: enabling bus mastering
ccp 0000:09:00.1: ccp: unable to access the device: you might be running a broken BIOS.
```

Interesting thread on AMD's cryptographic coprocessor: https://forum.gigabyte.us/thread/9479/bug-linux-x570-aorus-initialize

Regarding module configs

What I currently have:

```
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_MODULE_SIG=y
CONFIG_MODULE_SIG_FORCE=y
CONFIG_MODULE_SIG_ALL=y
# CONFIG_MODULE_SIG_SHA1 is not set
# CONFIG_MODULE_SIG_SHA224 is not set
CONFIG_MODULE_SIG_SHA256=y
# CONFIG_MODULE_SIG_SHA384 is not set
# CONFIG_MODULE_SIG_SHA512 is not set
CONFIG_MODULE_SIG_HASH="sha256"
# CONFIG_MODULE_COMPRESS is not set
# CONFIG_MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS is not set
# CONFIG_UNUSED_SYMBOLS is not set
# CONFIG_TRIM_UNUSED_KSYMS is not set
```

This is for 5.8.0; what works for me might not work for you.

Options unrelated to LTO

These are needed for my system, UEFI with systemd-boot as my boot manager:

```
EFI_STUB=y
This is required when booting a kernel + initrd with systemd-boot
https://www.freedesktop.org/wiki/Software/systemd/systemd-boot/

X86_SYSFB and FB_SIMPLE
DRM and AMDGPU
Required to get simple framebuffer to load, then to switch over to amdgpu

EFIVAR_FS=y and EFI_VARS=n
The first one is modern, I don't know why the second one still exists.

EXTRA_FIRMWARE
I've always done this, but you can also try embedding driver firmware into the kernel. (the binary blobs in /lib/firmware)
```

Gentoo's distro patches, which auto-select the options most systems need to boot, are very helpful.

```
# My current dracut invoke -- never used the stub before now but "shrug"
dracut --kver {kernel-version} --no-compress --fstab --uefi-stub /usr/lib/systemd/boot/efi/linuxx64.efi.stub

# My dracut.conf
add_dracutmodules+=" systemd systemd-initrd busybox dracut-systemd kernel-modules drm usrmount fs-lib crypt dm lvm lvmmerge securityfs "
omit_dracutmodules+="i18n shutdown ecryptfs nbd dmraid mdraid plymouth bootchart dash btrfs stratis cifs nfs biosdevname iscsi fcoe fcoe-uefi"
add_drivers+=" amdgpu vfio_pci vfio vfio_iommu_type1 vfio_virqfd " # I comment this out when the modules are built in
```

compiled almost cleanly for me, didn't take that long either, had a bunch of "-Wstringop-overflow" warnings for the Bluetooth module. Didn't boot for me, with an error related to scsi. With modules built in it is 20M, modules dir is 1M. I have nvme and amdgpu on that box, gonna try to strip it a bit more

I got those too, for Bluetooth and a couple of other things. Apparently it's related to GCC aggressively inlining for no reason, and it's been a thing for a while. I believe the fix is to directly tell GCC not to inline in those modules. I think it would be a good idea to make a list of modules for which GCC should be told not to inline.

I'm sure there's a kernel doc on that somewhere...Linus really hates -O3 for its inlining lol. Sidenote, GCC 10 no longer enables -funroll-loops at any opt level. Side-side note: RDRAND is re-enabled, rerun cpuid2cpuflags to see if your list has changed

barolo commented 4 years ago

I have GCC built without -fPIE. That would explain why I'm succeeding. I still can't load modules though

jiblime commented 4 years ago

@barolo Perhaps try KALLSYMS_ALL?

gottaeat commented 4 years ago

@jiblime are you suggesting that rebuilding of a toolchain w/o pie as a possible solution to my problem here? i don't have any modules.

barolo commented 4 years ago

@jiblime I'm yet to try KALLSYMS_ALL but I've already added -fdevirtualize-at-ltrans -fgraphite-identity -floop-nest-optimize, checked that it's actually added during compilation and it works! Running it right now. Now, to complete the trifecta, PGO would be needed xD

telans commented 4 years ago

added -fdevirtualize-at-ltrans -fgraphite-identity -floop-nest-optimize

Benchmarks would be awesome if you could


@barolo Mirroring a phoronix kernel benchmark would probably be the easiest with benches such as: https://www.phoronix.com/scan.php?page=article&item=linux-50-sliding&num=2

barolo commented 4 years ago

@telans I was just about to ask... xD Could someone recommend a simple bench to test the kernel? ( possibly not Phoronix.. )

Edit. Averaged Motionmark 1.1 is 20% faster than under Zen kernel, both set to performance mode. I will try the same kernel now just without LTO/graphite, to have a valid comparison.

jiblime commented 4 years ago

@mssx86 Yes. Ideally your toolchain and @system. If that is too time consuming it'd probably be enough to just rebuild GCC and binutils. There was a discussion about it: https://github.com/InBetweenNames/gentooLTO/issues/261 It has a paper on the overhead that PIE introduces and how to unmask the (pie) useflag.

@barolo I did too. I'm surprised it's running haha! Check out this patch too, I think it might make sense: https://raw.githubusercontent.com/jiblime/clear-ck-gentoo-sources/5.8-lto-exp/misc/0004-Add-option-to-use-fno-semantic-interposition-and-fde.patch

I also added -fno-inline-functions -fno-inline-functions-called-once --param=large-stack-frame-growth=100 to my KCFLAGS, no more overflow warnings.
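For anyone following along, here is a minimal sketch of how those flags could be handed to the kernel build through KCFLAGS, the same variable used in the make invocation earlier in the thread. The helper name `kernel_make_cmd` and the job count are hypothetical, and the helper only prints the command rather than running it.

```shell
# Flags quoted in the comment above; nothing else is added here.
KCFLAGS_EXTRA="-fno-inline-functions -fno-inline-functions-called-once --param=large-stack-frame-growth=100"

# Hypothetical helper: print the make invocation for a given job count
# (defaults to 8) instead of executing it.
kernel_make_cmd() {
    printf 'make -j%s KCFLAGS="%s"\n' "${1:-8}" "$KCFLAGS_EXTRA"
}

kernel_make_cmd 8
```

Printing first makes it easy to check that the flags end up on the command line before committing to a long LTO build.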

And I agree @telans, a standard benchmark would be nice to measure against. It would be a lot easier to quantify/benchmark changes in the kernel vs. something system-wide. These are from perf bench, but I wouldn't put any weight on them.

5.8 LTO
```
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.077 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes

     Total time: 1.887 [sec]

       1.887349 usecs/op
         529843 ops/sec
```
5.7.14 no LTO
```
# Running sched/messaging benchmark...
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

     Total time: 0.091 [sec]

# Running sched/pipe benchmark...
# Executed 1000000 pipe operations between two processes

     Total time: 5.107 [sec]

       5.107420 usecs/op
         195793 ops/sec
```

If only Phoronix suites were so quick 😛

barolo commented 4 years ago

@jiblime ~so you have fully working 5.8 now? any differences from 5.7 regarding LTO?~ Could you explain those additional flags?

It's definitely quicker; you can see and feel it even without benches. The difference in execution time is so significant that I initially thought it was the different scheduler lowering peak CPU usage. Are you doing Phoronix? That would save me from some pain...

Edit. I'm on zen patched 5.8 lto kernel now, for fair comparison since Zen is my kernel of choice due to fsync. Doing some phoronix tests now

gottaeat commented 4 years ago

@jiblime can confirm that after rebuilding gcc with --disable-default-pie, i managed to build and boot with the lto'd kernel. thanks a bunch.
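One quick way to see how a given GCC was configured is to look at the configure line that `gcc -v` prints. This is a rough sketch, assuming the option shows up literally in that output. `default_pie_status` is a made-up helper name, and `--enable-default-pie` is the counterpart of the `--disable-default-pie` option mentioned above.

```shell
# Hypothetical helper: classify a `gcc -v` dump by its default-PIE configure option.
default_pie_status() {
    case "$1" in
        *--enable-default-pie*)  echo "default-pie enabled" ;;
        *--disable-default-pie*) echo "default-pie disabled" ;;
        *)                       echo "not stated in configure line" ;;
    esac
}

# gcc -v writes its configuration to stderr, hence the redirect.
default_pie_status "$(gcc -v 2>&1)"
```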

telans commented 4 years ago

Same here, sys-devel/gcc without (pie) works a treat. I did rebuild @system just to be safe too.

I also encountered the same issue with config SINGLE_LINK @jiblime. However, my kernel isn't any larger than before (no external modules aside from nvidia) at 9.3M vs the old at 9.5M


Boots fine, but I'm coming across `nvidia: disagrees about version of symbol module_layout`. I rebuilt multiple times with the kernel loaded/unloaded etc. Even when LTO is disabled in the kernel config options the error appears. The 'fix' for me was to remove the LTO patch I mentioned above.

Looks like this might be as far as I can go with this until I get an AMD card or something along those lines. Any other suggestions? Otherwise I'm not sure what to do.

Thanks all

jiblime commented 4 years ago

> Could you explain those additional flags?

@barolo -fno-inline-functions turns off automatic inlining of any function GCC considers suitable, which restores the -O2 default of GCC 9 (GCC 10 enabled -finline-functions at -O2). The second one probably isn't helpful to disable. The third reduces the maximum stack frame growth caused by inlining from +1000% to +100%. I chose those because overly aggressive inlining in some of the kernel's code only makes it bigger with no benefit and, if I understand correctly, large stacks in the kernel mean some instructions won't fit in the CPU cache -> slower

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49194 https://lwn.net/Articles/166172/

@telans I've found that rebuilding the kernel will remove some symbols from Module.symvers. What I did to fix that was back up my .config, run make mrproper and move the config back. After rebuilding the kernel, a new Module.symvers will be generated. I also have MODVERSIONS and MODULE_SRCVERSION_ALL off.
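The steps above can be sketched roughly as follows. This is an illustration rather than the exact commands used: the function name `regen_symvers`, the default source path `/usr/src/linux`, and the `MAKE` override are all assumptions.

```shell
# Sketch: regenerate Module.symvers from a clean tree while keeping .config.
# $1 is the kernel source directory (assumed default: /usr/src/linux).
# MAKE can be overridden; it defaults to plain `make`.
regen_symvers() {
    ksrc=${1:-/usr/src/linux}
    make_cmd=${MAKE:-make}
    backup=$(mktemp) || return 1

    cp "$ksrc/.config" "$backup" || return 1          # back up the config outside the tree
    "$make_cmd" -C "$ksrc" mrproper || return 1       # wipes build products, including .config
    cp "$backup" "$ksrc/.config" || return 1          # move the config back
    "$make_cmd" -C "$ksrc" -j"$(nproc)" || return 1   # full rebuild writes a fresh Module.symvers
    rm -f "$backup"
}
```

It would be called as `regen_symvers /usr/src/linux`. External modules such as nvidia-drivers would then need rebuilding against the new tree.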

I think external kernel modules are affected by what the kernel was built with. Maybe try adding -ffat-lto-objects to x11-drivers/nvidia-drivers?

barolo commented 4 years ago

Finally managed to get things sorted out and have some sane results: it's Zen 5.8 (-O3, native) vs. Zen 5.8 (-O3, native, graphite, LTO) @telans [ it's a 4-core AMD APU ]

G'MIC [ graphics processing suite ]
```
G'MIC
Test: 2D Function Plotting, 1000 Times
Seconds < Lower Is Better
gmic-zen-LTO_0 ... 120.13
gmic-zen_0 ....... 134.51

G'MIC
Test: Plotting Isosurface Of A 3D Volume, 1000 Times
Seconds < Lower Is Better
gmic-zen-LTO_0 .. 17.11
gmic-zen_0 ...... 20.17

G'MIC
Test: 3D Elevated Function In Random Colors, 100 Times
Seconds < Lower Is Better
gmic-zen-LTO_0 .. 96.70
gmic-zen_0 ...... 111.35
```
GL-vs-VK [ Vulkan versus OpenGL GPU tests ]
```
GL-vs-VK 2017-06-05
Test: Static Scene - API: OpenGL - Multi-Threaded: No
Frame Time - ms < Lower Is Better
gl-vs-vk-zen-LTO_-0 . 156.71
gl-vs-vk-zen_-0 ..... 171.59

GL-vs-VK 2017-06-05
Test: Static Scene - API: OpenGL - Multi-Threaded: No
FPS > Higher Is Better
gl-vs-vk-zen-LTO_-0 . 6.381339
gl-vs-vk-zen_-0 ..... 5.827955

GL-vs-VK 2017-06-05
Test: Static Scene - API: Vulkan - Multi-Threaded: No
Frame Time - ms < Lower Is Better
gl-vs-vk-zen-LTO_-0 . 108.84
gl-vs-vk-zen_-0 ..... 112.11

GL-vs-VK 2017-06-05
Test: Static Scene - API: Vulkan - Multi-Threaded: No
FPS > Higher Is Better
gl-vs-vk-zen-LTO_-0 . 9.187918
gl-vs-vk-zen_-0 ..... 8.919964

GL-vs-VK 2017-06-05
Test: Static Scene - API: OpenGL - Multi-Threaded: Yes
Frame Time - ms < Lower Is Better
gl-vs-vk-zen-LTO_-0 . 102.76
gl-vs-vk-zen_-0 ..... 107.34

GL-vs-VK 2017-06-05
Test: Static Scene - API: OpenGL - Multi-Threaded: Yes
FPS > Higher Is Better
gl-vs-vk-zen-LTO_-0 . 9.731480
gl-vs-vk-zen_-0 ..... 9.316619

GL-vs-VK 2017-06-05
Test: Static Scene - API: Vulkan - Multi-Threaded: Yes
Frame Time - ms < Lower Is Better
gl-vs-vk-zen-LTO_-0 . 40.47
gl-vs-vk-zen_-0 ..... 41.63

GL-vs-VK 2017-06-05
Test: Static Scene - API: Vulkan - Multi-Threaded: Yes
FPS > Higher Is Better
gl-vs-vk-zen-LTO_-0 . 24.71
gl-vs-vk-zen_-0 ..... 24.02

GL-vs-VK 2017-06-05
Test: Shadow Mapping - API: OpenGL - Multi-Threaded: No
Frame Time - ms < Lower Is Better
gl-vs-vk-zen-LTO_-0 . 6.757658
gl-vs-vk-zen_-0 ..... 8.352796

GL-vs-VK 2017-06-05
Test: Shadow Mapping - API: OpenGL - Multi-Threaded: No
FPS > Higher Is Better
gl-vs-vk-zen-LTO_-0 . 147.98
gl-vs-vk-zen_-0 ..... 119.72

GL-vs-VK 2017-06-05
Test: Shadow Mapping - API: Vulkan - Multi-Threaded: No
Frame Time - ms < Lower Is Better
gl-vs-vk-zen-LTO_-0 . 4.987357
gl-vs-vk-zen_-0 ..... 5.029075

GL-vs-VK 2017-06-05
Test: Shadow Mapping - API: Vulkan - Multi-Threaded: No
FPS > Higher Is Better
gl-vs-vk-zen-LTO_-0 . 200.51
gl-vs-vk-zen_-0 ..... 197.91

GL-vs-VK 2017-06-05
Test: Shadow Mapping - API: Vulkan - Multi-Threaded: Yes
Frame Time - ms < Lower Is Better
gl-vs-vk-zen-LTO_-0 . 4.797430
gl-vs-vk-zen_-0 ..... 4.820551

GL-vs-VK 2017-06-05
Test: Shadow Mapping - API: Vulkan - Multi-Threaded: Yes
FPS > Higher Is Better
gl-vs-vk-zen-LTO_-0 . 208.45
gl-vs-vk-zen_-0 ..... 207.45

GL-vs-VK 2017-06-05
Test: Terrain With Dynamic LoD - API: OpenGL - Multi-Threaded: No
Frame Time - ms < Lower Is Better
gl-vs-vk-zen-LTO_-0 . 23.71
gl-vs-vk-zen_-0 ..... 29.72

GL-vs-VK 2017-06-05
Test: Terrain With Dynamic LoD - API: OpenGL - Multi-Threaded: No
FPS > Higher Is Better
gl-vs-vk-zen-LTO_-0 . 42.17
gl-vs-vk-zen_-0 ..... 33.64

GL-vs-VK 2017-06-05
Test: Terrain With Dynamic LoD - API: Vulkan - Multi-Threaded: No
Frame Time - ms < Lower Is Better
gl-vs-vk-zen-LTO_-0 . 28.40
gl-vs-vk-zen_-0 ..... 30.33

GL-vs-VK 2017-06-05
Test: Terrain With Dynamic LoD - API: Vulkan - Multi-Threaded: No
FPS > Higher Is Better
gl-vs-vk-zen-LTO_-0 . 35.21
gl-vs-vk-zen_-0 ..... 32.97

GL-vs-VK 2017-06-05
Test: Terrain With Dynamic LoD - API: Vulkan - Multi-Threaded: Yes
Frame Time - ms < Lower Is Better
gl-vs-vk-zen-LTO_-0 . 20.50
gl-vs-vk-zen_-0 ..... 21.04

GL-vs-VK 2017-06-05
Test: Terrain With Dynamic LoD - API: Vulkan - Multi-Threaded: Yes
FPS > Higher Is Better
gl-vs-vk-zen-LTO_-0 . 48.79
gl-vs-vk-zen_-0 ..... 47.57
```

I don't know what's going on with a single threaded OpenGL but I could tell the difference even without that bench... The results for a non-Zen kernel are even worse

telans commented 4 years ago

> run make mrproper and move the config back. After rebuilding the kernel a new Module.symvers will be generated

Thanks! I believe that's what fixed it. Currently booted now with no issues.

jiblime commented 4 years ago

Test parameters:
GCC 10.2.0
binutils 2.35
glibc 2.30-r9
Ran in tty
CONFIG_GENERIC_CPU=y (-mtune=generic, -march not set)
LTO patch, genpatches 1000-4567 (none of which affect performance)

Notes about testing:
Phoronix Test Suite v9.8.0 (Nesodden)
pts/stress-ng requires sys-libs/libapparmor to link -lapparmor for tests
7-8% deviation on CPU Cache for both results, removed in later testing


Kernel 5.8.0, both -O2
Baseline: no-lto
Result: lto

Size comparisons, using lz4 compression
no-LTO -O2: 15939744
LTO -O2: 15464160
~3% size decrease
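The size figure can be re-derived from the two byte counts. A small sketch, assuming the percentage is taken relative to the no-LTO image; `pct_change` is a hypothetical helper name.

```shell
# pct_change OLD NEW: percentage change from OLD to NEW, one decimal place.
pct_change() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - old) / old * 100 }'
}

pct_change 15939744 15464160   # no-LTO -O2 -> LTO -O2, prints -3.0
```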

Results to baseline

| Test      | Configuration   | Relative |
| --------- | --------------- | -------- |
| hackbench | 1 - Thread      | 1.032    |
| hackbench | 2 - Thread      | 1.051    |
| hackbench | 1 - Process     | 1.029    |
| hackbench | 16 - Thread     | 1.052    |
| hackbench | 2 - Process     | 1.042    |
| hackbench | 16 - Process    | 1.024    |
| stress-ng | Crypto          | 0.998    |
| stress-ng | Malloc          | 0.997    |
| stress-ng | Forking         | 1.028    |
| stress-ng | CPU Cache       | 1.024    |
| stress-ng | Vector Math     | 0.998    |
| stress-ng | Memory Copying  | 1.002    |
| stress-ng | Socket Activity | 1.016    |
Details for -O2 vs. -O2 LTO
```
Hackbench
Count: 1 - Type: Thread
Seconds < Lower Is Better
lto o2 generic .... 4.569
no lto o2 generic . 4.717

Hackbench
Count: 2 - Type: Thread
Seconds < Lower Is Better
lto o2 generic .... 6.047
no lto o2 generic . 6.357

Hackbench
Count: 1 - Type: Process
Seconds < Lower Is Better
lto o2 generic .... 4.352
no lto o2 generic . 4.477

Hackbench
Count: 16 - Type: Thread
Seconds < Lower Is Better
lto o2 generic .... 43.48
no lto o2 generic . 45.73

Hackbench
Count: 2 - Type: Process
Seconds < Lower Is Better
lto o2 generic .... 5.741
no lto o2 generic . 5.982

Hackbench
Count: 16 - Type: Process
Seconds < Lower Is Better
lto o2 generic .... 42.57
no lto o2 generic . 43.59

Stress-NG 0.11.07
Test: Crypto
Bogo Ops/s > Higher Is Better
lto o2 generic .... 1574.10
no lto o2 generic . 1576.77

Stress-NG 0.11.07
Test: Malloc
Bogo Ops/s > Higher Is Better
lto o2 generic .... 58909034.94
no lto o2 generic . 59083725.81

Stress-NG 0.11.07
Test: Forking
Bogo Ops/s > Higher Is Better
lto o2 generic .... 28953.09
no lto o2 generic . 28167.74

Stress-NG 0.11.07
Test: CPU Cache
Bogo Ops/s > Higher Is Better
lto o2 generic .... 28.55
no lto o2 generic . 27.87

Stress-NG 0.11.07
Test: Vector Math
Bogo Ops/s > Higher Is Better
lto o2 generic .... 53981.76
no lto o2 generic . 54099.63

Stress-NG 0.11.07
Test: Memory Copying
Bogo Ops/s > Higher Is Better
lto o2 generic .... 9299.38
no lto o2 generic . 9282.92

Stress-NG 0.11.07
Test: Socket Activity
Bogo Ops/s > Higher Is Better
lto o2 generic .... 7781.26
no lto o2 generic . 7656.45
```

Summary: clear performance benefit to using an LTO kernel
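The Relative column appears to be the ratio of the raw results, oriented so that values above 1 favour the tested kernel. A minimal sketch under that assumption; `rel_lower` and `rel_higher` are hypothetical helper names, and the inputs are taken from the raw results in this comment.

```shell
# rel_lower BASELINE RESULT: for "lower is better" metrics (hackbench seconds).
rel_lower() {
    awk -v b="$1" -v r="$2" 'BEGIN { printf "%.3f\n", b / r }'
}

# rel_higher BASELINE RESULT: for "higher is better" metrics (stress-ng bogo ops/s).
rel_higher() {
    awk -v b="$1" -v r="$2" 'BEGIN { printf "%.3f\n", r / b }'
}

rel_lower 4.717 4.569        # hackbench 1 - Thread, prints 1.032
rel_higher 1576.77 1574.10   # stress-ng Crypto, prints 0.998
```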


Kernel 5.8.0, both LTO
Baseline: LTO -O2
Result: LTO -O3 -fno-inline-functions

Note: ~3% deviation on memory copying, unsure why

Results to baseline

| Test      | Configuration   | Relative |
| --------- | --------------- | -------- |
| hackbench | 1 - Thread      | 0.984    |
| hackbench | 2 - Thread      | 0.976    |
| hackbench | 1 - Process     | 0.98     |
| hackbench | 16 - Thread     | 0.969    |
| hackbench | 2 - Process     | 0.967    |
| hackbench | 16 - Process    | 0.976    |
| stress-ng | Crypto          | 0.999    |
| stress-ng | Malloc          | 1.02     |
| stress-ng | Forking         | 0.97     |
| stress-ng | Vector Math     | 1        |
| stress-ng | Memory Copying  | 1.045    |
| stress-ng | Socket Activity | 0.981    |
Results to baseline: no-lto -O2

| Test      | Configuration   | Relative |
| --------- | --------------- | -------- |
| hackbench | 1 - Thread      | 1.016    |
| hackbench | 2 - Thread      | 1.026    |
| hackbench | 1 - Process     | 1.008    |
| hackbench | 16 - Thread     | 1.019    |
| hackbench | 2 - Process     | 1.008    |
| hackbench | 16 - Process    | 0.999    |
| stress-ng | Crypto          | 0.997    |
| stress-ng | Malloc          | 1.017    |
| stress-ng | Forking         | 0.997    |
| stress-ng | Vector Math     | 0.997    |
| stress-ng | Memory Copying  | 1.047    |
| stress-ng | Socket Activity | 0.997    |
Details for LTO -O2 vs. LTO -O3 -fno-inline-functions
```
Hackbench
Count: 1 - Type: Thread
Seconds < Lower Is Better
lto o2 generic ............ 4.569
O3LTO-fnoinline-functions . 4.645

Hackbench
Count: 2 - Type: Thread
Seconds < Lower Is Better
lto o2 generic ............ 6.047
O3LTO-fnoinline-functions . 6.197

Hackbench
Count: 1 - Type: Process
Seconds < Lower Is Better
lto o2 generic ............ 4.352
O3LTO-fnoinline-functions . 4.443

Hackbench
Count: 16 - Type: Thread
Seconds < Lower Is Better
lto o2 generic ............ 43.48
O3LTO-fnoinline-functions . 44.89

Hackbench
Count: 2 - Type: Process
Seconds < Lower Is Better
lto o2 generic ............ 5.741
O3LTO-fnoinline-functions . 5.934

Hackbench
Count: 16 - Type: Process
Seconds < Lower Is Better
lto o2 generic ............ 42.57
O3LTO-fnoinline-functions . 43.63

Stress-NG 0.11.07
Test: Crypto
Bogo Ops/s > Higher Is Better
lto o2 generic ............ 1574.10
O3LTO-fnoinline-functions . 1572.03

Stress-NG 0.11.07
Test: Malloc
Bogo Ops/s > Higher Is Better
lto o2 generic ............ 58909034.94
O3LTO-fnoinline-functions . 60113960.06

Stress-NG 0.11.07
Test: Forking
Bogo Ops/s > Higher Is Better
lto o2 generic ............ 28953.09
O3LTO-fnoinline-functions . 28078.41

Stress-NG 0.11.07
Test: Vector Math
Bogo Ops/s > Higher Is Better
lto o2 generic ............ 53981.76
O3LTO-fnoinline-functions . 53960.10

Stress-NG 0.11.07
Test: Memory Copying
Bogo Ops/s > Higher Is Better
lto o2 generic ............ 9299.38
O3LTO-fnoinline-functions . 9716.74

Stress-NG 0.11.07
Test: Socket Activity
Bogo Ops/s > Higher Is Better
lto o2 generic ............ 7781.26
O3LTO-fnoinline-functions . 7632.56
```
Details for -O2, LTO -O2, and LTO -O3 -fno-inline-functions
```
Hackbench
Count: 1 - Type: Thread
Seconds < Lower Is Better
no lto o2 generic ......... 4.717
lto o2 generic ............ 4.569
O3LTO-fnoinline-functions . 4.645

Hackbench
Count: 2 - Type: Thread
Seconds < Lower Is Better
no lto o2 generic ......... 6.357
lto o2 generic ............ 6.047
O3LTO-fnoinline-functions . 6.197

Hackbench
Count: 1 - Type: Process
Seconds < Lower Is Better
no lto o2 generic ......... 4.477
lto o2 generic ............ 4.352
O3LTO-fnoinline-functions . 4.443

Hackbench
Count: 16 - Type: Thread
Seconds < Lower Is Better
no lto o2 generic ......... 45.73
lto o2 generic ............ 43.48
O3LTO-fnoinline-functions . 44.89

Hackbench
Count: 2 - Type: Process
Seconds < Lower Is Better
no lto o2 generic ......... 5.982
lto o2 generic ............ 5.741
O3LTO-fnoinline-functions . 5.934

Hackbench
Count: 16 - Type: Process
Seconds < Lower Is Better
no lto o2 generic ......... 43.59
lto o2 generic ............ 42.57
O3LTO-fnoinline-functions . 43.63

Stress-NG 0.11.07
Test: Crypto
Bogo Ops/s > Higher Is Better
no lto o2 generic ......... 1576.77
lto o2 generic ............ 1574.10
O3LTO-fnoinline-functions . 1572.03

Stress-NG 0.11.07
Test: Malloc
Bogo Ops/s > Higher Is Better
no lto o2 generic ......... 59083725.81
lto o2 generic ............ 58909034.94
O3LTO-fnoinline-functions . 60113960.06

Stress-NG 0.11.07
Test: Forking
Bogo Ops/s > Higher Is Better
no lto o2 generic ......... 28167.74
lto o2 generic ............ 28953.09
O3LTO-fnoinline-functions . 28078.41

Stress-NG 0.11.07
Test: Vector Math
Bogo Ops/s > Higher Is Better
no lto o2 generic ......... 54099.63
lto o2 generic ............ 53981.76
O3LTO-fnoinline-functions . 53960.10

Stress-NG 0.11.07
Test: Memory Copying
Bogo Ops/s > Higher Is Better
no lto o2 generic ......... 9282.92
lto o2 generic ............ 9299.38
O3LTO-fnoinline-functions . 9716.74

Stress-NG 0.11.07
Test: Socket Activity
Bogo Ops/s > Higher Is Better
no lto o2 generic ......... 7656.45
lto o2 generic ............ 7781.26
O3LTO-fnoinline-functions . 7632.56
```

Summary: LTO -O3 with -fno-inline-functions performs worse than LTO -O2, and performs better than or the same as no-LTO -O2


Kernel 5.8.0, both LTO
Baseline: LTO -O2
Result: LTO -O3

Size comparisons, using lz4 compression
no-LTO -O2: 15939744
LTO -O2: 15464160
LTO -O3: 16665088
~4.6% size increase against no-LTO -O2, ~7.8% increase compared against LTO -O2

Results to baseline

| Test      | Configuration   | Relative |
| --------- | --------------- | -------- |
| hackbench | 1 - Thread      | 0.99     |
| hackbench | 2 - Thread      | 1.022    |
| hackbench | 1 - Process     | 0.989    |
| hackbench | 16 - Thread     | 0.988    |
| hackbench | 2 - Process     | 1.01     |
| hackbench | 16 - Process    | 1.003    |
| stress-ng | Crypto          | 0.998    |
| stress-ng | Malloc          | 1.017    |
| stress-ng | Forking         | 0.98     |
| stress-ng | Vector Math     | 0.996    |
| stress-ng | Memory Copying  | 1.005    |
| stress-ng | Socket Activity | 1.009    |
Results to baseline: LTO -O3 -fno-inline-functions

| Test      | Configuration   | Relative |
| --------- | --------------- | -------- |
| hackbench | 1 - Thread      | 1.007    |
| hackbench | 2 - Thread      | 1.047    |
| hackbench | 1 - Process     | 1.01     |
| hackbench | 16 - Thread     | 1.02     |
| hackbench | 2 - Process     | 1.044    |
| hackbench | 16 - Process    | 1.028    |
| stress-ng | Crypto          | 0.999    |
| stress-ng | Malloc          | 0.996    |
| stress-ng | Forking         | 1.01     |
| stress-ng | Vector Math     | 0.997    |
| stress-ng | Memory Copying  | 0.962    |
| stress-ng | Socket Activity | 1.028    |
Details for LTO -O2 vs. LTO -O3
```
Hackbench
Count: 1 - Type: Thread
Seconds < Lower Is Better
O3lto-inline-functions . 4.614
lto o2 generic ......... 4.569

Hackbench
Count: 2 - Type: Thread
Seconds < Lower Is Better
O3lto-inline-functions . 5.918
lto o2 generic ......... 6.047

Hackbench
Count: 1 - Type: Process
Seconds < Lower Is Better
O3lto-inline-functions . 4.399
lto o2 generic ......... 4.352

Hackbench
Count: 16 - Type: Thread
Seconds < Lower Is Better
O3lto-inline-functions . 44.00
lto o2 generic ......... 43.48

Hackbench
Count: 2 - Type: Process
Seconds < Lower Is Better
O3lto-inline-functions . 5.685
lto o2 generic ......... 5.741

Hackbench
Count: 16 - Type: Process
Seconds < Lower Is Better
O3lto-inline-functions . 42.42
lto o2 generic ......... 42.57

Stress-NG 0.11.07
Test: Crypto
Bogo Ops/s > Higher Is Better
O3lto-inline-functions . 1570.90
lto o2 generic ......... 1574.10

Stress-NG 0.11.07
Test: Malloc
Bogo Ops/s > Higher Is Better
O3lto-inline-functions . 59900927.06
lto o2 generic ......... 58909034.94

Stress-NG 0.11.07
Test: Forking
Bogo Ops/s > Higher Is Better
O3lto-inline-functions . 28364.34
lto o2 generic ......... 28953.09

Stress-NG 0.11.07
Test: Vector Math
Bogo Ops/s > Higher Is Better
O3lto-inline-functions . 53790.84
lto o2 generic ......... 53981.76

Stress-NG 0.11.07
Test: Memory Copying
Bogo Ops/s > Higher Is Better
O3lto-inline-functions . 9342.95
lto o2 generic ......... 9299.38

Stress-NG 0.11.07
Test: Socket Activity
Bogo Ops/s > Higher Is Better
O3lto-inline-functions . 7849.58
lto o2 generic ......... 7781.26
```
Details for no-LTO -O2, LTO -O2, LTO -O3 -fno-inline-functions, LTO -O3

```
Hackbench - Count: 1 - Type: Thread
Seconds < Lower Is Better
O3lto-inline-functions .... 4.614
lto o2 generic ............ 4.569
no lto o2 generic ......... 4.717
O3LTO-fnoinline-functions . 4.645

Hackbench - Count: 2 - Type: Thread
Seconds < Lower Is Better
O3lto-inline-functions .... 5.918
lto o2 generic ............ 6.047
no lto o2 generic ......... 6.357
O3LTO-fnoinline-functions . 6.197

Hackbench - Count: 1 - Type: Process
Seconds < Lower Is Better
O3lto-inline-functions .... 4.399
lto o2 generic ............ 4.352
no lto o2 generic ......... 4.477
O3LTO-fnoinline-functions . 4.443

Hackbench - Count: 16 - Type: Thread
Seconds < Lower Is Better
O3lto-inline-functions .... 44.00
lto o2 generic ............ 43.48
no lto o2 generic ......... 45.73
O3LTO-fnoinline-functions . 44.89

Hackbench - Count: 2 - Type: Process
Seconds < Lower Is Better
O3lto-inline-functions .... 5.685
lto o2 generic ............ 5.741
no lto o2 generic ......... 5.982
O3LTO-fnoinline-functions . 5.934

Hackbench - Count: 16 - Type: Process
Seconds < Lower Is Better
O3lto-inline-functions .... 42.42
lto o2 generic ............ 42.57
no lto o2 generic ......... 43.59
O3LTO-fnoinline-functions . 43.63

Stress-NG 0.11.07 - Test: Crypto
Bogo Ops/s > Higher Is Better
O3lto-inline-functions .... 1570.90
lto o2 generic ............ 1574.10
no lto o2 generic ......... 1576.77
O3LTO-fnoinline-functions . 1572.03

Stress-NG 0.11.07 - Test: Malloc
Bogo Ops/s > Higher Is Better
O3lto-inline-functions .... 59900927.06
lto o2 generic ............ 58909034.94
no lto o2 generic ......... 59083725.81
O3LTO-fnoinline-functions . 60113960.06

Stress-NG 0.11.07 - Test: Forking
Bogo Ops/s > Higher Is Better
O3lto-inline-functions .... 28364.34
lto o2 generic ............ 28953.09
no lto o2 generic ......... 28167.74
O3LTO-fnoinline-functions . 28078.41

Stress-NG 0.11.07 - Test: Vector Math
Bogo Ops/s > Higher Is Better
O3lto-inline-functions .... 53790.84
lto o2 generic ............ 53981.76
no lto o2 generic ......... 54099.63
O3LTO-fnoinline-functions . 53960.10

Stress-NG 0.11.07 - Test: Memory Copying
Bogo Ops/s > Higher Is Better
O3lto-inline-functions .... 9342.95
lto o2 generic ............ 9299.38
no lto o2 generic ......... 9282.92
O3LTO-fnoinline-functions . 9716.74

Stress-NG 0.11.07 - Test: Socket Activity
Bogo Ops/s > Higher Is Better
O3lto-inline-functions .... 7849.58
lto o2 generic ............ 7781.26
no lto o2 generic ......... 7656.45
O3LTO-fnoinline-functions . 7632.56
```

Summary: -fno-inline-functions has an observable impact on performance. LTO -O2 performs marginally better than LTO -O3 overall, and is smaller too.
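
To put rough numbers on "marginally", here is a minimal sketch that turns the Hackbench timings above into relative differences of LTO -O3 against LTO -O2. The helper name `pct_change` and the short row labels are made up for illustration; the timings are copied from the results. Negative values mean LTO -O3 finished faster.

```shell
#!/bin/sh
# pct_change OLD NEW: prints (NEW - OLD) / OLD as a percentage, two decimals.
# Hypothetical helper, not part of any benchmark tool.
pct_change() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.2f\n", (new - old) / old * 100 }'
}

# Columns: label, LTO -O2 seconds, LTO -O3 seconds (Hackbench, lower is better).
while read -r name o2 o3; do
    printf '%-11s %s%%\n' "$name" "$(pct_change "$o2" "$o3")"
done <<'EOF'
thread-1    4.569 4.614
thread-2    6.047 5.918
thread-16   43.48 44.00
process-1   4.352 4.399
process-2   5.741 5.685
process-16  42.57 42.42
EOF
```

With single runs per configuration, differences of this size may well sit inside run-to-run noise, so repeating each benchmark several times would be needed before reading much into them.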

It would have been better to run more tests to reduce variation, and also to test the effect of setting -march. Would GCC have been able to take better advantage of my processor cache? I found out recently that if you don't set -march=native, these defaults are used unless the user specifies otherwise: --param=l1-cache-line-size=32 and --param=l1-cache-size=64.
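
One way to see what -march=native would change is to ask the GCC driver which cache parameters it passes on to the compiler proper. The sketch below assumes `gcc` (or whatever `CC` points to) is on the PATH; `extract_cache_params` is a hypothetical helper name, and the output depends on the CPU and the GCC version.

```shell
#!/bin/sh
# Pull l1/l2 cache --param settings out of the driver's verbose output.
# Hypothetical helper: reads text on stdin, prints one "name=value" per line.
extract_cache_params() {
    grep -o 'l[12]-cache-[a-z-]*=[0-9]*'
}

CC="${CC:-gcc}"
if command -v "$CC" >/dev/null 2>&1; then
    # -v makes the driver print the cc1 command line, including the
    # --param values it derived from -march=native for this machine.
    "$CC" -march=native -E -v - </dev/null 2>&1 | extract_cache_params
else
    echo "$CC not found; nothing to report" >&2
fi
```

If the reported values differ from the generic defaults quoted above, rebuilding with -march=native (or passing the --param values explicitly) would be the variable to test next.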

telans commented 4 years ago

I'll run those same benchmarks later; however, perf is currently giving me approximately 20% lower results on an LTO kernel. I'm using KCFLAGS=-fno-inline-functions -fno-inline-functions-called-once --param=large-stack-frame-growth=100 -fdevirtualize-at-ltrans -fgraphite-identity -floop-nest-optimize, but I'll try with default flags.

jiblime commented 4 years ago

After some rebuilding, I've found -fno-inline-functions is enough to stop the warnings when using -O3. I still need to test whether -O3 is faster; with the inlining flags I mentioned, LTO -O3 is actually slower than LTO -O2 😞

barolo commented 4 years ago

Successful building/booting summary:

[ that's what's working for me ]

@jiblime got everything?