arm64, x86_64: why are CONFIG_LTO_CLANG_FULL kernels much bigger?

michaelopdenacker commented 2 years ago

Greetings

I expected the LTO optimized kernels to be smaller because of the elimination of dead code, as described on https://www.llvm.org/docs/LinkTimeOptimization.html

However, the LTO optimized compressed kernels are actually bigger.

On arm64, Linux 5.18-rc7, defconfig configuration, the Image.gz file size is:
using gcc 12: 12149327 bytes
using clang 14 with CONFIG_LTO_NONE=y: 12197028 bytes
using clang 14 with CONFIG_LTO_CLANG_THIN: 12419217 bytes
using clang 14 with CONFIG_LTO_CLANG_FULL: 16097599 bytes

On x86_64, Linux 5.18-rc7, x86_64_defconfig configuration file, the bzImage file size is:
using gcc 12: 10743648 bytes
using clang 14 with CONFIG_LTO_NONE=y: 11011008 bytes
using clang 14 with CONFIG_LTO_CLANG_THIN: 11166720 bytes
using clang 14 with CONFIG_LTO_CLANG_FULL: 13650880 bytes
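To put the question in numbers, the relative growth over the clang LTO_NONE build can be computed directly from the sizes quoted above. This is only a sketch of the arithmetic: the dictionary labels and the helper name `growth_vs_baseline` are made up for illustration, and it assumes the third measurement in each list is the ThinLTO build.

```python
# Image sizes in bytes, copied from the measurements quoted above.
# Assumption: the third entry of each list is the ThinLTO build.
SIZES = {
    "arm64 Image.gz": {
        "gcc 12": 12149327,
        "clang 14, LTO_NONE": 12197028,
        "clang 14, LTO_CLANG_THIN": 12419217,
        "clang 14, LTO_CLANG_FULL": 16097599,
    },
    "x86_64 bzImage": {
        "gcc 12": 10743648,
        "clang 14, LTO_NONE": 11011008,
        "clang 14, LTO_CLANG_THIN": 11166720,
        "clang 14, LTO_CLANG_FULL": 13650880,
    },
}

BASELINE = "clang 14, LTO_NONE"


def growth_vs_baseline(sizes, baseline=BASELINE):
    """Return {config: fractional size change relative to the baseline}."""
    base = sizes[baseline]
    return {name: (size - base) / base for name, size in sizes.items()}


if __name__ == "__main__":
    for image, sizes in SIZES.items():
        print(image)
        for name, delta in growth_vs_baseline(sizes).items():
            print(f"  {name:26s} {delta:+.1%}")
```

With these inputs, ThinLTO comes out under 2% larger than LTO_NONE on both architectures, while full LTO is roughly 32% larger on arm64 and roughly 24% larger on x86_64.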

How would you explain that the LTO_THIN and above all the LTO_FULL optimized kernels are actually bigger than the LTO_NONE ones, at least on the x86_64 and arm64 architectures?

Thanks in advance for your insights. Cheers Michael.

nathanchance commented 2 years ago

Hi Michael! As far as I am aware, the size increase comes from the linker’s ability to inline more aggressively, which is why you see a larger increase with full LTO than with ThinLTO: full LTO is basically like concatenating all object files together.

nickdesaulniers commented 2 years ago

Consider enabling CONFIG_LD_DEAD_CODE_DATA_ELIMINATION on top of LTO, though the resulting image might not necessarily boot (LD_DEAD_CODE_DATA_ELIMINATION has been hit or miss IMO).

michaelopdenacker commented 2 years ago

Thanks for the tip! On arm64, I tried to enable CONFIG_LD_DEAD_CODE_DATA_ELIMINATION (after adding "select HAVE_LD_DEAD_CODE_DATA_ELIMINATION" to arch/arm64/Kconfig) together with CONFIG_LTO_CLANG_FULL, and here's the size I get for Image.gz: 15301033 bytes (vs 16097599 for LTO_CLANG_FULL alone). That's -5%, but that's still bigger than with LTO_NONE.
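A quick check of that arithmetic, as a sketch: the LTO_NONE figure is the arm64 clang 14 number from the first comment, and the helper name `rel_change` is purely illustrative.

```python
# arm64 Image.gz sizes in bytes, taken from the comments above.
LTO_NONE = 12197028       # clang 14, CONFIG_LTO_NONE=y
LTO_FULL = 16097599       # clang 14, CONFIG_LTO_CLANG_FULL
LTO_FULL_DCE = 15301033   # full LTO plus linker dead code elimination


def rel_change(new, old):
    """Fractional change of `new` relative to `old`."""
    return (new - old) / old


saving = rel_change(LTO_FULL_DCE, LTO_FULL)   # effect of adding DCE
gap = rel_change(LTO_FULL_DCE, LTO_NONE)      # what is left vs. no LTO

print(f"DCE on top of full LTO: {saving:+.1%}")
print(f"still vs. LTO_NONE:     {gap:+.1%}")
```

So dead code elimination recovers about 5% of the full-LTO image, but the result is still roughly a quarter larger than the LTO_NONE build.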

michaelopdenacker commented 2 years ago

> Hi Michael! As far as I am aware, the size increase comes from the linker’s ability to inline more aggressively, which is why you see a larger increase with full LTO than with ThinLTO: full LTO is basically like concatenating all object files together.

Hi Nathan. It makes sense, thanks!

nickdesaulniers commented 2 years ago

> That's -5%, but that's still bigger than with LTO_NONE.

Some other thoughts: LTO enables -ffunction-sections and -fdata-sections (so does CONFIG_LD_DEAD_CODE_DATA_ELIMINATION, IIRC). I believe this wastes some space because of ELF section alignment requirements.

Also, even if many kernel interfaces are never actually used at runtime, if an interface is exported, the linker can't know that a module will never need that symbol at runtime. If symbols are rooted (referenced) by other symbols that are exported, the code does not appear dead to the linker. (Though I wonder about --gc-keep-exported, since we don't set that.)
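The rooting argument can be sketched as a plain reachability pass over a symbol graph. This is a toy model, not how any real linker is implemented; every symbol name below is hypothetical, and `live_symbols` is an illustrative helper.

```python
from collections import deque


def live_symbols(refs, roots):
    """Mark every symbol reachable from the roots (breadth-first walk)."""
    live = set(roots)
    queue = deque(roots)
    while queue:
        sym = queue.popleft()
        for target in refs.get(sym, ()):
            if target not in live:
                live.add(target)
                queue.append(target)
    return live


# Hypothetical symbol graph: symbol -> symbols it references.
REFS = {
    "start_kernel": ["sched_init"],
    "sched_init": [],
    "exported_api": ["api_helper"],  # exported, but has no in-tree caller
    "api_helper": [],
    "orphan": [],
}
ENTRY = {"start_kernel"}
EXPORTED = {"exported_api"}

all_syms = set(REFS)
# Exported symbols treated as roots: only truly unreferenced code is dead.
dead_keeping_exports = all_syms - live_symbols(REFS, ENTRY | EXPORTED)
# Exported symbols not treated as roots: their helpers become dead too.
dead_ignoring_exports = all_syms - live_symbols(REFS, ENTRY)

print(sorted(dead_keeping_exports))
print(sorted(dead_ignoring_exports))
```

In this model, `api_helper` survives only because the exported `exported_api` references it, which is the situation described above: nothing in the image calls it, yet the linker cannot discard it.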

There's probably an -Rpass-missed= flag that can be passed to aid in debugging. IDK if the linker has a corresponding flag to debug, probably --print-gc-sections?

Also

Makefile:
KBUILD_LDFLAGS += -mllvm -import-instr-limit=5

That 5 is a heuristic that can be tuned to affect how frequently inlining is performed. From

llvm/lib/Transforms/IPO/FunctionImport.cpp
  /// Limit on instruction count of imported functions.
  static cl::opt<unsigned> ImportInstrLimit(
      "import-instr-limit", cl::init(100), cl::Hidden, cl::value_desc("N"),
      cl::desc("Only import functions with less than N instructions"));

Note that these are LLVM IR instructions, not machine instructions.
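As a rough illustration of what that cutoff means, here is a toy filter that models only the "less than N instructions" rule from the option's description. The function names and IR instruction counts are invented, and the real importer applies further heuristics that are ignored here.

```python
def import_candidates(ir_sizes, limit):
    """Names of functions with fewer than `limit` IR instructions."""
    return sorted(name for name, count in ir_sizes.items() if count < limit)


# Hypothetical functions and their LLVM IR instruction counts.
IR_SIZES = {
    "tiny_getter": 3,
    "small_helper": 12,
    "medium_fn": 60,
    "big_fn": 400,
}

# 5 is the value set in the kernel Makefile; 100 is LLVM's default.
for limit in (5, 100):
    print(limit, import_candidates(IR_SIZES, limit))
```

With the kernel's limit of 5, only the smallest function in this made-up set qualifies for import, while LLVM's default of 100 would admit three of the four. Raising the limit therefore gives the inliner more cross-module candidates, at the likely cost of a larger image.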

nickdesaulniers commented 2 years ago

> Also, even if many kernel interfaces are actually unused ever during runtime, if the interface is exported, the linker can't know that a module will never need such a symbol at runtime.

Ah, that's what CONFIG_TRIM_UNUSED_KSYMS is for.

BTW @michaelopdenacker nice talk! I've added a link to it from our wiki.

ClangBuiltLinux / linux

arm64, x86_64: why are CONFIG_LTO_CLANG_FULL kernels much bigger? #1643