ClangBuiltLinux / linux

Linux kernel source tree
Other
241 stars 14 forks source link

upstream PGO #1074

Open nickdesaulniers opened 4 years ago

nickdesaulniers commented 4 years ago

@samitolvanen did some work on intrumented PGO in https://android-review.googlesource.com/q/topic:android-4.14-pgo. We're probably not going to ship this in Android, but it might be interesting to submit this upstream so that others can play with it. Also, @gwelymernans did some work with this series, too.

nickdesaulniers commented 3 years ago

Reopening: by "upstream" I meant into the mainline Linux kernel proper.

bwendling commented 3 years ago

I wanted to get local feedback first before I send the patch upstream.

nickdesaulniers commented 3 years ago

v8: https://lore.kernel.org/lkml/20210226222030.3718075-1-morbo@google.com/ https://www.phoronix.com/scan.php?page=news_item&px=Clang-PGO-Linux-Kernel

nickdesaulniers commented 3 years ago

cc @kees (friendly ping)

kees commented 3 years ago

Yup, the plan is to put this in -next after -rc2 is out.

nickdesaulniers commented 3 years ago

accepted: https://git.kernel.org/pub/scm/linux/kernel/git/kees/linux.git/commit/?h=for-next/clang/pgo&id=e1af496cbe9b4517428601a4e44fee3602dd3c15

MaskRay commented 3 years ago

@kees The two symbols can be deleted.

--- i/include/asm-generic/vmlinux.lds.h
+++ w/include/asm-generic/vmlinux.lds.h
@@ -332,7 +332,6 @@
 #ifdef CONFIG_PGO_CLANG
 #define PGO_CLANG_DATA                                                 \
        __llvm_prf_data : AT(ADDR(__llvm_prf_data) - LOAD_OFFSET) {     \
-               __llvm_prf_start = .;                                   \
                __llvm_prf_data_start = .;                              \
                *(__llvm_prf_data)                                      \
                __llvm_prf_data_end = .;                                \
@@ -356,7 +355,6 @@
                __llvm_prf_vnds_start = .;                              \
                *(__llvm_prf_vnds)                                      \
                __llvm_prf_vnds_end = .;                                \
-               __llvm_prf_end = .;                                     \
        }
 #else
 #define PGO_CLANG_DATA
kees commented 3 years ago

@kees The two symbols can be deleted.

--- i/include/asm-generic/vmlinux.lds.h
+++ w/include/asm-generic/vmlinux.lds.h
@@ -332,7 +332,6 @@
 #ifdef CONFIG_PGO_CLANG
 #define PGO_CLANG_DATA                                                 \
        __llvm_prf_data : AT(ADDR(__llvm_prf_data) - LOAD_OFFSET) {     \
-               __llvm_prf_start = .;                                   \
                __llvm_prf_data_start = .;                              \
                *(__llvm_prf_data)                                      \
                __llvm_prf_data_end = .;                                \
@@ -356,7 +355,6 @@
                __llvm_prf_vnds_start = .;                              \
                *(__llvm_prf_vnds)                                      \
                __llvm_prf_vnds_end = .;                                \
-               __llvm_prf_end = .;                                     \
        }
 #else
 #define PGO_CLANG_DATA

Thanks! I've squashed this into the patch now.

nickdesaulniers commented 3 years ago

upstream thread: https://lore.kernel.org/lkml/CAHk-=whqCT0BeqBQhW8D-YoLLgp_eFY=8Y=9ieREM5xx0ef08w@mail.gmail.com/

nickdesaulniers commented 3 years ago

Feedback from Linus was mostly around the kernel/userspace interface. Linus wasn't too happy about the sysfs nodes, and custom binary blob that will change over LLVM revisions.

Some other points in favor of PGO that weren't mentioned on the thread upstream:

  1. We can't use LBR from virtualized guests. With more developers moving to the cloud, this gets annoying trying to profile kernels unless they can be run on metal. It seems like there are patches for this, but it's probably also up to sys admins whether they expose these to guests. I think Linus prefers for hardware vendors to sort out LBR (or equivalent).
  2. Profiling boot. Even if init (pid 0) started perf, I don't think we could profile .init code in the kernel. It would be nice to boot faster.

I'm going to mark this patches rejected, but I'm not sure I'm ready to close this one out quite yet. Perhaps let's focus on AutoFDO in #1425 , then revisit this with some numbers. I'm sure kernel devs are interested to know if PGO or AutoFDO improve their kernel compile times, for example.

bwendling commented 1 year ago

I have an issue sent in to autofdo regarding this patch: https://github.com/google/autofdo/issues/144. It seems like it's not able to process the perf file I generated...

bwendling commented 1 year ago

We need to add this patch to the PGO patch set when we re-try upstreaming:

From a4f1db0d2415d2839f09799092007edc42fd8944 Mon Sep 17 00:00:00 2001
From: Nick Desaulniers <ndesaulniers@google.com>
Date: Mon, 19 Sep 2022 15:01:52 -0700
Subject: [PATCH] x86: Makefile: disable PGO for alternatives.o

Upon upgrading the toolchain, PGO instrumented training builds failed to
boot on some machines. The call to boot_cpu_has() in text_poke_early()
looks large, but the call to __builtin_constant_p() folds away at
compile time.  LLVM is now more aggressive about inlining
text_poke_early into call sites. Forcing the
boot_cpu_has(X86_FEATURE_NX) && is_module_text_address(...)
check in text_poke_early() also allows the affected machines to boot.

For now, disable PGO instrumentation in the whole translation unit. A
more fine grain fix in the future may be to mark text_poke_early()
noinline, or further runtime debugging of the condition in
text_poke_early() on the specific machines.

Link: https://reviews.llvm.org/D111272
Tested: Boot test on affected machine
Effort: core/clang
Google-Bug-Id: 233798336
Signed-off-by: Nick Desaulniers <ndesaulniers@google.com>
Change-Id: I6926c45b7d8b4f47b9fe2c7799d190ad11ad6f8b
---

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 1ba19cc..c72a681 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -162,4 +162,5 @@
 endif
 obj-$(CONFIG_STUN) += stun.o

-PGO_PROFILE_head64.o := n
+PGO_PROFILE_head64.o       := n
+PGO_PROFILE_alternative.o  := n
nickdesaulniers commented 1 year ago

https://protools19.github.io/slides/Eranian_KeynoteSC19.pdf has some nice slides demonstrating PGO vs AutoFDO. Might be nice to reuse in certain presenations.

ms178 commented 8 months ago

@bwendling @nickdesaulniers I was using a re-based Clang-PGO patchset for the Kernel from the CachyOS project for some time on my Haswell box. Unfortunately it seems to be broken with Kernel 6.7 (with a recent Clang-18 snapshot), the instrumented Kernel doesn't boot, which prevents me from getting fresh profiles.

As I've noticed quite a lot of benefits using a PGOed Kernel in compile times, I'd kindly ask about the status of the PGO effort in the Kernel in general. I could imagine that PGO + BOLT could yield even more benefits. Are there any plans you can share publicly?

bwendling commented 8 months ago

@ms178 It's on my list of things to do, but at a lower priority. What we're probably going to focus on though is an "AutoFDO" type of implementation, where we take a perf trace of the kernel and use that as our profile. It has the benefit of not requiring the kernel to carry around Clang-specific code that needs to be co-maintained with a Clang release.

Back to your issue, does the new kernel still work with Clang-17?

ms178 commented 8 months ago

@bwendling To my honest surprise, Clang-17 seems to work just fine with 6.7-rc8 and the Clang-PGO patch. I was able to boot the instrumented Kernel and even got a FullLTO build with a fresh profile to work (using Clang-17). FullLTO did lead to a non-booting Kernel before when using a fresh profile, however it would begin to work when the profile got older. ThinLTO on the other hand never showed problems with fresh profiles.

Update: Today, I've tested the combination of using the Clang-17 profile with a newer Snapshot of Clang-18 (+FullLTO) and that yielded a working Kernel, too.

ms178 commented 1 month ago

There has been some progress with the AutoFDO approach recently, below are some links for anyone reading this thread:

Kernel patches for the AutoFDO approach: https://lore.kernel.org/lkml/20240728203001.2551083-1-xur@google.com/T/#r7e3b216deee5e9108a8489edf242d65b51f645d5

See also the discussion with more details on the LLVM discourse page: https://discourse.llvm.org/t/optimizing-the-linux-kernel-with-autofdo-including-thinlto-and-propeller/79108/10

Are there plans to upstream AutoFDO and Propeller support into LLVM eventually? I am also thinking of distributions like CachyOS that might want to evalutate this approach. Having AutoFDO and Propeller out-of-tree might be a hurdle for broader adoption.