Open E5ten opened 4 years ago
Steps to reproduce (from #986):
$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ curl -LSs https://gist.github.com/nathanchance/171b7d672e311b56b4329821b8a43acd/raw/9a1dbb1f11552d0b6efec48ac29505dd0c768d1b/20200401_jpoimboe_objtool_fixes.mbx | git apply -3v
$ curl -LSs https://lore.kernel.org/lkml/20200325231250.99205-1-ndesaulniers@google.com/raw | git apply -3v
$ ./scripts/config --file arch/x86/configs/x86_64_defconfig -e FUNCTION_TRACER
$ make -j$(nproc) -s LLVM=1 LLVM_IAS=1 O=out/x86_64 distclean defconfig bzImage
I did an integrated-as build and specifically added CFLAGS_
I assume something like this also needs to be done for recordmcount to fix this? https://lore.kernel.org/lkml/9a9cae7fcf628843aabe5a086b1a3c5bf50f42e8.1585761021.git.jpoimboe@redhat.com/
Just to clarify: you are using LLVM_IAS=1 together with LLVM=1 here?
yeah.
@E5ten I switched over to using LLVM_IAS=1 together with LLVM=1.
I also ran into this with LLVM_IAS=1 when building x86_64 defconfig with dynamic ftrace. Testing Peter's objtool mcount patch, I noticed that objtool segfaults for several object files because the files are missing STT_SECTION symbols for some of the sections.
A random example, compiled with LLVM_IAS=1:
$ readelf --sections arch/x86/mm/hugetlbpage.o | grep PROGBITS
[ 2] .text PROGBITS 0000000000000000 00000240
[ 4] .altinstructions PROGBITS 0000000000000000 000007c8
[ 6] .altinstr_re[...] PROGBITS 0000000000000000 00000890
[ 8] .altinstr_aux PROGBITS 0000000000000000 000008d0
[10] .init.text PROGBITS 0000000000000000 00000988
...
$ readelf --symbols arch/x86/mm/hugetlbpage.o | grep SECTION
3: 0000000000000000 0 SECTION LOCAL DEFAULT 2
4: 0000000000000000 0 SECTION LOCAL DEFAULT 6
5: 0000000000000000 0 SECTION LOCAL DEFAULT 8
Objtool fails here because .init.text doesn't have a corresponding STT_SECTION symbol. Without IAS, the symbol is generated:
$ readelf --sections arch/x86/mm/hugetlbpage.o | grep PROGBITS
[ 1] .text PROGBITS 0000000000000000 00000040
[ 3] .data PROGBITS 0000000000000000 000005c8
[ 5] .altinstructions PROGBITS 0000000000000000 000005c8
[ 7] .altinstr_re[...] PROGBITS 0000000000000000 00000690
[ 9] .altinstr_aux PROGBITS 0000000000000000 000006d0
[11] .init.text PROGBITS 0000000000000000 00000788
...
$ readelf --symbols arch/x86/mm/hugetlbpage.o | grep SECTION
2: 0000000000000000 0 SECTION LOCAL DEFAULT 1
3: 0000000000000000 0 SECTION LOCAL DEFAULT 3
4: 0000000000000000 0 SECTION LOCAL DEFAULT 4
5: 0000000000000000 0 SECTION LOCAL DEFAULT 5
6: 0000000000000000 0 SECTION LOCAL DEFAULT 7
7: 0000000000000000 0 SECTION LOCAL DEFAULT 9
9: 0000000000000000 0 SECTION LOCAL DEFAULT 11
...
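To make the comparison above concrete, here is a minimal sketch of checking whether a section has an STT_SECTION symbol. This is not objtool's actual code: has_section_symbol() and the ias_syms table are made-up names for illustration, the table only loosely mirrors the section indices from the LLVM_IAS=1 readelf output, and extended section indices (SHN_XINDEX) are ignored.

```c
#include <elf.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if the symbol table contains an STT_SECTION symbol that
 * refers to section index shndx. Assumes plain 16-bit section indices,
 * i.e. no SHN_XINDEX handling.
 */
bool has_section_symbol(const Elf64_Sym *syms, size_t nsyms, unsigned int shndx)
{
	for (size_t i = 0; i < nsyms; i++) {
		if (ELF64_ST_TYPE(syms[i].st_info) == STT_SECTION &&
		    syms[i].st_shndx == shndx)
			return true;
	}
	return false;
}

/*
 * Hypothetical symbol table modelled on the LLVM_IAS=1 output: section
 * symbols exist only for sections 2, 6 and 8, so section 10 (.init.text)
 * has none.
 */
static const Elf64_Sym ias_syms[] = {
	{ .st_info = ELF64_ST_INFO(STB_LOCAL, STT_NOTYPE),  .st_shndx = SHN_UNDEF },
	{ .st_info = ELF64_ST_INFO(STB_LOCAL, STT_SECTION), .st_shndx = 2 },
	{ .st_info = ELF64_ST_INFO(STB_LOCAL, STT_SECTION), .st_shndx = 6 },
	{ .st_info = ELF64_ST_INFO(STB_LOCAL, STT_SECTION), .st_shndx = 8 },
};

#define IAS_NSYMS (sizeof(ias_syms) / sizeof(ias_syms[0]))
```

With this table, has_section_symbol(ias_syms, IAS_NSYMS, 2) is true while has_section_symbol(ias_syms, IAS_NSYMS, 10) is false. A tool that assumes the lookup always succeeds would end up dereferencing a missing symbol for .init.text.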
Edit: OK, my issue looks similar to issue #669, but just in a different part of objtool. Specifically, the new static call processing code and the proposed mcount patch both depend on section symbols, so if either of these occur in a section for which a symbol is missing, objtool is going to segfault. This doesn't appear to be a problem with static calls right now (or we would have noticed it), but the mcount patch triggers this quite often. I fixed this in commit 54d837e5119bd5a15593820ca1585ca4e4f3e2a4 for now.
It sounds like CrOS is hitting this now trying to move to LLVM_IAS=1: https://bugs.chromium.org/p/chromium/issues/detail?id=1148073 cc @jcai19
With defconfig+FUNCTION_TRACER, I see this in:
init/initramfs.o kernel/elfcore.o
Sami, I think https://github.com/ClangBuiltLinux/linux/commit/54d837e5119bd5a15593820ca1585ca4e4f3e2a4 no longer applies on linux-next?
> Sami, I think 54d837e no longer applies on linux-next?
That's because it only fixes the mcount pass (commit 0271fa5f8566b79f07c905922321ecc70b697b4c), which isn't upstream yet. You probably need an identical fix for the static call pass instead, assuming that's where it crashes.
> Sami, I think 54d837e no longer applies on linux-next?
> That's because it only fixes the mcount pass (commit 0271fa5), which isn't upstream yet.
May I know what dependencies are needed to back port https://github.com/ClangBuiltLinux/linux/commit/0271fa5f8566b79f07c905922321ecc70b697b4c and https://github.com/ClangBuiltLinux/linux/commit/54d837e5119bd5a15593820ca1585ca4e4f3e2a4 into 5.4? While trying to test them on 5.4, I realized there were many dependencies I needed to cherry-pick/back-port in order to apply these two patches cleanly. For example, https://github.com/ClangBuiltLinux/linux/commit/0271fa5f8566b79f07c905922321ecc70b697b4c seems to be based on upstream commit 0f1441b44e823a74f3f3780902a113e07c73fbf6, which is not in 5.4 yet, but I could not cherry-pick it into stable/linux-5.4.y branch cleanly as its dependencies were also missing.
> You probably need an identical fix for the static call pass instead, assuming that's where it crashes.
Just to be clear, does that mean https://github.com/ClangBuiltLinux/linux/commit/0271fa5f8566b79f07c905922321ecc70b697b4c and https://github.com/ClangBuiltLinux/linux/commit/54d837e5119bd5a15593820ca1585ca4e4f3e2a4 are not enough to fix this issue? Thanks.
> Just to be clear, does that mean 0271fa5 and 54d837e are not enough to fix this issue? Thanks.
After actually looking at the CrOS bug, I'm guessing it's the same as the original recordmcount issue and these objtool patches are not going to help here. Both issues have the same root cause though, Clang not always generating section symbols, but you'll need to fix this in recordmcount instead.
I think @arndb just sent patches for this that got picked up by akpm: https://lore.kernel.org/lkml/20201204165742.3815221-1-arnd@kernel.org/
The patches I sent just work around the problem by avoiding the weak functions in those files, the bug is still there and could show up any time another file has only weak functions in the .text section.
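For context on why "only weak functions" matters, here is a simplified sketch of the kind of lookup that fails. It is an assumption-laden illustration, not a copy of recordmcount: find_base_symbol() and both symbol tables are hypothetical, and the idea is that the tool needs some local or global symbol in the section to express relocations against, skipping weak symbols (presumably because a weak definition can be overridden by a strong one from another object).

```c
#include <elf.h>
#include <stddef.h>

/*
 * Return the index of the first STB_LOCAL or STB_GLOBAL symbol defined
 * in section shndx, or -1 if the section only contains weak symbols
 * (or no symbols at all).
 */
int find_base_symbol(const Elf32_Sym *syms, size_t nsyms, unsigned int shndx)
{
	for (size_t i = 0; i < nsyms; i++) {
		unsigned char bind = ELF32_ST_BIND(syms[i].st_info);

		if (syms[i].st_shndx != shndx)
			continue;
		if (bind == STB_LOCAL || bind == STB_GLOBAL)
			return (int)i;
	}
	return -1;
}

/* Hypothetical object: section 4 holds a single weak function, no section symbol. */
static const Elf32_Sym weak_only_syms[] = {
	{ .st_info = ELF32_ST_INFO(STB_LOCAL,  STT_NOTYPE), .st_shndx = SHN_UNDEF },
	{ .st_info = ELF32_ST_INFO(STB_LOCAL,  STT_FILE),   .st_shndx = SHN_ABS },
	{ .st_info = ELF32_ST_INFO(STB_WEAK,   STT_FUNC),   .st_shndx = 4 },
	{ .st_info = ELF32_ST_INFO(STB_GLOBAL, STT_NOTYPE), .st_shndx = SHN_UNDEF },
};

/* Same object, but with an STT_SECTION symbol emitted for section 4. */
static const Elf32_Sym with_section_syms[] = {
	{ .st_info = ELF32_ST_INFO(STB_LOCAL, STT_NOTYPE),  .st_shndx = SHN_UNDEF },
	{ .st_info = ELF32_ST_INFO(STB_LOCAL, STT_FILE),    .st_shndx = SHN_ABS },
	{ .st_info = ELF32_ST_INFO(STB_LOCAL, STT_SECTION), .st_shndx = 4 },
	{ .st_info = ELF32_ST_INFO(STB_WEAK,  STT_FUNC),    .st_shndx = 4 },
};
```

find_base_symbol(weak_only_syms, 4, 4) returns -1, which corresponds to the "Cannot find symbol for section" failure, whereas find_base_symbol(with_section_syms, 4, 4) returns index 2. Avoiding weak functions in a file only sidesteps the first case; it does not make the section symbol appear.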
With these patches I was able to build and boot an x86_64 kernel with LLVM=1 and LLVM_IAS=1.
Both patches are in Linux v5.10, and the linux-stable trees have recently started carrying them.
$ git log --oneline | grep 'initramfs: fix clang build failure'
55d5b7dd6451 initramfs: fix clang build failure
$ git describe --contains 55d5b7dd6451
v5.10~14^2~3
$ git log --oneline | grep 'elfcore: fix building with clang'
6e7b64b9dd6d elfcore: fix building with clang
$ git describe --contains 6e7b64b9dd6d
v5.10~14^2~2
Looks like the PowerPC folks are getting bit by this too:
https://github.com/linuxppc/issues/issues/388
@emojifreak reported issues with ARCH=mips allmodconfig + CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y:
$ echo "CONFIG_LD_DEAD_CODE_DATA_ELIMINATION=y
CONFIG_MIPS32_O32=n" >>kernel/configs/repro.config
$ make -skj"$(nproc)" ARCH=mips LLVM=1 distclean allmodconfig repro.config init/calibrate.o
...
Cannot find symbol for section 8: .text.calibrate_delay_is_known.
init/calibrate.o: failed
...
KCOV helps reproduce it but I doubt it is strictly related to the issue. cvise spits out:
$ cat calibrate.i
long __attribute__((weak)) calibrate_delay_is_known() { return 0; }
$ clang --target=mipsel-linux-gnu -fsanitize-coverage=trace-pc -ffunction-sections -c calibrate.i
$ ./recordmcount calibrate.o
Cannot find symbol for section 4: .text.calibrate_delay_is_known.
calibrate.o: failed
$ llvm-objdump -x calibrate.o
calibrate.o: file format elf32-mips
architecture: mipsel
start address: 0x00000000
Program Header:
Dynamic Section:
Sections:
Idx Name Size VMA Type
0 00000000 00000000
1 .strtab 000000c0 00000000
2 .text 00000000 00000000 TEXT
3 .mdebug.abi32 00000000 00000000
4 .text.calibrate_delay_is_known 00000034 00000000 TEXT
5 .rel.text.calibrate_delay_is_known 00000008 00000000
6 .pdr 00000020 00000000
7 .rel.pdr 00000008 00000000
8 .comment 00000016 00000000
9 .note.GNU-stack 00000000 00000000
10 .data 00000000 00000000 DATA
11 .bss 00000000 00000000 BSS
12 .reginfo 00000018 00000000
13 .MIPS.abiflags 00000018 00000000
14 .llvm_addrsig 00000001 00000000
15 .symtab 00000040 00000000
SYMBOL TABLE:
00000000 l df *ABS* 00000000 calibrate.i
00000000 w F .text.calibrate_delay_is_known 00000034 calibrate_delay_is_known
00000000 *UND* 00000000 __sanitizer_cov_trace_pc
RELOCATION RECORDS FOR [.text.calibrate_delay_is_known]:
OFFSET TYPE VALUE
00000010 R_MIPS_26 __sanitizer_cov_trace_pc
RELOCATION RECORDS FOR [.pdr]:
OFFSET TYPE VALUE
00000000 R_MIPS_32 calibrate_delay_is_known
Without -fsanitize-coverage=trace-pc:
$ clang --target=mipsel-linux-gnu -ffunction-sections -c calibrate.i
$ ./recordmcount calibrate.o
$ llvm-objdump -x calibrate.o
calibrate.o: file format elf32-mips
architecture: mipsel
start address: 0x00000000
Program Header:
Dynamic Section:
Sections:
Idx Name Size VMA Type
0 00000000 00000000
1 .strtab 000000a3 00000000
2 .text 00000000 00000000 TEXT
3 .mdebug.abi32 00000000 00000000
4 .text.calibrate_delay_is_known 0000002c 00000000 TEXT
5 .pdr 00000020 00000000
6 .rel.pdr 00000008 00000000
7 .comment 00000016 00000000
8 .note.GNU-stack 00000000 00000000
9 .data 00000000 00000000 DATA
10 .bss 00000000 00000000 BSS
11 .reginfo 00000018 00000000
12 .MIPS.abiflags 00000018 00000000
13 .llvm_addrsig 00000000 00000000
14 .symtab 00000030 00000000
SYMBOL TABLE:
00000000 l df *ABS* 00000000 calibrate.i
00000000 w F .text.calibrate_delay_is_known 0000002c calibrate_delay_is_known
RELOCATION RECORDS FOR [.pdr]:
OFFSET TYPE VALUE
00000000 R_MIPS_32 calibrate_delay_is_known
There is a new instance of this problem after commit dbe69b299884 ("bpf: Fix dispatcher patchable function entry to 5 bytes nop") for certain configurations:
$ make -skj"$(nproc)" ARCH=powerpc CROSS_COMPILE=powerpc-linux-gnu- LLVM=1 mrproper powernv_defconfig all
Cannot find symbol for section 4: .init.text.
kernel/bpf/dispatcher.o: failed
https://github.com/linuxppc/issues/issues/388 alludes to this issue. Looks like binutils reverted dropping section symbols just for ppc: https://github.com/bminor/binutils-gdb/commit/c09c8b42021180eee9495bd50d8b35e683d3901b cc @MaskRay
That's annoying :/ for what it's worth, I have seen that error on i386 as well, so it is not just powerpc that is affected by this.
I think recordmcount is only run for ftrace so maybe a diff like this would help out?
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index e9e95c790b8e..233836893fd8 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -744,6 +744,7 @@ config FTRACE_MCOUNT_USE_RECORDMCOUNT
 	depends on !FTRACE_MCOUNT_USE_CC
 	depends on !FTRACE_MCOUNT_USE_OBJTOOL
 	depends on FTRACE_MCOUNT_RECORD
+	depends on !AS_IS_LLVM
 
 config TRACING_MAP
 	bool
While that diff stops the build error because it disables the use of recordmcount, it does not prevent ftrace from being selected altogether, which may lead to further reports of ftrace not working, despite being selected. We might be able to fix that error in a similar manner as Arnd's previous patches but I am not sure how to go about that...
> I am not sure how to go about that...
More specifically, I only tried removing __init from bpf_arch_init_dispatcher_early() in kernel/bpf/dispatcher.c but that is not enough since the declaration in include/linux/bpf.h wins. We cannot remove __init altogether as the x86 version of bpf_arch_init_dispatcher_early() calls text_poke_early(), which is marked __init_or_module, which expands to nothing if CONFIG_MODULES is enabled or __init if not. With that in mind, the following diff resolves the failure that I note above for that specific configuration; so far, I have only seen that failure in three different configurations. It will still be reproducible with CONFIG_MODULES disabled but that is probably okay for now. I can send this as a formal patch on Monday if it seems reasonable.
diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 00127abd89ee..4145939bbb6a 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -389,7 +389,7 @@ static int __bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
 	return ret;
 }
 
-int __init bpf_arch_init_dispatcher_early(void *ip)
+int __init_or_module bpf_arch_init_dispatcher_early(void *ip)
 {
 	const u8 *nop_insn = x86_nops[5];
 
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 0566705c1d4e..4aa7bde406f5 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -971,7 +971,7 @@ struct bpf_trampoline *bpf_trampoline_get(u64 key,
 					  struct bpf_attach_target_info *tgt_info);
 void bpf_trampoline_put(struct bpf_trampoline *tr);
 int arch_prepare_bpf_dispatcher(void *image, void *buf, s64 *funcs, int num_funcs);
-int __init bpf_arch_init_dispatcher_early(void *ip);
+int __init_or_module bpf_arch_init_dispatcher_early(void *ip);
 
 #define BPF_DISPATCHER_INIT(_name) { \
 	.mutex = __MUTEX_INITIALIZER(_name.mutex), \
diff --git a/kernel/bpf/dispatcher.c b/kernel/bpf/dispatcher.c
index 04f0a045dcaa..e14a68e9a74f 100644
--- a/kernel/bpf/dispatcher.c
+++ b/kernel/bpf/dispatcher.c
@@ -91,7 +91,7 @@ int __weak arch_prepare_bpf_dispatcher(void *image, void *buf, s64 *funcs, int n
 	return -ENOTSUPP;
 }
 
-int __weak __init bpf_arch_init_dispatcher_early(void *ip)
+int __weak __init_or_module bpf_arch_init_dispatcher_early(void *ip)
 {
 	return -ENOTSUPP;
 }
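As a rough illustration of why the diff above only helps when CONFIG_MODULES is enabled, here is a standalone sketch of how __init_or_module behaves. The macro definitions are simplified stand-ins rather than the kernel's real ones, CONFIG_MODULES is defined by hand to model a modules-enabled config, and init_or_module_is_empty() and bpf_arch_init_dispatcher_early_demo() are hypothetical names.

```c
#include <string.h>

/* Model a configuration with CONFIG_MODULES=y (assumption for this demo). */
#define CONFIG_MODULES 1

/* Simplified stand-in: place the function in .init.text. */
#define __init __attribute__((__section__(".init.text")))

#ifdef CONFIG_MODULES
#define __init_or_module		/* expands to nothing: stays in regular text */
#else
#define __init_or_module __init		/* falls back to .init.text */
#endif

/* Two-level stringification so the argument is macro-expanded first. */
#define STR_(x) #x
#define STR(x) STR_(x)

/* Returns 1 if __init_or_module expanded to nothing, 0 otherwise. */
int init_or_module_is_empty(void)
{
	return strlen(STR(__init_or_module)) == 0;
}

/* Hypothetical stand-in for the weak default; -1 replaces -ENOTSUPP here. */
int __init_or_module bpf_arch_init_dispatcher_early_demo(void *ip)
{
	(void)ip;
	return -1;
}
```

With CONFIG_MODULES defined, the annotation disappears and the function no longer lands in .init.text, so recordmcount has nothing to trip over in that section. Removing the #define CONFIG_MODULES line makes the annotation equal to __init again, which matches the remark that the failure stays reproducible with CONFIG_MODULES disabled.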
Patch submitted: https://lore.kernel.org/20221031173819.2344270-1-nathan@kernel.org/
It sounds like the original patch that caused the recent bpf issue might get reverted in favor of a different fix:
https://lore.kernel.org/Y2DRVwI4bNUppmXJ@krava/
https://lore.kernel.org/87iljyyes6.fsf@all.your.base.are.belong.to.us/
Sent a fix for another instance of this problem: https://lore.kernel.org/lkml/20230414080418.110236-1-arnd@kernel.org/T/#u
Using AS=clang to build with integrated-as, on x86_64, when scripts/recordmcount is run on certain objects (for me it happens with init/initramfs.o and kernel/elfcore.o at least) I get the error in the title.