draios / sysdig

Linux system exploration and troubleshooting tool with first class support for containers
http://www.sysdig.com/

Compile error for BPF driver on arm64 GKE server #2057

Closed albe19029 closed 5 months ago

albe19029 commented 5 months ago

When I try to build the bpf driver on an arm64 GKE server, I get the error listed in the attached file1.txt.

As far as I can see, the link https://storage.googleapis.com/cos-tools/17412.156.23/kernel-headers.tgz is invalid; for arm64 it should be

https://storage.googleapis.com/cos-tools/17412.156.23/lakitu-arm64/kernel-headers.tgz (from GKE docs - https://cloud.google.com/container-optimized-os/docs/resources/sources)

But even when I add the following fix to scap-driver-loader:

if [ "${ARCH}" == "aarch64" ]; then
    BPF_KERNEL_SOURCES_URL="https://storage.googleapis.com/cos-tools/${BUILD_ID}/lakitu-arm64/kernel-headers.tgz"
else
    BPF_KERNEL_SOURCES_URL="https://storage.googleapis.com/cos-tools/${BUILD_ID}/kernel-headers.tgz"
fi

I manage to build the driver, but it doesn't run. During compilation I get the output listed in the attached file2.txt.

Can you please help me fix this error correctly? Thanks in advance.
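For illustration, the architecture-dependent URL choice described above can be wrapped in a small function. This is a minimal sketch, not the actual scap-driver-loader code: the function name `cos_headers_url` is hypothetical, and accepting `arm64` as an alias for `aarch64` is an assumption. The `lakitu-arm64` path segment is taken from the URL quoted in this issue.

```shell
# Hypothetical helper: print the COS kernel-headers URL for an architecture.
# $1 = architecture as reported by `uname -m`, $2 = COS BUILD_ID.
cos_headers_url() {
    arch="$1"
    build_id="$2"
    base="https://storage.googleapis.com/cos-tools/${build_id}"
    case "$arch" in
        # arm64 images keep their headers under the lakitu-arm64 directory.
        aarch64|arm64) echo "${base}/lakitu-arm64/kernel-headers.tgz" ;;
        *)             echo "${base}/kernel-headers.tgz" ;;
    esac
}

# Example: URL for the current machine and the build discussed in this issue.
cos_headers_url "$(uname -m)" "17412.156.23"
```

Keeping the choice in one function makes it easy to check both branches without downloading anything.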

therealbobo commented 5 months ago

Hi @albe19029! Could you provide more context on why it doesn't run?

albe19029 commented 5 months ago

In my logs I get the following error:

libscap: bpf_load_program() event=raw_tracepoint/filler/sys_procexit_e: Operation not permitted

But file2.txt also contains an error:

llc -march=bpf -filetype=obj -o /usr/src/scap-6.0.1+driver/bpf/probe.o /usr/src/scap-6.0.1+driver/bpf/probe.ll
MODPOST /usr/src/scap-6.0.1+driver/bpf/Module.symvers
/bin/sh: scripts/mod/modpost: cannot execute binary file: Exec format error

I think these two problems are related.

albe19029 commented 5 months ago

@therealbobo is there any information required to reproduce the issue?

therealbobo commented 5 months ago

Hey @albe19029! Thank you for the issue! We are investigating it! Just out of curiosity: why don't you try the modern ebpf probe? It doesn't require any additional compilation :)

albe19029 commented 5 months ago

To be honest, I didn't think about it. For x64 we needed to support older kernels. But for arm64 the version with which everything works stably is 5.8. So it makes sense. I'll try and let you know the results.

therealbobo commented 5 months ago

Are you encountering the same problem on x64?

albe19029 commented 5 months ago

No, on x64 everything works perfectly.

On arm64, bugs like the ones listed here block us from using scap in production:

https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.5

When code reads valid user-space memory from kernel code (bpf_probe_read*), the kernel sometimes reports the address as invalid. It works reliably only from kernel 5.8 onward. I didn't find which commit in 5.8 fully fixed the issue, but only from that version does the arm64 user-space check logic work correctly for valid cases.

therealbobo commented 5 months ago

That's strange! You could open an issue on https://github.com/falcosecurity/libs : sysdig just uses libscap from there as building block :) BTW please let me know if the modern bpf works smoothly!

albe19029 commented 5 months ago

Well, I saw there was a workaround for clone and execve on their side (https://github.com/falcosecurity/libs/issues/1605), and those changes helped us a lot. But since the memory-access fixes were in the arm64 kernel code rather than in bpf (even the scap kernel module driver fails), I thought it would be hard to fix on the https://github.com/falcosecurity/libs side as well.

therealbobo commented 5 months ago

Looking around, the header issue seems to be related to arm64 only.

albe19029 commented 5 months ago

Correct, we faced this issue only on arm64, and only on GKE servers (Azure and AWS work correctly).

albe19029 commented 5 months ago

There is a variable for the bpf driver, SYSDIG_BPF_PROBE, but how can I enable modern bpf?

therealbobo commented 5 months ago

just use the --modern-bpf cli flag :)

albe19029 commented 5 months ago

And what if I use scap-driver-loader to build the driver and then use the resulting file in my code?

therealbobo commented 5 months ago

You don't need it! The modern bpf probe is already compiled and bundled inside the sysdig binary :)
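As a rough illustration of the choice discussed above, the sketch below checks whether a kernel release string is at least a given major.minor version before suggesting the `--modern-bpf` flag. The helper name `kernel_at_least` is hypothetical, and using 5.8 as the threshold is an assumption based on the kernel version mentioned earlier in this thread.

```shell
# Hypothetical helper: succeed if a kernel release (e.g. "5.15.120+")
# is at least MAJOR.MINOR. $1 = release, $2 = major, $3 = minor.
kernel_at_least() {
    release="$1"; want_major="$2"; want_minor="$3"
    major="${release%%.*}"          # text before the first dot
    rest="${release#*.}"            # text after the first dot
    minor="${rest%%[!0-9]*}"        # leading digits of the remainder
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}

# Assumption: 5.8 is used as the cut-off, as discussed in this thread.
if kernel_at_least "$(uname -r)" 5 8; then
    echo "kernel looks recent enough: try 'sysdig --modern-bpf'"
else
    echo "older kernel: use the bpf probe (SYSDIG_BPF_PROBE) or the kernel module"
fi
```

The script only prints a suggestion; it does not start sysdig itself.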

albe19029 commented 5 months ago

Sorry for the delay; it took me some time to build modern bpf for our project. Unfortunately, when I ran our project's tests I saw event loss errors. Debugging these errors will take time, but the modern bpf probe clearly behaves differently from the old one.

albe19029 commented 5 months ago

Is there any update on the bpf error on GKE? Could you reproduce the issue, and do you know how to fix it? I just want to understand whether a fix will land in 1-2 weeks or whether we should wait a bit longer. Thanks.

As for modern bpf, we plan to migrate to it. Since we hit errors, we will investigate them and open an issue with a description at https://github.com/falcosecurity/libs. But that will probably happen a bit later (I will discuss the timing with the team).

therealbobo commented 5 months ago

Could you please check out if you have the div64.h header somewhere? 🤔

albe19029 commented 5 months ago

No, we don't. The only div64.h we have is from this archive https://storage.googleapis.com/cos-tools/17412.156.23/kernel-headers.tgz

albe19029 commented 5 months ago

If I use this link: https://storage.googleapis.com/cos-tools/17412.156.23/lakitu-arm64/kernel-headers.tgz I get the following div64.h files:

./include/asm-generic/div64.h
./arch/arm64/include/generated/asm/div64.h
./arch/arm/include/asm/div64.h
./arch/m68k/include/asm/div64.h
./arch/alpha/include/asm/div64.h
./arch/x86/include/asm/div64.h
./arch/ia64/include/asm/div64.h
./arch/mips/include/asm/div64.h

If I use https://storage.googleapis.com/cos-tools/17412.156.23/kernel-headers.tgz:

./include/asm-generic/div64.h
./arch/arm/include/asm/div64.h
./arch/m68k/include/asm/div64.h
./arch/alpha/include/asm/div64.h
./arch/x86/include/asm/div64.h
./arch/ia64/include/asm/div64.h
./arch/mips/include/asm/div64.h
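A comparison like the one above can be scripted. The following is a minimal sketch under stated assumptions: both archives have already been downloaded locally, and the function names `headers_in_archive` and `only_in_second` are hypothetical. It lists the paths of a given header in each archive and prints those that appear only in the second one.

```shell
# Hypothetical helper: list paths in a .tgz whose file name equals $2.
headers_in_archive() {
    tar -tzf "$1" | awk -F/ -v h="$2" '$NF == h' | sort
}

# Hypothetical helper: print header paths present in archive $2 but not in $1.
# $3 is the header file name to look for, e.g. div64.h.
only_in_second() {
    a=$(mktemp); b=$(mktemp)
    headers_in_archive "$1" "$3" > "$a"
    headers_in_archive "$2" "$3" > "$b"
    comm -13 "$a" "$b"      # lines unique to the second list
    rm -f "$a" "$b"
}

# Usage (file names are assumptions, the archives must be downloaded first):
#   only_in_second generic-headers.tgz arm64-headers.tgz div64.h
```

For the two lists quoted above, the only difference is the generated arm64 header, ./arch/arm64/include/generated/asm/div64.h.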

therealbobo commented 5 months ago

It might be enough doing something like sudo ln -s /usr/include/asm-generic /usr/include/asm 🤔

albe19029 commented 5 months ago

And which version of the kernel headers should I use: lakitu-arm64 or the current one?

therealbobo commented 5 months ago

I'd bet on the current one but a quick uname -a will probably give you the correct answer :)

albe19029 commented 5 months ago

uname -a
Linux gke-qa-dec-2028-18-12--8-default-pool-f35026d3-k2tn 5.15.120+ #1 SMP Sat Aug 19 11:17:43 UTC 2023 aarch64 GNU/Linux

cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
GOOGLE_METRICS_PRODUCT_ID=26
GOOGLE_CRASH_ID=Lakitu-arm
KERNEL_COMMIT_ID=f0d6dcd5188bababf189e3aede8360342859fcb8
VERSION=105
VERSION_ID=105
BUILD_ID=17412.156.23
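The ID and BUILD_ID fields shown above are the ones that identify a COS host and its header archive. A minimal sketch for reading them from an os-release style file follows; the helper name `os_release_field` is hypothetical and this is not how scap-driver-loader itself necessarily reads the file.

```shell
# Hypothetical helper: print the value of one KEY=value field from an
# os-release style file, with surrounding double quotes removed.
# $1 = file path, $2 = key name (e.g. ID or BUILD_ID).
os_release_field() {
    file="$1"; key="$2"
    # ^KEY= is anchored, so ID does not match VERSION_ID or BUILD_ID.
    sed -n "s/^${key}=//p" "$file" | tr -d '"' | head -n 1
}

# Example on the local machine, if the file exists.
if [ -r /etc/os-release ]; then
    echo "ID=$(os_release_field /etc/os-release ID)"
    echo "BUILD_ID=$(os_release_field /etc/os-release BUILD_ID)"
fi
```

On distributions that do not define BUILD_ID the second line simply prints an empty value.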

therealbobo commented 5 months ago

No luck there. Could you please check the /usr/include directory? Please keep an eye out for any symbolic links there.

albe19029 commented 5 months ago

On the host system there is no /usr/include directory. In the container, /usr/include contains files from Red Hat Enterprise Linux 8.

What should I check there?

albe19029 commented 5 months ago

To reproduce the issue I use the following YAML (the file is saved as .txt): scap.txt

Then I apply this file on any arm64 GKE Kubernetes cluster: kubectl apply -f scap.yaml

Then I attach to the pod: kubectl exec --stdin --tty sysdig-0341 -- /bin/bash

There I run scap-driver-loader and get the div64.h error.

After editing /usr/bin/scap-driver-loader (to link to the arm kernel headers) I run scap-driver-loader again and get the second problem.

As you can see, I share only /etc and /boot from the host, so there can't be any conflict, since I use the docker.io/sysdig/sysdig:0.34.1 image.

albe19029 commented 5 months ago

I have checked scripts/mod/modpost from both kernel header archives. For https://storage.googleapis.com/cos-tools/17412.156.23/lakitu-arm64/kernel-headers.tgz I get the following result:

file modpost
modpost: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[xxHash]=96cdb1cdfa76c1f3, not stripped

For https://storage.googleapis.com/cos-tools/17412.156.23/kernel-headers.tgz I get the following result:

file modpost
modpost: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[xxHash]=96cdb1cdfa76c1f3, not stripped

So the arm64 archive ships an x86-64 modpost binary, which is invalid on arm64. I will continue to investigate why.
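When the `file` utility is not available in a container, the same check can be approximated by reading the ELF header directly. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, and it assumes a little-endian ELF file, which holds for x86-64 and aarch64 Linux binaries. The `e_machine` field is a 2-byte value at offset 18, with 62 meaning x86-64 and 183 meaning aarch64.

```shell
# Hypothetical helper: print the ELF e_machine value of a file in decimal.
elf_machine() {
    # od prints the two bytes at offset 18 as decimals, e.g. "  62   0".
    set -- $(od -An -tu1 -j18 -N2 "$1")
    echo $(( $1 + $2 * 256 ))       # little-endian 16-bit value
}

# Hypothetical helper: map e_machine to a readable architecture name.
elf_machine_name() {
    case "$(elf_machine "$1")" in
        62)  echo "x86-64" ;;
        183) echo "aarch64" ;;
        *)   echo "other" ;;
    esac
}

# Usage (path is an assumption based on the build log in this issue):
#   elf_machine_name "$KERNELDIR/scripts/mod/modpost"
```

A result that differs from the host architecture would explain an "Exec format error" like the one in the build output.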

albe19029 commented 5 months ago

So I added an arm64 modpost binary (borrowed from an AWS kernel) with this change:

if [ "${TARGET_ID}" == "cos" ] && [ "${ARCH}" == "aarch64" ]; then
    cp /modpost "$KERNELDIR/scripts/mod"
fi

After that, the compilation finished successfully. But the code still does not run, so there are probably errors in lakitu-arm64/kernel-headers.tgz.

For now I have no ideas on how to fix it or investigate further.

albe19029 commented 5 months ago

Have created a bug for ChromeOS team. https://issuetracker.google.com/issues/321501036

therealbobo commented 5 months ago

That's not the first time I encounter this:

modpost: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[xxHash]=96cdb1cdfa76c1f3, not stripped

In my opinion, you should create a symlink at ./include/asm that points to ./include/asm-generic. Other than that, I'm out of ideas too :/

albe19029 commented 5 months ago

Well, the file /usr/src/linux-headers-5.15.120+/arch/arm64/include/generated/asm/div64.h has the following content:

#include <asm-generic/div64.h>

Is that correct?

therealbobo commented 5 months ago

Oops, I missed the part where you said the compilation was successful. Can you attach the logs of the build?

albe19029 commented 5 months ago

success_logs.txt

I get these logs after changing the link to https://storage.googleapis.com/cos-tools/17412.156.23/lakitu-arm64/kernel-headers.tgz and replacing modpost with a valid one.

I also added --trace to make and removed > /dev/null, changing

make -C "/usr/src/${DRIVER_NAME}-${DRIVER_VERSION}/bpf" > /dev/null

to

make -C "/usr/src/${DRIVER_NAME}-${DRIVER_VERSION}/bpf" --trace

therealbobo commented 5 months ago

That's great! But how is sysdig failing? Could you share that log? 🤔

albe19029 commented 5 months ago

Yes, sysdig failed. Here is a log. sysdig_log.txt

albe19029 commented 5 months ago

I have found that on x64, starting from sysdig 0.33.1, sysdig does not work either. It fails with the same error:

libscap: bpf_load_program() event=raw_tracepoint/filler/sys_procexit_e: Operation not permitted

On sysdig 0.32.1 everything works. So maybe the error is not arm64-specific but common to both architectures.

albe19029 commented 5 months ago

I have checked sysdig 0.32.1 on arm with the link fix, and sysdig works correctly. So there is definitely a regression on COS starting from sysdig 0.33.1.

albe19029 commented 5 months ago

1) So this fix is correct:

if [ "${ARCH}" == "aarch64" ]; then
    BPF_KERNEL_SOURCES_URL="https://storage.googleapis.com/cos-tools/${BUILD_ID}/lakitu-arm64/kernel-headers.tgz"
else
    BPF_KERNEL_SOURCES_URL="https://storage.googleapis.com/cos-tools/${BUILD_ID}/kernel-headers.tgz"
fi

2) Even with the GKE binary corruption, sysdig 0.32.1 works correctly on both arm64 and x64 with the link fix.

3) Starting from sysdig 0.33.1, COS does not work on either arm64 or x64.

albe19029 commented 5 months ago

Good day, I have found the problem. This commit causes problems on GKE:

https://github.com/falcosecurity/libs/commit/1e06bd3f4f8bb9244caf4e33d5d110c482d88ee5

There is a loop with two maximum values:

#define MAX_THREADS_GROUPS 30

#define MAX_HIERARCHY_TRAVERSE 60

For the COS kernel these are too big, which leads to these errors:

processed 40396 insns (limit 1000000) max_states_per_insn 1 total_states 4057 peak_states 4057 mark_read 73
-- END PROG LOAD LOG --
libscap: bpf_load_program() event=raw_tracepoint/filler/sys_procexit_e: Operation not permitted

Now this message is clear: from the BPF verifier's point of view, the sys_procexit_e function has more than 1M instructions. I tested a bit and found that with the values

#define MAX_THREADS_GROUPS 25

#define MAX_HIERARCHY_TRAVERSE 35

the code also works on both arm64 and x64. So I will create an issue for the Falco libs team.
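When comparing verifier logs across different values of these constants, it can help to pull out the processed-instruction count and the limit from the summary line. The sketch below is a minimal, hypothetical helper (the name `verifier_insns` is not part of sysdig or libs) that reads a log on standard input and prints both numbers.

```shell
# Hypothetical helper: read a BPF verifier log on stdin and print
# "<processed> <limit>" for each "processed N insns (limit M)" line.
verifier_insns() {
    sed -n 's/.*processed \([0-9][0-9]*\) insns (limit \([0-9][0-9]*\)).*/\1 \2/p'
}

# Usage (log file name is an assumption):
#   verifier_insns < sysdig_log.txt
```

Running it on logs from builds with different MAX_THREADS_GROUPS and MAX_HIERARCHY_TRAVERSE values gives a quick way to see how the verifier's workload changes.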

albe19029 commented 5 months ago

As I understand it, the only thing I need from your side is a fix for the arm64 link:

https://storage.googleapis.com/cos-tools/17412.156.23/lakitu-arm64/kernel-headers.tgz

After that the ticket can be closed.

therealbobo commented 5 months ago

Hey @albe19029! Thank you so much for the in-depth investigation! Great catch!

albe19029 commented 5 months ago

This is a ticket for falco lib team. https://github.com/falcosecurity/libs/issues/1639

albe19029 commented 5 months ago

@therealbobo Sorry, you closed the ticket, but what about the invalid link? As far as I can see, it is still not fixed.

albe19029 commented 5 months ago

The fix for https://github.com/falcosecurity/libs/issues/1639 is ready. Do you know when the next sysdig release will be, and is it possible to include this fix in it? The current 0.34.1 is broken on GKE.

therealbobo commented 5 months ago

The next sysdig release is coming in the next few days. I have to double-check, but I think we can apply this patch :)

albe19029 commented 5 months ago

That would be great, as this bug is blocking us badly. Thanks in advance.

albe19029 commented 5 months ago

@therealbobo Will there be a fix for the invalid link in scap-driver-loader.in?

Also, 0.35.0 was released but without the COS driver fix. Do you know when there will be a patch release?

therealbobo commented 5 months ago

Hey @albe19029! I just released 0.35.1 with all the fixes! Please let me know if you encounter any problem! :)

albe19029 commented 5 months ago

Thanks a lot, will check this version and let you know about the results.