Closed albe19029 closed 10 months ago
Hi @albe19029! Could you provide more context on why it doesn't run?
In my logs I get the following error: libscap: bpf_load_program() event=raw_tracepoint/filler/sys_procexit_e: Operation not permitted
But in file2.txt there is also an error:
llc -march=bpf -filetype=obj -o /usr/src/scap-6.0.1+driver/bpf/probe.o /usr/src/scap-6.0.1+driver/bpf/probe.ll
MODPOST /usr/src/scap-6.0.1+driver/bpf/Module.symvers
/bin/sh: scripts/mod/modpost: cannot execute binary file: Exec format error
I think these two problems are related.
@therealbobo is there any information required to reproduce the issue?
Hey @albe19029! Thank you for the issue! We are investigating it! Just out of curiosity: why don't you try the modern ebpf probe? It doesn't require any additional compilation :)
To be honest, I didn't think about it. For x64 we needed to support older kernels. But for arm64 the version with which everything works stably is 5.8. So it makes sense. I'll try and let you know the results.
Are you encountering the same problem on x64?
No, on x64 everything is working perfectly.
For arm64, bugs like this one block us from using scap in production:
https://cdn.kernel.org/pub/linux/kernel/v5.x/ChangeLog-5.5
When kernel code tries to read valid user-space memory (bpf_probe_read*), it sometimes reports the memory as invalid. It works stably only starting from kernel 5.8. I didn't find which commit in 5.8 fully fixed the issue, but only from that version on does the arm64 user-space check logic work correctly for valid cases.
That's strange! You could open an issue on https://github.com/falcosecurity/libs : sysdig just uses libscap from there as a building block :) BTW please let me know if the modern bpf works smoothly!
Well, I saw there was a workaround for clone and execve on their side (https://github.com/falcosecurity/libs/issues/1605), and these changes helped us a lot. But since the memory-access fixes were not in bpf (even the scap kernel module driver fails) but in the arm64 kernel code, I thought it would be hard to fix on the https://github.com/falcosecurity/libs side as well.
Looking around the header issue seems related to arm64 only.
Correct, we faced this issue only on arm64, and only on GKE servers (Azure and AWS work correctly).
There is a variable for the bpf driver, SYSDIG_BPF_PROBE, but how can I enable modern bpf?
Just use the --modern-bpf CLI flag :)
And what if I use scap-driver-loader to build the driver, and then use the resulting file in my code?
You don't need it! The modern bpf probe is already compiled and bundled inside the sysdig binary :)
Sorry for the delay, it took me some time to build modern bpf for our project. Unfortunately, when I ran our project's tests, I saw event loss errors. It will take time to debug these errors, but the behavior of modern bpf and the old probe differs.
Is there any update on the bpf error on GKE? Could you reproduce the issue, and do you maybe know how to fix it? I just want to understand whether there will be a fix in 1-2 weeks or we should wait a bit longer. Thanks.
As for modern bpf, we plan to migrate to it, and since we hit errors we will investigate them and create an issue with a description at https://github.com/falcosecurity/libs. But that will probably be a bit later (I will discuss the timing with the team).
Could you please check whether you have the div64.h header somewhere? 🤔
No, we don't. The only div64.h we have is from this archive https://storage.googleapis.com/cos-tools/17412.156.23/kernel-headers.tgz
If I use this link: https://storage.googleapis.com/cos-tools/17412.156.23/lakitu-arm64/kernel-headers.tgz I get the following div64.h files:
./include/asm-generic/div64.h
./arch/arm64/include/generated/asm/div64.h
./arch/arm/include/asm/div64.h
./arch/m68k/include/asm/div64.h
./arch/alpha/include/asm/div64.h
./arch/x86/include/asm/div64.h
./arch/ia64/include/asm/div64.h
./arch/mips/include/asm/div64.h
If I use https://storage.googleapis.com/cos-tools/17412.156.23/kernel-headers.tgz:
./include/asm-generic/div64.h
./arch/arm/include/asm/div64.h
./arch/m68k/include/asm/div64.h
./arch/alpha/include/asm/div64.h
./arch/x86/include/asm/div64.h
./arch/ia64/include/asm/div64.h
./arch/mips/include/asm/div64.h
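The comparison of the two archives can be repeated with a small helper. This is only a sketch, assuming each kernel-headers.tgz has already been unpacked into its own directory; the function names `list_div64` and `has_arch_div64` are made up for illustration and are not part of scap-driver-loader.

```shell
#!/usr/bin/env bash

# List every div64.h under an unpacked kernel-headers tree, sorted so that
# the output of two trees can be compared with diff.
list_div64() {
    local tree="$1"
    (cd "$tree" && find . -type f -name div64.h | sort)
}

# Succeed only if the tree ships a div64.h for the given kernel arch
# (for example arm64), either hand-written or generated.
has_arch_div64() {
    local tree="$1" arch="$2"
    list_div64 "$tree" | grep -Eq "^\./arch/${arch}/include/(generated/)?asm/div64\.h$"
}
```

For example, `has_arch_div64 ./lakitu-arm64-headers arm64 || echo "no arm64 div64.h"` (the directory name is hypothetical) would flag a tree like the second listing above, which has no arm64 entry.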
It might be enough to do something like sudo ln -s /usr/include/asm-generic /usr/include/asm 🤔
And which version of the kernel headers should I use: lakitu-arm64 or the current one?
I'd bet on the current one, but a quick uname -a will probably give you the correct answer :)
uname -a
Linux gke-qa-dec-2028-18-12--8-default-pool-f35026d3-k2tn 5.15.120+ #1 SMP Sat Aug 19 11:17:43 UTC 2023 aarch64 GNU/Linux
cat /etc/os-release
NAME="Container-Optimized OS"
ID=cos
PRETTY_NAME="Container-Optimized OS from Google"
HOME_URL="https://cloud.google.com/container-optimized-os/docs"
BUG_REPORT_URL="https://cloud.google.com/container-optimized-os/docs/resources/support-policy#contact_us"
GOOGLE_METRICS_PRODUCT_ID=26
GOOGLE_CRASH_ID=Lakitu-arm
KERNEL_COMMIT_ID=f0d6dcd5188bababf189e3aede8360342859fcb8
VERSION=105
VERSION_ID=105
BUILD_ID=17412.156.23
No luck there. Could you please check the /usr/include directory? Please keep an eye out for any symbolic links present there.
On the host system there is no /usr/include directory. In the container, the /usr/include directory contains files from Red Hat Enterprise Linux 8.
What should I check there?
To reproduce the issue I use the following yaml (the file is saved as txt): scap.txt
Then I apply this file on any GKE Kubernetes cluster (arm64): kubectl apply -f scap.yaml
And then attach to the pod: kubectl exec --stdin --tty sysdig-0341 -- /bin/bash
Then I run scap-driver-loader and get the div64.h error.
After editing /usr/bin/scap-driver-loader (pointing the link to the arm kernel headers) I run scap-driver-loader again and get the second problem.
As you can see, I share only /etc and /boot from the host, so there can't be any conflict, as I use the docker.io/sysdig/sysdig:0.34.1 image.
I have checked scripts/mod/modpost from both kernel header archives and got the following information. For https://storage.googleapis.com/cos-tools/17412.156.23/lakitu-arm64/kernel-headers.tgz I get:
file modpost
modpost: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[xxHash]=96cdb1cdfa76c1f3, not stripped
For https://storage.googleapis.com/cos-tools/17412.156.23/kernel-headers.tgz I get:
file modpost
modpost: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[xxHash]=96cdb1cdfa76c1f3, not stripped
So for arm64 there is an invalid modpost binary. Will continue to investigate why.
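The mismatch between the modpost binary and the host can also be checked mechanically. The following is a rough sketch, not something scap-driver-loader does today: the helper names are hypothetical, and the mapping only covers the two architectures discussed in this thread, relying on the labels that `file` prints for ELF binaries ("x86-64" and "ARM aarch64").

```shell
#!/usr/bin/env bash

# Map a `uname -m` value to the label `file` prints for ELF binaries of that arch.
# Only the two architectures from this thread are covered (assumption).
elf_arch_label() {
    case "$1" in
        x86_64)  echo "x86-64" ;;
        aarch64) echo "ARM aarch64" ;;
        *)       return 1 ;;
    esac
}

# $1: one line of `file` output, $2: machine arch as reported by `uname -m`.
# Returns 0 on match, 1 on mismatch, 2 if the arch is not in the table above.
file_output_matches_arch() {
    local description="$1" label
    label="$(elf_arch_label "$2")" || return 2
    case "$description" in
        *"ELF 64-bit"*"$label"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A possible use before running make: `file_output_matches_arch "$(file -b "$KERNELDIR/scripts/mod/modpost")" "$(uname -m)" || echo "modpost was built for another architecture"`.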
So, after I added an arm64 modpost (borrowed from an AWS kernel):
if [ "${TARGET_ID}" == "cos" ] && [ "${ARCH}" == "aarch64" ]; then
    cp /modpost "$KERNELDIR/scripts/mod"
fi
the compilation finished successfully. But the code still does not run. So there are probably errors in lakitu-arm64/kernel-headers.tgz.
For now I have no ideas how to fix it or investigate further.
I have created a bug for the ChromeOS team: https://issuetracker.google.com/issues/321501036
That's not the first time I encounter this:
modpost: ELF 64-bit LSB shared object, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[xxHash]=96cdb1cdfa76c1f3, not stripped
In my opinion, you should create a symlink to the ./include/asm-generic that points to ./include/asm . Other than that, I'm out of ideas too :/
Well, the file /usr/src/linux-headers-5.15.120+/arch/arm64/include/generated/asm/div64.h has the following content:
Is it good?
Oops, I missed the part where you said the compilation was successful. Can you attach the logs of the build?
These are the logs I get after changing the link to https://storage.googleapis.com/cos-tools/17412.156.23/lakitu-arm64/kernel-headers.tgz and replacing modpost with a valid one.
I also added --trace to make and removed > /dev/null, that is, I changed
make -C "/usr/src/${DRIVER_NAME}-${DRIVER_VERSION}/bpf" > /dev/null
to
make -C "/usr/src/${DRIVER_NAME}-${DRIVER_VERSION}/bpf" --trace
That's great! But how is sysdig failing? Could you share that log? 🤔
Yes, sysdig failed. Here is the log: sysdig_log.txt
I have found that on x64, starting from sysdig 0.33.1, sysdig is not working either, with the same error:
libscap: bpf_load_program() event=raw_tracepoint/filler/sys_procexit_e: Operation not permitted
On sysdig 0.32.1 everything works. So maybe the error is not specific to arm64 but common to both.
I have checked sysdig 0.32.1 on arm with the link fix, and sysdig works correctly. So something is definitely broken for COS starting from sysdig 0.33.1.
1) So this fix is correct:
if [ "${ARCH}" == "aarch64" ]; then
    BPF_KERNEL_SOURCES_URL="https://storage.googleapis.com/cos-tools/${BUILD_ID}/lakitu-arm64/kernel-headers.tgz"
else
    BPF_KERNEL_SOURCES_URL="https://storage.googleapis.com/cos-tools/${BUILD_ID}/kernel-headers.tgz"
fi
2) Even with the GKE binary corruption, sysdig 0.32.1 works correctly on both arm64 and x64 with the link fix.
3) Starting from sysdig 0.33.1, COS does not work on either arm64 or x64.
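The link selection from point 1 can be isolated into a function so that it is easy to check for both architectures. This is a minimal sketch: the function name `cos_kernel_headers_url` is invented here, and only the two URL layouts quoted in this thread are assumed to exist.

```shell
#!/usr/bin/env bash

# Build the COS kernel-headers URL for a given BUILD_ID (from /etc/os-release)
# and machine arch (from `uname -m`). Only aarch64 gets the lakitu-arm64 path;
# every other arch falls back to the default layout, as in the fix above.
cos_kernel_headers_url() {
    local build_id="$1" arch="$2"
    local base="https://storage.googleapis.com/cos-tools/${build_id}"
    if [ "$arch" = "aarch64" ]; then
        echo "${base}/lakitu-arm64/kernel-headers.tgz"
    else
        echo "${base}/kernel-headers.tgz"
    fi
}
```

Inside the loader it could be called as `BPF_KERNEL_SOURCES_URL="$(cos_kernel_headers_url "${BUILD_ID}" "${ARCH}")"`, assuming both variables are already set as in the snippet above.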
Good day, I have found the problem. This commit leads to problems on GKE:
https://github.com/falcosecurity/libs/commit/1e06bd3f4f8bb9244caf4e33d5d110c482d88ee5
So there is a loop with 2 max values:
For the COS kernel this is too big, which leads to these errors:
processed 40396 insns (limit 1000000) max_states_per_insn 1 total_states 4057 peak_states 4057 mark_read 73
-- END PROG LOAD LOG --
libscap: bpf_load_program() event=raw_tracepoint/filler/sys_procexit_e: Operation not permitted
And now this message is clear: from the BPF verifier's point of view, the function sys_procexit_e has more than 1M instructions. I tested a bit and found that with values:
this code also works for both arm64 and x64. So I will create an issue for the falco libs team.
As I understand it, on your side I only need the link fix for arm64:
https://storage.googleapis.com/cos-tools/17412.156.23/lakitu-arm64/kernel-headers.tgz
And then the ticket can be closed.
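When comparing loop bounds across kernels, it can help to pull the instruction counters out of the verifier summary line instead of reading them by eye. A small sketch, assuming the log has been saved to a file and contains a summary line in the "processed N insns (limit M)" form shown in the log above; the function name is hypothetical.

```shell
#!/usr/bin/env bash

# Read a BPF verifier log on stdin and print "<processed> <limit>" for every
# "processed N insns (limit M) ..." summary line it contains.
verifier_insn_counts() {
    sed -n 's/^processed \([0-9][0-9]*\) insns (limit \([0-9][0-9]*\)).*/\1 \2/p'
}
```

For instance, `verifier_insn_counts < sysdig_log.txt` could be run once per candidate loop bound to see how the processed count moves relative to the limit.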
Hey @albe19029! Thank you so much for the in-depth investigation! Great catch!
This is the ticket for the falco libs team: https://github.com/falcosecurity/libs/issues/1639
@therealbobo Sorry, you closed the ticket, but what about the invalid link? As far as I can see, it is still not fixed.
The fix for https://github.com/falcosecurity/libs/issues/1639 is ready. Do you know when there will be a new release of sysdig, and is it possible to include this fix in it? The current 0.34.1 will break on GKE.
The next sysdig release is coming in the next days. I have to double check but I think that we can apply this patch :)
That would be great, as this bug is blocking us badly. Thanks in advance.
@therealbobo will there be a fix for the invalid link in scap-driver-loader.in?
Also, 0.35.0 was released but without the COS driver fix. Do you know when there will be a patch release?
Hey @albe19029! I just released 0.35.1 with all the fixes! Please let me know if you encounter any problem! :)
Thanks a lot, will check this version and let you know about the results.
When I try to build the bpf driver on an arm64 GKE server, I get the error listed in file1.txt
As far as I can see, the link https://storage.googleapis.com/cos-tools/17412.156.23/kernel-headers.tgz is invalid, as for arm64 it should be
https://storage.googleapis.com/cos-tools/17412.156.23/lakitu-arm64/kernel-headers.tgz (from the GKE docs - https://cloud.google.com/container-optimized-os/docs/resources/sources)
But even when I add the following fix to scap-driver-loader:
if [ "${ARCH}" == "aarch64" ]; then
    BPF_KERNEL_SOURCES_URL="https://storage.googleapis.com/cos-tools/${BUILD_ID}/lakitu-arm64/kernel-headers.tgz"
else
    BPF_KERNEL_SOURCES_URL="https://storage.googleapis.com/cos-tools/${BUILD_ID}/kernel-headers.tgz"
fi
I manage to build the driver, but it doesn't run. And during compilation I get the output listed in file2.txt
Can you please help to fix this error correctly? Thanks in advance.