dnwe closed this issue 7 months ago
@fntlnz RUNBPF
Thanks for this issue @dnwe!
I think you are referring to this configuration, right?
I see that we already have some documentation for that here https://falco.org/docs/event-sources/dropped-events/
Do you see anything we could improve?
/remove-kind feature /kind documentation
RUNBPF contest
Send me an email `lo at linux.com` with your full name and address for the sticker! (I also accept encrypted emails if you have privacy concerns. You can get my public key here https://fntlnz.wtf/downloads/pubkey-0xD624DE73B2400EE4.asc)
@fntlnz that documents how we log when syscalls are dropped, but, as per my quote from the linked blog article, the original author mentioned that a user could also increase the size of the shared buffer to prevent syscalls from being dropped at all, if it were sized large enough
I think this is the droid you are looking for: https://github.com/draios/sysdig/blob/dev/driver/ppm_ringbuffer.h#L17
Let me know if you want to pair if you get snagged compiling the driver - it took me a few tries to get everything dialed in correctly
@kris-nova perfect thanks, I'll give it a spin tomorrow and let you know
Yes @kris-nova it is!
@kris-nova and @dnwe, can I propose demoing the compilation of the driver during our office hours in 2 weeks? It will be recorded and could be handy for our community! If you agree, please open an issue (with kind/debugging-hours) in the office-hours repository so we can schedule it!
@leodido / @kris-nova I haven't quite got around to testing it yet, but I was hoping I'd just be able to get away with patching the kernel module via the stable Dockerfile after the .deb has unpacked the source to /usr/src:
With something like this:
diff --git a/Dockerfile b/Dockerfile
index d55ed24..1210e8d 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -44,6 +44,9 @@ RUN curl -s https://s3.amazonaws.com/download.draios.com/DRAIOS-GPG-KEY.public |
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
+# Patch the ringbuffer in the falco kernel module to reduce dropped syscall events
+RUN sed -e '/RING_BUF_SIZE/s/8/64/' -i /usr/src/falco-*/ppm_ringbuffer.h
+
# Change the falco config within the container to enable ISO 8601
# output.
RUN sed -e 's/time_format_iso_8601: false/time_format_iso_8601: true/' < /etc/falco/falco.yaml > /etc/falco/falco.yaml.new \
And let the falco-probe-loader / dkms build handle getting the compilation right for me 😎
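As a sanity check on what that sed expression actually rewrites, here is a minimal sketch. The sample input line is an assumption modeled on the RING_BUF_SIZE definition in sysdig's ppm_ringbuffer.h; only the sed expression itself comes from the Dockerfile patch above:

```shell
# Sample input line: an assumption modeled on ppm_ringbuffer.h — check
# the real header in your falco-* source tree before relying on this.
line='#define RING_BUF_SIZE 1024 * 1024 * 8'

# The patch substitutes the first "8" on any line matching RING_BUF_SIZE,
# turning the 8 MiB multiplier into 64.
patched=$(printf '%s\n' "$line" | sed -e '/RING_BUF_SIZE/s/8/64/')
echo "$patched"
# prints: #define RING_BUF_SIZE 1024 * 1024 * 64
```

Note the substitution is positional (first `8` on the line), so it only works cleanly if the multiplier is the first `8` that appears.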
Ah so that gives me:
Wed Sep 4 20:41:56 2019: Runtime error: error mapping the ring buffer for device /host/dev/falco0. Exiting.
Presumably because the /usr/bin/falco userspace process needs to be (re-)built with a matching ring buffer size.
Yep, I think that's what we need.
Just curious, what is the use case here for expanding the ring buffer size? Is this in response to the kernel-level components dropping syscall events?
Yes, as per the first post, the linked article called it out as an option to reduce the number of dropped syscalls. We already enabled logging and just wanted to test it out to see if we could reduce the occurrences.
I believe I got a build working, but haven't had a chance to deploy it yet. Will let you know.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
FYI we have been running for a while now with `sed -e '/RING_BUF_SIZE/s/8/96/' -i /usr/src/sysdig/driver/ppm_ringbuffer.h`,
which has reduced the number of dropped events, although we still see 1 or 2, quite infrequently:
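Assuming the header expresses the size as 1024 * 1024 * N (which the s/8/96/ substitution suggests, though I have not verified it against the source tree), that variant works out to the following per-CPU buffer size:

```shell
# Per-CPU buffer size implied by the s/8/96/ substitution, assuming the
# header's 1024 * 1024 * N layout (an assumption, not verified here).
echo $(( 1024 * 1024 * 96 ))
# prints 100663296 (i.e. 96 MiB per CPU)
```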
{
"output": "Falco internal: syscall event drop. 1 system calls dropped in last second.",
"output_fields": {
"ebpf_enabled": "0",
"n_drops": "1",
"n_drops_buffer": "1",
"n_drops_bug": "0",
"n_drops_pf": "0",
"n_evts": "41188"
},
"priority": "Critical",
"rule": "Falco internal: syscall event drop",
"time": "2019-11-13T17:14:13.034211346Z"
}
{
"output": "Falco internal: syscall event drop. 1 system calls dropped in last second.",
"output_fields": {
"ebpf_enabled": "0",
"n_drops": "1",
"n_drops_buffer": "1",
"n_drops_bug": "0",
"n_drops_pf": "0",
"n_evts": "31529"
},
"priority": "Critical",
"rule": "Falco internal: syscall event drop",
"time": "2019-11-13T17:51:17.684344697Z"
}
{
"output": "Falco internal: syscall event drop. 1 system calls dropped in last second.",
"output_fields": {
"ebpf_enabled": "0",
"n_drops": "1",
"n_drops_buffer": "1",
"n_drops_bug": "0",
"n_drops_pf": "0",
"n_evts": "23159"
},
"priority": "Critical",
"rule": "Falco internal: syscall event drop",
"time": "2019-11-13T20:59:24.474584970Z"
}
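For a sense of scale, the drop rate implied by the first record above is tiny. This is plain arithmetic on values copied from that JSON record, not a Falco tool:

```shell
# Drop rate for the first alert above: 1 dropped out of 41188 events
# captured in that second (values copied from the JSON record).
n_drops=1
n_evts=41188
awk -v d="$n_drops" -v e="$n_evts" 'BEGIN { printf "%.4f%%\n", 100 * d / e }'
# prints 0.0024%
```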
From Repository Planning - low-hanging fruit: start by running in different environments with different kinds of CPUs and workloads, and document suggested sizes for the ring buffer.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
keeping fresh
What does everyone here think about making the ring buffer size configurable via a flag?
Also for everyone reading here, the ring buffer is not used for the eBPF implementation. This might also be useful information when comparing performances and to go to the root of the cause of drops.
We've never tried out the eBPF impl, as the docs seemed to suggest the kernel module was the recommended choice. Do we still get notified of drops with eBPF in the same way? I could give it a spin.
@dnwe yes! drops go through the same process.
Just yesterday I made a PR documenting the eBPF installation process for multiple kinds of installations https://github.com/falcosecurity/falco-website/pull/134
Please be aware you might still have SYSDIG_BPF_PROBE if you are on < 0.18.0 - everything should be consistently set to FALCO_BPF_PROBE as of 0.20.0.
Is there a way to debug the dropped events? On a small VPS, we see e.g. this:
Apr 5 15:20:02 host falco: Falco internal: syscall event drop. 1 system calls dropped in last second.
Apr 5 15:20:02 host falco: 15:20:02.335890806: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=0 n_drops=1 n_drops_buffer=1 n_drops_bug=0 n_drops_pf=0 n_evts=12297)
Apr 5 15:25:02 host falco: Falco internal: syscall event drop. 2 system calls dropped in last second.
Apr 5 15:25:02 host falco: 15:25:02.429912553: Critical Falco internal: syscall event drop. 2 system calls dropped in last second. (ebpf_enabled=0 n_drops=2 n_drops_buffer=2 n_drops_bug=0 n_drops_pf=0 n_evts=12532)
Apr 5 15:45:02 host falco: Falco internal: syscall event drop. 1 system calls dropped in last second.
Apr 5 15:45:02 host falco: 15:45:02.858715115: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=0 n_drops=1 n_drops_buffer=1 n_drops_bug=0 n_drops_pf=0 n_evts=15544)
The n_evts counter looks like a small event count between the log messages. Any idea what happens here?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
What does everyone here think about making the ring buffer size configurable via a flag?
@fntlnz did anything happen with this in the end? Currently we're still just maintaining our patch and building falco from source, but it'd be cool if we could just use the official docker image and provide the ringbuffer size as a parameter for the userspace app and as a module parameter for the kernel module
@dnwe Is it an option to increase the frequency of reading the buffer, so as to avoid it filling up in the first place? Just wondering, because we are having this issue as well and we would really like to avoid having to recompile everything and spin our own version of the container!
Is it expected behavior to have dropped syscalls in the log? We consistently have 1 to 4 syscalls dropped every now and then. Should we ignore these messages if the number of dropped syscalls is low? I wonder what kind of log someone exploiting CVE-2019-8339 would generate. If it would generate hundreds of syscall drops, maybe it's an OK solution to just ignore low syscall drop messages.
Are there any plans to port this workaround into the helm chart so that we can pass it as a parameter during install/upgrade?
@leodido @fntlnz @kris-nova Hey guys, if you are interested in the patch, I could bring it back to falco into the cmake patching folder (falco/cmake/modules/sysdig-repo/patch/).
Let me know what's up.
@nvanheuverzwijn +1
@emcay With our patch, you can pass this environment variable: FALCO_DRIVER_LOADER_ARGS: "--compile --module-arg ring_buf_size=134217728"
You can use our docker image ghcr.io/kronostechnologies/falco:0.24.1-18
(18 commits since the 0.24.1 patch); it is not up to date with the latest yet, but it works for sure.
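For reference, the ring_buf_size value in that loader argument is a plain byte count; decoding it is simple arithmetic:

```shell
# 134217728 bytes decoded into MiB (plain arithmetic, nothing Falco-specific).
echo $(( 134217728 / 1024 / 1024 ))
# prints 128
```

So this configuration asks for a 128 MiB buffer, a large step up from the default 8 MiB discussed earlier in the thread.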
@nvanheuverzwijn This should be sufficient for testing, thank you! Anyway, could I take a look at the Dockerfile? Due to the nature of our project I cannot blindly run software on our infra. :)
@jannis-a The fork is here https://github.com/kronostechnologies/falco
Just out of curiosity, have you noticed any improvements? If yes, could you provide a comparison? Thanks in advance!
@nvanheuverzwijn can you port your feature upstream as soon as Sysdig merges your PR for the probe?
@danmx All of the PRs are open against every relevant falco and sysdig repository. I can't do much more than that; it's in the hands of the Falco team.
@leogr Our initial goal was to reduce the amount of alerting due to dropped syscalls. We could revert back to the old buffer size and compare the results. What would you like to see? Are you interested in CPU/RAM performance?
@leogr There's a major decrease in syscall drops, but the CPU usage is a little higher due to the larger buffer / fewer missed calls
Hey @nvanheuverzwijn and @antoinedeschenes
First of all, thanks again for your amazing work. Indeed we already started to review those PRs, though it will take a bit since we have to test it deeply.
What I'm interested in is:
PS feel free to contact me on Slack if you need any help!
I'd also be interested in performance information 😃 I hope you'll make it public
Sure, we could probably make another custom build (required to have dynamic auditing working) without the sysdig patches, easily switch between both and compare
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle rotten
/remove-lifecycle rotten
@fntlnz What should we patch if we want to increase the buffer size used for the eBPF probe, since you mention in https://github.com/falcosecurity/falco/issues/813#issuecomment-581191932 that the eBPF probe doesn't use the same ring buffer?
To try to answer my question of what to patch for the eBPF probe, I'm guessing it's this constant here:
Would someone be able to confirm if this is correct?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen
.
Mark the issue as fresh with /remove-lifecycle rotten
.
Provide feedback via https://github.com/falcosecurity/community. /close
@poiana: Closing this issue.
I think this is the droid you are looking for: https://github.com/draios/sysdig/blob/dev/driver/ppm_ringbuffer.h#L17
The link seems to be dead. Does the droid still exist, or did the Jawas take it away? I am running Falco on a rather large and complex cluster. We are receiving thousands of dropped syscalls a day.
Hey @kolbeface, this is the config we use in Falco to change the buffer size (the description is pretty detailed): https://github.com/falcosecurity/falco/blob/1b62b5ccd1c64cd972ef0252262075cbf42a130c/falco.yaml#L807
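For readers landing here from search, this is a hedged sketch of what that knob looks like in a recent falco.yaml; the exact option name and default may differ between Falco versions, so check the linked file for your release:

```yaml
# Sketch only: option name and semantics assumed from recent falco.yaml;
# verify against the falco.yaml shipped with your Falco version.
# Larger presets allocate a bigger per-CPU syscall buffer and reduce
# drops at the cost of memory.
syscall_buf_size_preset: 4
```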
We are receiving thousands of dropped syscalls a day.

Do you mind enabling some metrics :point_down: https://github.com/falcosecurity/falco/blob/1b62b5ccd1c64cd972ef0252262075cbf42a130c/falco.yaml#L663 to let us understand why you are facing so many drops? Some additional questions: are you using plugins? Are you using -k, -K?

@Andreagit97 thanks for the reply! Forgive me, I am very new to all of this and feel a bit out of my comfort zone. I inherited this project from a teammate. I do not believe we are using a driver:
image:
registry: docker.internal-mycompany.com/docker-hub
repository: falcosecurity/falco-no-driver
I am going to experiment with expanding the buffer today. I will give a go at enabling metrics, it seems fairly straight forward. If I enable the metrics rule will that create a metrics log and store it in the same place we are storing our other falco logs (S3)?
I am going to experiment with expanding the buffer today. I will give a go at enabling metrics, it seems fairly straight forward. If I enable the metrics rule will that create a metrics log and store it in the same place we are storing our other falco logs (S3)?
Yes :)
What would you like to be added:
I have a Falco-related question that I was hoping could be answered in the documentation. Reading https://sysdig.com/blog/cve-2019-8339-falco-vulnerability/ there's a small paragraph that states:
But I couldn't find a mention in the blog article or the Falco docs of which kernel buffer is being described here. Does anyone know the details? I was assuming we could patch the kernel module built via dkms to increase the buffer, but a glance over the code didn't immediately make it obvious which buffer needed to be increased.
Why is this needed:
To reduce the number of dropped syscalls on busy/large nodes