dnwe closed this issue 7 months ago
@fntlnz RUNBPF
Thanks for this issue @dnwe!
I think you are referring to this configuration, right?
I see that we already have some documentation for that here https://falco.org/docs/event-sources/dropped-events/
Do you see anything we could improve?
/remove-kind feature /kind documentation
RUNBPF contest
Send me an email `lo at linux.com` with your full name and address for the sticker! (I also accept encrypted emails if you have privacy concerns. You can get my public key here https://fntlnz.wtf/downloads/pubkey-0xD624DE73B2400EE4.asc)
@fntlnz that documents how we log when syscalls are dropped, but, as per my quote from the linked blog article, the original author mentioned that a user could also increase the size of the shared buffer to prevent syscalls from being dropped at all, if it were sized large enough
I think this is the droid you are looking for: https://github.com/draios/sysdig/blob/dev/driver/ppm_ringbuffer.h#L17
Let me know if you want to pair if you get snagged compiling the driver - it took me a few tries to get everything dialed in correctly
@kris-nova perfect thanks, I'll give it a spin tomorrow and let you know
Yes @kris-nova it is!
@kris-nova and @dnwe, can I propose demoing the compilation of the driver during our office hours in 2 weeks? It will be recorded and could be handy for our community! If you agree, please open an issue (with kind/debugging-hours) in the office-hours repository so we can schedule it!
@leodido / @kris-nova I haven't quite got around to testing it yet, but I was hoping I'd just be able to get away with patching the kernel module via the stable Dockerfile after the .deb has unpacked the source to /usr/src:
With something like this:
diff --git a/Dockerfile b/Dockerfile
index d55ed24..1210e8d 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -44,6 +44,9 @@ RUN curl -s https://s3.amazonaws.com/download.draios.com/DRAIOS-GPG-KEY.public |
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
+# Patch the ringbuffer in the falco kernel module to reduce dropped syscall events
+RUN sed -e '/RING_BUF_SIZE/s/8/64/' -i /usr/src/falco-*/ppm_ringbuffer.h
+
# Change the falco config within the container to enable ISO 8601
# output.
RUN sed -e 's/time_format_iso_8601: false/time_format_iso_8601: true/' < /etc/falco/falco.yaml > /etc/falco/falco.yaml.new \
And let the falco-probe-loader / dkms build handle getting the compilation right for me 😎
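As a sanity check on what that sed expression actually rewrites, here is a minimal sketch. The sample input line is an assumption modeled on the RING_BUF_SIZE definition in sysdig's ppm_ringbuffer.h; only the sed expression itself comes from the Dockerfile patch above:

```shell
# Sample input line: an assumption modeled on ppm_ringbuffer.h — check
# the real header in your falco-* source tree before relying on this.
line='#define RING_BUF_SIZE 1024 * 1024 * 8'

# The patch substitutes the first "8" on any line matching RING_BUF_SIZE,
# turning the 8 MiB multiplier into 64.
patched=$(printf '%s\n' "$line" | sed -e '/RING_BUF_SIZE/s/8/64/')
echo "$patched"
# prints: #define RING_BUF_SIZE 1024 * 1024 * 64
```

Note the substitution is positional (first `8` on the line), so it only works cleanly if the multiplier is the first `8` that appears.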
Ah so that gives me:
Wed Sep 4 20:41:56 2019: Runtime error: error mapping the ring buffer for device /host/dev/falco0. Exiting.
Presumably because the /usr/bin/falco userspace process needs to be (re-)built with a matching ring buffer size.
Yep, I think that's what we need.
Just curious, what is the use case here for expanding the ring buffer size? Is this in response to the kernel-level components dropping syscall events?
Yes, as per the first post, the linked article called it out as an option to reduce the number of dropped syscalls. We already enabled logging and just wanted to test it out to see if we could reduce the occurrences.
I believe I got a build working, but haven't had a chance to deploy it yet. Will let you know.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
FYI we have been running for a while now with `sed -e '/RING_BUF_SIZE/s/8/96/' -i /usr/src/sysdig/driver/ppm_ringbuffer.h`,
which has reduced the number of dropped events, although we still see 1 or 2, quite infrequently:
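Assuming the header expresses the size as 1024 * 1024 * N (which the s/8/96/ substitution suggests, though I have not verified it against the source tree), that variant works out to the following per-CPU buffer size:

```shell
# Per-CPU buffer size implied by the s/8/96/ substitution, assuming the
# header's 1024 * 1024 * N layout (an assumption, not verified here).
echo $(( 1024 * 1024 * 96 ))
# prints 100663296 (i.e. 96 MiB per CPU)
```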
{
"output": "Falco internal: syscall event drop. 1 system calls dropped in last second.",
"output_fields": {
"ebpf_enabled": "0",
"n_drops": "1",
"n_drops_buffer": "1",
"n_drops_bug": "0",
"n_drops_pf": "0",
"n_evts": "41188"
},
"priority": "Critical",
"rule": "Falco internal: syscall event drop",
"time": "2019-11-13T17:14:13.034211346Z"
}
{
"output": "Falco internal: syscall event drop. 1 system calls dropped in last second.",
"output_fields": {
"ebpf_enabled": "0",
"n_drops": "1",
"n_drops_buffer": "1",
"n_drops_bug": "0",
"n_drops_pf": "0",
"n_evts": "31529"
},
"priority": "Critical",
"rule": "Falco internal: syscall event drop",
"time": "2019-11-13T17:51:17.684344697Z"
}
{
"output": "Falco internal: syscall event drop. 1 system calls dropped in last second.",
"output_fields": {
"ebpf_enabled": "0",
"n_drops": "1",
"n_drops_buffer": "1",
"n_drops_bug": "0",
"n_drops_pf": "0",
"n_evts": "23159"
},
"priority": "Critical",
"rule": "Falco internal: syscall event drop",
"time": "2019-11-13T20:59:24.474584970Z"
}
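For a sense of scale, the drop rate implied by the first record above is tiny. This is plain arithmetic on values copied from that JSON record, not a Falco tool:

```shell
# Drop rate for the first alert above: 1 dropped out of 41188 events
# captured in that second (values copied from the JSON record).
n_drops=1
n_evts=41188
awk -v d="$n_drops" -v e="$n_evts" 'BEGIN { printf "%.4f%%\n", 100 * d / e }'
# prints 0.0024%
```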
From Repository Planning - low-hanging fruit: start by running in different environments with different kinds of CPUs and workloads, and document suggested sizes for the ring buffer.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
keeping fresh
What does everyone here think about making the ring buffer size configurable via a flag?
Also for everyone reading here, the ring buffer is not used for the eBPF implementation. This might also be useful information when comparing performances and to go to the root of the cause of drops.
We've never tried out the eBPF impl, as the docs seemed to suggest the kernel module was the recommended choice. Do we still get notified of drops with eBPF in the same way? I could give it a spin.
@dnwe yes! drops go through the same process.
Just yesterday I made a PR documenting the eBPF installation process for multiple kinds of installations https://github.com/falcosecurity/falco-website/pull/134
Please be aware you might still have SYSDIG_BPF_PROBE if you are on < 0.18.0 - everything should be consistently set to FALCO_BPF_PROBE as of 0.20.0.
Is there a way to debug the dropped events? On a small VPS, we see e.g. this:
Apr 5 15:20:02 host falco: Falco internal: syscall event drop. 1 system calls dropped in last second.
Apr 5 15:20:02 host falco: 15:20:02.335890806: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=0 n_drops=1 n_drops_buffer=1 n_drops_bug=0 n_drops_pf=0 n_evts=12297)
Apr 5 15:25:02 host falco: Falco internal: syscall event drop. 2 system calls dropped in last second.
Apr 5 15:25:02 host falco: 15:25:02.429912553: Critical Falco internal: syscall event drop. 2 system calls dropped in last second. (ebpf_enabled=0 n_drops=2 n_drops_buffer=2 n_drops_bug=0 n_drops_pf=0 n_evts=12532)
Apr 5 15:45:02 host falco: Falco internal: syscall event drop. 1 system calls dropped in last second.
Apr 5 15:45:02 host falco: 15:45:02.858715115: Critical Falco internal: syscall event drop. 1 system calls dropped in last second. (ebpf_enabled=0 n_drops=1 n_drops_buffer=1 n_drops_bug=0 n_drops_pf=0 n_evts=15544)
The n_evts counter looks like a small event count between the log messages. Any idea what happens here?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
What does everyone here think about making the ring buffer size configurable via a flag?
@fntlnz did anything happen with this in the end? Currently we're still just maintaining our patch and building falco from source, but it'd be cool if we could just use the official docker image and provide the ringbuffer size as a parameter for the userspace app and as a module parameter for the kernel module
@dnwe Is it an option to increase the frequency of reading the buffer, so as to avoid it filling up in the first place? Just wondering, because we are having this issue as well and we would really like to avoid having to recompile everything and spin our own version of the container!
Is it expected behavior to have dropped syscalls in the log? We consistently have 1 to 4 syscalls dropped every now and then. Should we ignore these messages if the number of dropped syscalls is low? I wonder what kind of log someone exploiting CVE-2019-8339 would generate. If it would generate hundreds of syscall drops, maybe it's an OK solution to just ignore low syscall drop messages.
Are there any plans to port this workaround into the helm chart so that we can pass it as a parameter during install/upgrade?
@leodido @fntlnz @kris-nova Hey guys, if you are interested in the patch, I could bring it back to falco into the cmake patching folder (falco/cmake/modules/sysdig-repo/patch/).
Let me know what's up.
@nvanheuverzwijn +1
@emcay With our patch, you can pass this environment variable: FALCO_DRIVER_LOADER_ARGS: "--compile --module-arg ring_buf_size=134217728"
You can use our docker image ghcr.io/kronostechnologies/falco:0.24.1-18
(18 commits since the 0.24.1 patch); it is not up to date with the latest yet, but it works for sure.
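For reference, the ring_buf_size value in that loader argument is a plain byte count; decoding it is simple arithmetic:

```shell
# 134217728 bytes decoded into MiB (plain arithmetic, nothing Falco-specific).
echo $(( 134217728 / 1024 / 1024 ))
# prints 128
```

So this configuration asks for a 128 MiB buffer, a large step up from the default 8 MiB discussed earlier in the thread.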
@nvanheuverzwijn This should be sufficient for testing, thank you! Anyway, could I take a look at the Dockerfile? Due to the nature of our project I cannot blindly run software on our infra. :)
@jannis-a The fork is here https://github.com/kronostechnologies/falco
Just out of curiosity, have you noticed any improvements? If yes, could you provide a comparison? Thanks in advance!
@nvanheuverzwijn can you port your feature upstream as soon as Sysdig merges your PR for the probe?
@danmx All of the PRs are open against every relevant falco and sysdig repository. I can't do much more than that; it's in the hands of the Falco team.
@leogr Our initial goal was to reduce the amount of alerting due to dropped syscalls. We could revert back to the old buffer size and compare the results. What would you like to see? Are you interested in CPU/RAM performance?
@leogr There's a major decrease in syscall drops, but the CPU usage is a little higher due to the larger buffer / fewer missed calls
Hey @nvanheuverzwijn and @antoinedeschenes
First of all, thanks again for your amazing work. Indeed we already started to review those PRs, though it will take a bit since we have to test it deeply.
What I'm interested in is:
PS feel free to contact me on Slack if you need any help!
I'd also be interested in performance information 😃 I hope you'll make it public
Sure, we could probably make another custom build (required to have dynamic auditing working) without the sysdig patches, easily switch between both and compare
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle rotten
/remove-lifecycle rotten
@fntlnz What should we patch if we want to increase the buffer size used for the eBPF probe, since you mention in https://github.com/falcosecurity/falco/issues/813#issuecomment-581191932 that the eBPF probe doesn't use the same ring buffer?
To try to answer my question of what to patch for the eBPF probe, I'm guessing it's this constant here:
Would someone be able to confirm if this is correct?
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten
.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen
.
Mark the issue as fresh with /remove-lifecycle rotten
.
Provide feedback via https://github.com/falcosecurity/community. /close
@poiana: Closing this issue.
I think this is the droid you are looking for: https://github.com/draios/sysdig/blob/dev/driver/ppm_ringbuffer.h#L17
The link seems to be dead. Does the droid still exist, or did the Jawas take it away? I am running Falco on a rather large and complex cluster. We are receiving thousands of dropped syscalls a day.
Hey @kolbeface, this is the config we use in Falco to change the buffer size (the description is pretty detailed): https://github.com/falcosecurity/falco/blob/1b62b5ccd1c64cd972ef0252262075cbf42a130c/falco.yaml#L807
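For readers landing here from search, this is a hedged sketch of what that knob looks like in a recent falco.yaml; the exact option name and default may differ between Falco versions, so check the linked file for your release:

```yaml
# Sketch only: option name and semantics assumed from recent falco.yaml;
# verify against the falco.yaml shipped with your Falco version.
# Larger presets allocate a bigger per-CPU syscall buffer and reduce
# drops at the cost of memory.
syscall_buf_size_preset: 4
```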
We are receiving thousands of dropped syscalls a day.

Do you mind enabling some metrics :point_down: https://github.com/falcosecurity/falco/blob/1b62b5ccd1c64cd972ef0252262075cbf42a130c/falco.yaml#L663 to let us understand why you are facing so many drops? Some additional questions: are you using plugins? Are you using -k, -K?

@Andreagit97 thanks for the reply! Forgive me, I am very new to all of this and feel a bit out of my comfort zone. I inherited this project from a teammate. I do not believe we are using a driver:
image:
registry: docker.internal-mycompany.com/docker-hub
repository: falcosecurity/falco-no-driver
I am going to experiment with expanding the buffer today. I will give a go at enabling metrics, it seems fairly straight forward. If I enable the metrics rule will that create a metrics log and store it in the same place we are storing our other falco logs (S3)?
I am going to experiment with expanding the buffer today. I will give a go at enabling metrics, it seems fairly straight forward. If I enable the metrics rule will that create a metrics log and store it in the same place we are storing our other falco logs (S3)?
Yes :)
What would you like to be added:
I have a Falco-related question that I was hoping could be answered in the documentation. Reading https://sysdig.com/blog/cve-2019-8339-falco-vulnerability/ there's a small paragraph that states:
But I couldn't find a mention in the blog article or the Falco docs of which kernel buffer is being described here. Does anyone know the details? I was assuming we could patch the kernel module built via dkms to increase the buffer, but a glance over the code didn't immediately make it obvious which buffer needed to be increased.
Why is this needed:
To reduce the number of dropped syscalls on busy/large nodes