falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0

Memory usage keeps increasing until OOM #3269

Closed: ciprian2k closed this issue 2 months ago

ciprian2k commented 2 months ago

Describe the bug

Falco memory usage keeps increasing until OOM

How to reproduce it

Create a custom rule "command_args.yaml"

- rule: Suspicious Command Args Detected
  desc: Detects suspicious commands
  condition: >
    proc.args contains "--lua-exec"
  enabled: true
  output: >
    Suspicious command detected (user=%user.name command=%proc.cmdline)
  priority: WARNING
  tags: [host, data, mitre_discovery]

Run echo repeatedly and watch Falco's memory usage increase until OOM:

watch -n 0.1 "echo --lua-exec"

Screenshots: [screenshot of Falco memory usage increasing over time]

Environment

Tue Jul 2 07:22:49 2024: Falco version: 0.37.1 (x86_64)
Tue Jul 2 07:22:49 2024: Falco initialized with configuration file: /etc/falco/falco.yaml
Tue Jul 2 07:22:49 2024: System info: Linux version 5.14.0-362.18.1.el9_3.x86_64 (mockbuild@x64-builder02.almalinux.org) (gcc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2), GNU ld version 2.35.2-42.el9) #1 SMP PREEMPT_DYNAMIC Mon Jan 29 07:05:48 EST 2024
{"default_driver_version":"7.0.0+driver","driver_api_version":"8.0.0","driver_schema_version":"2.0.0","engine_version":"31","engine_version_semver":"0.31.0","falco_version":"0.37.1","libs_version":"0.14.3","plugin_api_version":"3.2.0"}

FedeDP commented 2 months ago

Hi! Thanks for opening this issue! So, it seems there might be a memory leak when the rule triggers. Can you test the same with the latest Falco, 0.38.1? Thank you very much for reporting!

Also, in case it is still present, can you share your configuration too? Or are you using the default one?

FedeDP commented 2 months ago

So, after

Events detected: 7921227
Rule counts by severity:
   WARNING: 7921227
Triggered rules by rule name:
   Suspicious Command Args Detected: 7921227

I see a +8M increase in resident memory:

160604 root 20 0 2471164 214944 193440 S 26,2 0,3 0:11.68 falco
160604 root 20 0 2479436 222784 193440 S 30,8 0,3 11:38.71 falco

We got a problem, Houston. But not that big, at least here.

EDIT: going to run with valgrind massif tool to check if we can easily spot the leak!

FedeDP commented 2 months ago

OK, on second thought, considering that I am running

watch -n 0.1 "echo --lua-exec"

I'd expect around 10 events per second, which means 36k events per hour. How could I reach 8 million events in like 30 minutes :rofl:

ciprian2k commented 2 months ago

Hi @FedeDP,

Thanks for investigating my problem. I've now tested on Falco 0.38.1 and it has the same issue.

Digging more into the problem, I found out that the memory leak is because I have http_output enabled.

http_output:
  enabled: true
  url: http://samplemywebsite.com/api/falco

This is the only difference in configuration vs the default one.

Issif commented 2 months ago

I confirm I can reproduce the memory leak. I used the exact rule and a pod running with `while true; do echo "--lua-exec"; done`.

The memory usage increases until an OOM kill: [screenshot of pod memory usage climbing until the container is OOMKilled]

  - containerID: containerd://bc51e480adba8a724c297ca9481c6d463c2f0cf556bf61bc37e1af77cf7d6686
    image: docker.io/falcosecurity/falco-no-driver:0.38.1
    imageID: docker.io/falcosecurity/falco-no-driver@sha256:a59cadbaf556c05296dfc8f522786b2138404814797ffbc9ee3b26b336d06903
    lastState:
      terminated:
        containerID: containerd://9e7c69c0b51f9c8a014a35a1b2adfa11277fc3a188e65f04e0f09ef4c2238b9e
        exitCode: 137
        finishedAt: "2024-07-02T11:02:50Z"
        reason: OOMKilled
        startedAt: "2024-07-02T10:37:23Z"

I will test without the http_output.enabled=true.

Issif commented 2 months ago

I confirm the leak disappears once the http_output is disabled:

[screenshot showing memory usage staying flat]

FedeDP commented 2 months ago

Thank you both very much! I will give it a look and report back :)

FedeDP commented 2 months ago

Out of curiosity, which libcurl version are you using? The bundled one or the system one?

EDIT: Anyway, I am able to reproduce by enabling the http output.

FedeDP commented 2 months ago

So, it seems like there is something wrong in the curl_easy_perform call here: https://github.com/falcosecurity/falco/blob/master/userspace/falco/outputs_http.cpp#L118, since commenting it out fixes the issue (well, the http output does nothing then). I am still digging!

FedeDP commented 2 months ago

So, I tried to repro this with a minimal libcurl-only example but couldn't. Then I remembered that our outputs queue is unbounded by default, which means it can grow indefinitely; the rule you provided does not specify any syscall, so it matches every syscall/action made by the process spawned with the --lua-exec argument, and that's why it generates so many output events.

TL;DR: setting outputs_queue.capacity to e.g. 100 in the Falco config fixes the "issue". But please keep in mind that this is not really a bug; it is by-design behavior, exacerbated by the very broad condition of the rule.
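
For reference, a minimal falco.yaml sketch of this change (100 is only an example value; by default the queue is unbounded, as noted above):

outputs_queue:
  # Cap the number of buffered output messages; once the cap is reached,
  # further events are dropped instead of growing memory.
  capacity: 100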

ciprian2k commented 2 months ago

Hi, you are right, setting the outputs_queue capacity in the config resolves "my problem". I didn't know why the memory was increasing and really thought it was a memory leak.

Thank you again for your help and sorry for the time spent on this matter.

FedeDP commented 2 months ago

No problem sir, thanks for asking! /milestone 0.39.0

chenliu1993 commented 1 week ago

Hi @FedeDP, sorry for bringing this up again. I hit the same issue on 0.38.0, but may I ask: if I set outputs_queue.capacity to some fixed value, does it mean Falco will drop some events if the cap is met? If yes, do we have some other options to mitigate this OOM issue? The difference in our environment is that we have a lot of incoming/outgoing network traffic.

FedeDP commented 1 week ago

does it mean Falco will drop some events if the cap is met

Yes, exactly.

If yes, do we have some other options to mitigate this OOM issue?

Unfortunately not; if your system is generating too many events, perhaps some rule is too noisy and should be made stricter.
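
As an illustration only (a sketch, not an official rule): the rule from this issue matches every syscall of a matching process because it has no event-type filter, so anchoring it to exec events would cut the event volume drastically:

# Hypothetical tightened version of the reporter's rule: only match process spawns.
- rule: Suspicious Command Args Detected
  desc: Detects suspicious command arguments at process spawn time only
  condition: >
    evt.type in (execve, execveat) and evt.dir = < and proc.args contains "--lua-exec"
  enabled: true
  output: >
    Suspicious command detected (user=%user.name command=%proc.cmdline)
  priority: WARNING
  tags: [host, data, mitre_discovery]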

chenliu1993 commented 1 week ago

Got it, thanks for answering. Do we have any metrics we can use to monitor drops once a fixed capacity is chosen? I read https://falco.org/docs/metrics/falco-metrics/ but I'm having a hard time understanding what each metric actually means, e.g. falcosecurity_scap_n_retrieve_evts_drops_total vs falcosecurity_scap_n_store_evts_drops_total, the difference between them, and so on.
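
For context, the metrics described on that page are only emitted when metrics collection is turned on in falco.yaml; a minimal sketch, assuming the option names from the linked docs:

metrics:
  enabled: true
  # How often a metrics snapshot is emitted.
  interval: 15m
  # Emit the snapshot as a regular Falco output event.
  output_rule: true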