falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0

Missed events in a short-lived container running as a CronJob #2013

Closed (abroglesc closed this issue 8 months ago)

abroglesc commented 2 years ago

Describe the bug

Falco didn't capture event(s) in a container running as a CronJob.

How to reproduce it

Create a shell script similar to test.sh:

#!/bin/bash

set -euxo pipefail

touch "/etc/test.txt"

Use a CronJob to execute test.sh:

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: test-cronjob
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: falco-test-cronjob
            image: # YOUR IMAGE
            command: ["/bin/bash", "test.sh"]
            volumeMounts:
            - mountPath: /host
              name: host-mount
            securityContext:
              privileged: true
          restartPolicy: OnFailure
          serviceAccountName: random-sa-name
          volumes:
          - name: host-mount
            hostPath:
              path: /
              type: Directory

Expected behaviour

The rule set in use is correct. This was confirmed by observing the same type of events for containers that run for a longer period than the short-lived container used in this test case (test.sh).

Observe (at least):

Environment

Additional context

Introducing a delay (ex: sleep 10) before touch "/etc/test.txt" eliminates the issue.
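
For illustration, a minimal sketch of that workaround applied to the container spec of the CronJob above (the wrapper command and the exact delay value are assumptions for illustration; the report added the sleep inside test.sh itself):

containers:
- name: falco-test-cronjob
  image: # YOUR IMAGE
  # Hypothetical variant: sleep before running test.sh so the container is
  # not extremely short-lived; per the report, this avoids the missed event.
  command: ["/bin/bash", "-c", "sleep 10 && bash test.sh"]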

leogr commented 2 years ago

Could you try to disable the K8s metadata support by removing -k / -K? (just for debugging purposes)

Thanks in advance.
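
For anyone reproducing this, a hedged sketch of what that change might look like in a Falco DaemonSet container spec (the manifest layout, image tag, and argument values are assumptions; only the -k / -K flags themselves come from this thread):

containers:
- name: falco
  image: falcosecurity/falco:0.31.1
  args:
    - /usr/bin/falco
    # Commenting out (or removing) the two flags below disables the
    # K8s API metadata enrichment, as suggested above for debugging.
    # - -k
    # - https://$(KUBERNETES_SERVICE_HOST)
    # - -K
    # - /var/run/secrets/kubernetes.io/serviceaccount/token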

cpungasc commented 2 years ago

> Could you try to disable the K8s metadata support by removing -k / -K? (just for debugging purposes)
>
> Thanks in advance.

Thanks for getting back so fast.

I followed your advice and disabled -k and -K. After 3 manual tests, the events were successfully captured.

We are curious: when Falco fails to enrich metadata and doesn't find the pod or the container, is it silently dropping the events? Can we log the failed state or be made aware of it, please?

Thank you!

leogr commented 2 years ago

I had a suspicion, and you confirmed it :) However, I'm not yet sure what the root cause is.

> We are curious: when Falco fails to enrich metadata and doesn't find the pod or the container, is it silently dropping the events? Can we log the failed state or be made aware of it, please?

Falco should not drop events in this case. When some metadata is missing, the expected behavior is to emit the alert without the missing metadata.

The cause of the wrong behavior may be either:

- events are actually being dropped (you can verify this by setting syscall_event_drops.threshold to 0)

OR

- a bug in the event-processing library (sinsp)

I guess it's the first. Let me know if you can try. Anyway, whatever the case, I think we have to address this issue.

/cc @jasondellaluce /cc @FedeDP /cc @Andreagit97

FedeDP commented 2 years ago

Yep, I agree with Leo: I'd try with threshold: 0 to see if the event is actually dropped. Otherwise, we might have a bug in sinsp!

cpungasc commented 2 years ago

I have tried the following setup/changes: I re-enabled -k/-K and set the syscall_event_drops.threshold value to 0 (as you can see below).

root@gke-k8s-runtime-dev--{redacted}:# grep -i -A 5 -B 5 "threshold" falco.yaml
# token bucket. The rate corresponds to one message every 30 seconds
# with a burst of one message (by default).
#
# The messages are emitted when the percentage of dropped system calls
# with respect the number of events in the last second
# is greater than the given threshold (a double in the range [0, 1]).
#
# For debugging/testing it is possible to simulate the drops using
# the `simulate_drops: true`. In this case the threshold does not apply.

syscall_event_drops:
  threshold: 0
  actions:
    - log
    - alert
  rate: .03333
  max_burst: 10

I let the cluster run for more than 15 minutes (with CronJobs executing every 3 minutes). The last log lines from Falco remained:

> kubectl logs -f pods/falco-monitoring -c falco

* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:

Please let me know if I should look somewhere else or apply any other changes. Thank you!

leogr commented 2 years ago
> * Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
> Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
> Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
> Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:

Hey @cpungasc

If there was no event drop, it is likely a bug. We need to investigate more. Thank you for your detailed report :pray:

Out of curiosity: why is the Falco version redacted? Have you modified Falco or the driver?

FedeDP commented 2 years ago

Hi! Can you post the rule you are expecting to see triggered? Thank you for the support!

cpungasc commented 2 years ago
> * Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
> Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
> Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
> Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:
>
> Hey @cpungasc
>
> If there was no event drop, it is likely a bug. We need to investigate more. Thank you for your detailed report 🙏
>
> Out of curiosity: why is the Falco version redacted? Have you modified Falco or the driver?

Hey @leogr

Sure, let me know if I can help in any other way. Regarding the redacted part: we have a fork where we have built some additional component(s) for rules management. The only modification I am aware of is in the JSON formatter for Falco, but it is not something that should cause this issue. Let me know if you have doubts and would like more detail. Thanks for the question; I should have given context around that.

cpungasc commented 2 years ago

> Hi! Can you post the rule you are expecting to see triggered? Thank you for the support!

Hi @FedeDP

Yes, here are the relevant rules:

- macro: open_read
  condition: (evt.type=open or evt.type=openat) and evt.is_open_read=true and fd.typechar='f' and fd.num>=0

- rule: File read monitor
  desc: A file was accessed within the container.
  condition: open_read
  output: '%evt.time,%evt.datetime,%k8s.pod.name,%k8s.pod.id,%k8s.ns.name,%container.id,%container.name,%container.image,%evt.type,%evt.dir,%evt.arg.name,%proc.name,%proc.args,%proc.cmdline,%proc.exe,%proc.exeline,%proc.pid,%proc.pname,%proc.pcmdline,%proc.ppid,%proc.tty,%proc.cwd,%user.name,%fd.name,%fd.directory,%fd.filename,%fd.containerdirectory'
  priority: INFO

- rule: Process Creation
  desc: A process was spawned within the container.
  condition: (evt.type = execve or syscall.type = execve or evt.type = fork or evt.type = clone)
  output: "%k8s.pod.name,%k8s.pod.id,%k8s.ns.name,%container.id,%container.name,%container.image,%evt.type,%evt.dir,%evt.arg.name,%proc.name %proc.args,%proc.cmdline,%proc.exe,%proc.exeline,%proc.pid,%proc.pname,%proc.pcmdline,%proc.ppid,%proc.tty,%proc.cwd,%user.name,%fd.name"
  priority: INFO

Hope this helps.

FedeDP commented 2 years ago

Hi! Given that no container/k8s-related data is used in the rule conditions, the missing metadata cannot be the reason the rules did not trigger. Most probably, you are really seeing events being dropped!

A bit of context: both container (docker/podman/cri) and k8s metadata are collected in separate threads, and their values are <NA> until the info is actually loaded.

In the following output:

> kubectl logs -f pods/falco-monitoring -c falco

* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:

(i.e. the test run with syscall_event_drops.threshold: 0) what was the output once you Ctrl-C Falco?

cpungasc commented 2 years ago

> Hi! Given that no container/k8s-related data is used in the rule conditions, the missing metadata cannot be the reason the rules did not trigger. Most probably, you are really seeing events being dropped!
>
> A bit of context: both container (docker/podman/cri) and k8s metadata are collected in separate threads, and their values are <NA> until the info is actually loaded.
>
> In the following output:
>
> > kubectl logs -f pods/falco-monitoring -c falco
>
> * Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
> Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
> Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
> Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:
>
> (i.e. the test run with syscall_event_drops.threshold: 0) what was the output once you Ctrl-C Falco?

Ran the test once more. The output was:

Mon May 30 14:44:19 2022: Falco initialized with configuration file /falco.yaml
Mon May 30 14:44:19 2022: Loading rules from file /rules/falco_rules.yaml:
Mon May 30 15:17:08 2022: SIGINT received, exiting...
Syscall event drop monitoring:
   - event drop detected: 318 occurrences
   - num times actions taken: 75
Events detected: 186143
Rule counts by severity:
   INFO: 186143
Triggered rules by rule name:
   New network connection: 163597
   Process Creation: 22410
   File write monitor: 136

FedeDP commented 2 years ago

Ok, so we got some event drops.
Weird things from the top of my head:

- it seems weird that we always drop events for the exact timeframe of the CronJob
- why does removing the -k flag help? OK, it increases CPU usage, but you are not dropping that much for it to make a difference...

@leogr any thought?

leogr commented 2 years ago

> Ok, so we got some event drops.
> Weird things from the top of my head:
>
> - it seems weird that we always drop events for the exact timeframe of the CronJob
> - why does removing the -k flag help? OK, it increases CPU usage, but you are not dropping that much for it to make a difference...
>
> @leogr any thought?

I guess that the k8s support is blocking the main thread for a short amount of time, OR it lets the Falco process go over 100% CPU usage so that the kernel immediately switches to another process. Both situations can make Falco drop events.

We definitely need to investigate.

poiana commented 2 years ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

leogr commented 2 years ago

/remove-lifecycle stale cc @alacuku

poiana commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

alacuku commented 1 year ago

/remove-lifecycle stale

leogr commented 1 year ago

cc @falcosecurity/falco-maintainers any updates?

FedeDP commented 1 year ago

I'd tag @deepskyblue86, who did lots of research and fixes on short-lived containers, right? (Thank you!)

deepskyblue86 commented 1 year ago

Actually, I haven't experienced any drops so far; I've just been dealing with container creation and container metadata lookup in libs. I'm not familiar with the k8s handling in Falco.

poiana commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

alacuku commented 1 year ago

/remove-lifecycle stale

poiana commented 1 year ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

poiana commented 1 year ago

Stale issues rot after 30d of inactivity.

Mark the issue as fresh with /remove-lifecycle rotten.

Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle rotten

poiana commented 1 year ago

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.

/close

poiana commented 1 year ago

@poiana: Closing this issue.

In response to [this](https://github.com/falcosecurity/falco/issues/2013#issuecomment-1681165007):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue with `/reopen`.
>
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Provide feedback via https://github.com/falcosecurity/community.
> /close

abroglesc commented 1 year ago

/reopen

poiana commented 1 year ago

@abroglesc: Reopened this issue.

In response to [this](https://github.com/falcosecurity/falco/issues/2013#issuecomment-1681305145):

> /reopen

Andreagit97 commented 1 year ago

Working on that here: https://github.com/falcosecurity/falco/issues/2973

poiana commented 1 year ago

Rotten issues close after 30d of inactivity.

Reopen the issue with /reopen.

Mark the issue as fresh with /remove-lifecycle rotten.

Provide feedback via https://github.com/falcosecurity/community.

/close

poiana commented 1 year ago

@poiana: Closing this issue.

In response to [this](https://github.com/falcosecurity/falco/issues/2013#issuecomment-1741794013):

> Rotten issues close after 30d of inactivity.
>
> Reopen the issue with `/reopen`.
>
> Mark the issue as fresh with `/remove-lifecycle rotten`.
>
> Provide feedback via https://github.com/falcosecurity/community.
> /close

Andreagit97 commented 1 year ago

/remove-lifecycle rotten

Andreagit97 commented 8 months ago

The new k8smeta plugin should solve this issue; give it a try with our dedicated Helm chart and Falco 0.36.2: https://github.com/falcosecurity/falco/issues/2973#issuecomment-1876880587
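
For reference, a hedged sketch of what enabling it through the chart values might look like (the exact values keys are an assumption and may differ between chart versions; they are not taken from this thread):

# Hypothetical values.yaml excerpt for the falcosecurity/falco Helm chart.
# Enabling the Kubernetes collector deploys the k8s-metacollector and loads
# the k8smeta plugin, replacing the old -k/-K API-server enrichment.
collectors:
  kubernetes:
    enabled: true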

Andreagit97 commented 8 months ago

This should be solved by Falco 0.37.0! Feel free to reopen if this is still an issue.