Closed · abroglesc closed this issue 8 months ago
Could you try to disable the K8s metadata support by removing `-k` / `-K`? (just for debugging purposes)
Thanks in advance.
Thanks for getting back so fast.
I followed your advice and disabled `-k` and `-K`. After 3 manual tests: the events were successfully captured.
We are curious, when falco fails to enrich metadata and doesn't find the pod or the container, is it silently dropping the events? Can we log the failed state or be aware of it, please?
Thank you!
I had a suspicion, and you confirmed it :) However, I'm not yet sure what the root cause is.
> We are curious, when falco fails to enrich metadata and doesn't find the pod or the container, is it silently dropping the events? Can we log the failed state or be aware of it, please?
Falco should not drop events in this case. When some metadata is missing, the expected behavior is to emit the alert without the missing metadata.
The cause of the wrong behavior may be either:
- events are actually being dropped (you can verify by setting `threshold: 0`; you should see all dropping notifications) OR
- a bug somewhere else.
I guess the first. Let me know if you can try. Anyway, whatever the case, I think we have to address this issue.
/cc @jasondellaluce /cc @FedeDP /cc @Andreagit97
Yep, I agree with Leo: I'd try with `threshold: 0` to see if the event is actually dropped. Otherwise, we might have a bug in sinsp!
I have tried with the following setup/changes. I have enabled `-k`/`-K` back and set the `syscall_event_drops.threshold` value to `0` (as you can observe below).
root@gke-k8s-runtime-dev--{redacted}:# grep -i -A 5 -B 5 "threshold" falco.yaml
# token bucket. The rate corresponds to one message every 30 seconds
# with a burst of one message (by default).
#
# The messages are emitted when the percentage of dropped system calls
# with respect the number of events in the last second
# is greater than the given threshold (a double in the range [0, 1]).
#
# For debugging/testing it is possible to simulate the drops using
# the `simulate_drops: true`. In this case the threshold does not apply.
syscall_event_drops:
  threshold: 0
  actions:
    - log
    - alert
  rate: .03333
  max_burst: 10
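For reference, the config comments above describe the drop-notification rate limiting as a token bucket (`rate` messages per second, accumulating up to `max_burst`). A minimal illustrative sketch of that mechanism follows; it mirrors the semantics described in the comments, not Falco's actual implementation:

```python
import time

class TokenBucket:
    """Toy token-bucket rate limiter: `rate` tokens refill per second,
    capped at `max_burst`. Each emitted notification spends one token.
    Illustrative sketch only, not Falco's code."""

    def __init__(self, rate, max_burst):
        self.rate = rate
        self.max_burst = max_burst
        self.tokens = max_burst
        self.last = time.monotonic()

    def claim(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the burst size.
        self.tokens = min(self.max_burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True   # emit the drop notification
        return False      # suppress it

# With the values from the config above: ~1 message every 30s, burst of 10.
bucket = TokenBucket(rate=0.03333, max_burst=10)
emitted = sum(bucket.claim() for _ in range(50))
print(emitted)  # → 10: the burst is spent, then further claims are throttled
```

This also shows why a burst of drop notifications can appear at once (the bucket starts full) and then goes quiet even while drops continue.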
I left the cluster to run for more than 15 minutes (with CronJobs executing every 3 minutes). The last log events from falco remained:
> kubectl logs -f pods/falco-monitoring -c falco
* Success: eBPF probe symlinked to /root/.falco/falco-bpf.o
Wed May 25 12:11:06 2022: Falco version 0.31.1-{redacted}
Wed May 25 12:11:06 2022: Falco initialized with configuration file /falco.yaml
Wed May 25 12:11:06 2022: Loading rules from file /rules/falco_rules.yaml:
Please let me know if I should look somewhere else or apply any other changes. Thank you!
Hey @cpungasc
If there was no event drop, it is likely a bug. We need to investigate more. Thank you for your detailed report :pray:
Out of curiosity: why is the Falco version redacted? Have you modified Falco or the driver?
Hi! Can you post the rule you are expecting to see triggered? Thank you for the support!
Hey @leogr
Sure, let me know if I can help in any other way. Regarding the redacted part, we have a fork where we have built some additional component(s) for rules management. The only modification I am aware of is in the JSON formatter for Falco, but it is not something that should cause the issue. Let me know if you have any doubts or would like me to give more detail. Thanks for the question, I should have given that context.
> Hi! Can you post the rule you are expecting to see triggered? Thank you for the support!
Hi @FedeDP
Yes, here are the relevant rules:
- macro: open_read
  condition: (evt.type=open or evt.type=openat) and evt.is_open_read=true and fd.typechar='f'
    and fd.num>=0

- rule: File read monitor
  condition: open_read
  desc: A file was accessed within the container.
  output: '%evt.time,%evt.datetime,%k8s.pod.name,%k8s.pod.id,%k8s.ns.name,%container.id,%container.name,%container.image,%evt.type,%evt.dir,%evt.arg.name,%proc.name,%proc.args,%proc.cmdline,%proc.exe,%proc.exeline,%proc.pid,%proc.pname,%proc.pcmdline,%proc.ppid,%proc.tty,%proc.cwd,%user.name,%fd.name,%fd.directory,%fd.filename,%fd.containerdirectory'
  priority: INFO

- rule: Process Creation
  condition: (evt.type = execve or syscall.type = execve or evt.type = fork or evt.type = clone)
  desc: A process was spawned within the container.
  output: "%k8s.pod.name,%k8s.pod.id,%k8s.ns.name,%container.id,%container.name,%container.image,%evt.type,%evt.dir,%evt.arg.name,%proc.name %proc.args,%proc.cmdline,%proc.exe,%proc.exeline,%proc.pid,%proc.pname,%proc.pcmdline,%proc.ppid,%proc.tty,%proc.cwd,%user.name,%fd.name"
  priority: INFO
Hope this helps.
Hi! Given that no container/k8s related data is used as a condition, they're not related to missed events. Most probably, you are really seeing events being dropped!
A bit of context: both container (docker/podman/cri) and k8s metadata are collected in separate threads, and their value is `<NA>` until the info is actually loaded.
In the following output:
kubectl logs -f pods/falco-monitoring -c falco
(ie: the test run with `syscall_event_drops.threshold: 0`)
what was the output once you ctrl-c'd Falco?
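The async enrichment FedeDP describes (metadata fetched in a separate thread, with fields reading `<NA>` until the lookup completes) can be sketched as follows. The class and names below are invented for illustration and are not Falco/libs APIs:

```python
import threading
import time

class MetadataCache:
    """Illustrative sketch of async enrichment: lookups run in a
    background thread so the event path never blocks; until a lookup
    completes, the field reads "<NA>". Hypothetical, not Falco code."""

    def __init__(self, fetch):
        self._fetch = fetch        # slow lookup, e.g. a k8s API call
        self._values = {}
        self._lock = threading.Lock()

    def get(self, container_id):
        with self._lock:
            if container_id in self._values:
                return self._values[container_id]
        # Kick off the lookup without blocking the caller.
        threading.Thread(target=self._load, args=(container_id,), daemon=True).start()
        return "<NA>"              # placeholder until the info is loaded

    def _load(self, container_id):
        value = self._fetch(container_id)
        with self._lock:
            self._values[container_id] = value

cache = MetadataCache(fetch=lambda cid: f"pod-for-{cid}")
print(cache.get("abc123"))   # "<NA>" on the first, non-blocking call
time.sleep(0.1)              # give the background lookup time to finish
print(cache.get("abc123"))   # "pod-for-abc123" once loaded
```

This is why very short-lived containers can produce alerts with `<NA>` fields: the process may exit before the lookup finishes, which is a separate symptom from events being dropped outright.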
Ran the test once more. The output was:
Mon May 30 14:44:19 2022: Falco initialized with configuration file /falco.yaml
Mon May 30 14:44:19 2022: Loading rules from file /rules/falco_rules.yaml:
Mon May 30 15:17:08 2022: SIGINT received, exiting...
Syscall event drop monitoring:
- event drop detected: 318 occurrences
- num times actions taken: 75
Events detected: 186143
Rule counts by severity:
INFO: 186143
Triggered rules by rule name:
New network connection: 163597
Process Creation: 22410
File write monitor: 136
Ok, so we got some event drops. Weird things off the top of my head:
- it seems weird that we always drop events for the exact timeframe of the cronjob
- why does removing the `-k` flag help? Ok, it increases CPU usage, but you are not dropping that much for it to make a difference... @leogr any thought?
I guess that k8s support is blocking the main thread for a short amount of time, OR it lets the Falco process go over 100% CPU usage so that the kernel immediately switches to another process. Both situations can make Falco drop events.
We definitely need to investigate.
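leogr's hypothesis (a stalled consumer on the event path causing kernel-side drops) can be illustrated with a toy model of a fixed-size ring buffer. This is a deliberately simplified sketch, not Falco's actual architecture:

```python
from collections import deque

RING_SIZE = 8

def run(consumer_stall_every=None):
    """Toy model: the producer emits one event per tick into a fixed-size
    ring; the consumer normally drains one event per tick, but stalls
    (e.g. on a blocking metadata call) every `consumer_stall_every` ticks.
    Events arriving while the ring is full are dropped."""
    ring, consumed, dropped = deque(), 0, 0
    for tick in range(1000):
        if len(ring) < RING_SIZE:
            ring.append(tick)        # producer: one event per tick
        else:
            dropped += 1             # ring full: the event is dropped
        stalled = consumer_stall_every and tick % consumer_stall_every == 0
        if ring and not stalled:
            ring.popleft()           # consumer drains one event
            consumed += 1
    return consumed, dropped

print(run())                         # no stalls: nothing is dropped
print(run(consumer_stall_every=2))   # frequent stalls: events start dropping
```

Even brief, periodic stalls are enough to overflow a small buffer once the event rate matches the drain rate, which fits the observation that drops line up with the CronJob's activity window.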
Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle stale
/remove-lifecycle stale cc @alacuku
/lifecycle stale
/remove-lifecycle stale
cc @falcosecurity/falco-maintainers any updates?
I'd tag @deepskyblue86 that did lots of research and fixes on short-lived containers, right? (Thank you!)
Actually I didn't experience any drop so far, just dealing with container creation and container metadata lookup in libs. I'm not familiar with k8s handling in Falco.
/lifecycle stale
/remove-lifecycle stale
/lifecycle stale
Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
If this issue is safe to close now please do so with /close.
Provide feedback via https://github.com/falcosecurity/community.
/lifecycle rotten
Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.
Provide feedback via https://github.com/falcosecurity/community.
/close
@poiana: Closing this issue.
/reopen
@abroglesc: Reopened this issue.
Working on that here https://github.com/falcosecurity/falco/issues/2973
/close
@poiana: Closing this issue.
/remove-lifecycle rotten
The new k8smeta plugin should solve this issue; give it a try with our dedicated Helm chart and Falco 0.36.2: https://github.com/falcosecurity/falco/issues/2973#issuecomment-1876880587
This should be solved by Falco 0.37.0! Feel free to reopen if this is still an issue
Describe the bug
Falco didn't capture event(s) in a container running as a CronJob.
How to reproduce it
Create a shell script similar to `test.sh`:
Use a CronJob to execute the `test.sh`:
Expected behaviour
The used rule set is correct. Confirmed by observing similar types of events for containers that run for a longer period of time than the container used to run the test case (in `test.sh`).
Observe (at least):
Environment
System info:
Cloud provider or hardware configuration: GKE
OS:
Kernel:
Installation method: Kubernetes; part of a DaemonSet running in a non-privileged container.
Additional context
Introducing a delay (ex: `sleep 10`) before `touch "/etc/test.txt"` eliminates the issue.
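For context, the workaround above can be expressed as a CronJob whose container sleeps before touching the file. The original `test.sh` and manifest were not included in the report, so the name, image, and schedule below are placeholders (the schedule matches the "every 3 minutes" mentioned earlier in the thread):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: falco-drop-test          # placeholder name
spec:
  schedule: "*/3 * * * *"        # the report mentions CronJobs every 3 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: test
              image: busybox     # placeholder image
              # The delay gives Falco time to pick up the short-lived
              # container before the monitored syscall happens.
              command: ["sh", "-c", "sleep 10 && touch /etc/test.txt"]
```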