Open Caroline132 opened 11 months ago
> Falco version: 3.3.0

Do you mean 0.33? 🤔

I guess it's the Helm chart version.

My bad. Yes, I put the Helm chart. It's version 0.35.1.
@Caroline132 just double-checking: is it always null for any rule that triggers in a container workload, or just sometimes null?

If it is always null, something is wrong. If it is sometimes null, it's because things are never perfect in production ...

The container runtime is containerd, I suppose?
The container.image.repository field is always null for those specific rules (i.e., Non sudo setuid and Redirect STDOUT/STDIN to Network Connection in Container). However, other rules that trigger do have a non-null container.image.repository field.

And yes, the container runtime is containerd.
Thank you for reporting back. It's likely not related to the specific rules.

I just opened a new ticket to track re-auditing the container engine, as it has been on my mind for a while to see if we can improve something: https://github.com/falcosecurity/falco/issues/2708

Curious: would you be able to add %container.duration to all of your rules and see if, for example, it fails more often for events closer to container start, or if there are also patterns showing it can happen anytime in the container's lifetime? Thank you. https://falco.org/docs/reference/rules/supported-fields/#field-class-container
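To illustrate the suggestion above, adding a field to a rule means appending it to the rule's output string in a local override file. A minimal sketch using the `override` syntax available in newer rules files (the output text below is abbreviated and illustrative, not the exact upstream rule; verify the syntax against your Falco version):

```yaml
# falco_rules.local.yaml -- hypothetical override appending container.duration
# to an existing rule's output for debugging. The base output string here is
# shortened; copy the real one from the upstream rule before replacing it.
- rule: Redirect STDOUT/STDIN to Network Connection in Container
  override:
    output: replace
  output: >
    Redirect stdout/stdin to network connection
    (container_id=%container.id image=%container.image.repository
    container_duration_ns=%container.duration)
```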
Thanks for the update and for opening the new ticket! I'll add %container.duration to my rules and monitor the results. I'll keep you posted.
Hi @incertum, for the Non sudo setuid alert, the container.duration value is always null. But for the Redirect STDOUT/STDIN to Network Connection in Container alert, container.duration does not seem to correspond to the start of the container (and it seems to be pretty random). For example, some of the values obtained were: 1920672687890 ns, 2790773807835 ns and 18446744063141830482 ns.
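As an aside, the third value reported above is suspiciously close to 2^64. Reinterpreting it as a two's-complement signed 64-bit number gives roughly -10.57 seconds, which would point to an unsigned wrap of a negative duration (e.g. a clock or start-timestamp artifact) rather than a real value. This is my own arithmetic interpretation, not something confirmed in the thread:

```python
# The outlier duration from the alert output, taken as an unsigned 64-bit int.
raw = 18446744063141830482  # ns

# Reinterpret as a two's-complement signed 64-bit value.
signed = raw - 2**64 if raw >= 2**63 else raw
print(signed)        # -10567721134
print(signed / 1e9)  # roughly -10.57 seconds
```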
@Caroline132 thank you for reporting back. We will start investigating what could be done. First we need to run more thorough debugging to understand the circumstances under which this happens. It will likely take some time and caution. Once we know more we can post an ETA.
As mentioned above we will track this in https://github.com/falcosecurity/falco/issues/2708
Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale
/remove-lifecycle stale
We are still on it and just added new libsinsp state metrics, also around the container engine -> let's see what the data reveals in the next 2 weeks.
/remove-lifecycle stale still relevant and happening
@jemag perfect timing: We just merged a PR aimed to improve things, see my comment here: https://github.com/falcosecurity/falco/issues/2708#issuecomment-1969575503. I hope you will be able to benefit from these improvements starting with Falco 0.38.0.
Longer term, we have identified more improvement opportunities; however they will take more time.
CC @leogr for awareness.
cc @therealbobo @jasondellaluce @LucaGuerra
/remove-lifecycle stale
/milestone 0.39.0
@Caroline132 would it be possible to give Falco 0.38.0 a try and see if things have slightly improved?
We never expect it to be perfect since API lookups take a few milliseconds, but we will continue to refactor and improve the container engine for 0.39.0.
Hi @incertum, we ran Falco 0.38.0 overnight on two clusters and found that there are still null fields. For example, this instance of Redirect STDOUT/STDIN to Network Connection in Container has the container image information missing, despite the fact that we know it's coming from Velero:

```json
{
  "hostname": "aks-zone2-18206000-vmss000000",
  "output": "10:52:25.628489682: Notice Redirect stdout/stdin to network connection (gparent=<NA> ggparent=<NA> gggparent=<NA> fd.sip=10.176.16.179 connection=10.176.20.4:34814->10.176.16.179:8085 lport=8085 rport=34814 fd_type=ipv4 fd_proto=fd.l4proto evt_type=dup3 user=<NA> user_uid=65532 user_loginuid=-1 process=velero proc_exepath= parent=<NA> command=velero server --uploader-type=kopia --log-format=json terminal=0 container_id=5c9a8da546b2 container_image=<NA> container_image_tag=<NA> container_name=<NA> k8s_ns=<NA> k8s_pod_name=<NA>)",
  "priority": "Notice",
  "rule": "Redirect STDOUT/STDIN to Network Connection in Container",
  "source": "syscall",
  "tags": ["T1059", "container", "maturity_stable", "mitre_execution", "network", "process"],
  "time": "2024-05-31T10:52:25.628489682Z",
  "output_fields": {
    "container.id": "5c9a8da546b2",
    "container.image.repository": null,
    "container.image.tag": null,
    "container.name": null,
    "evt.time": 1717152745628489682,
    "evt.type": "dup3",
    "fd.lport": 8085,
    "fd.name": "10.176.20.4:34814->10.176.16.179:8085",
    "fd.rport": 34814,
    "fd.sip": "10.176.16.179",
    "fd.type": "ipv4",
    "k8s.ns.name": null,
    "k8s.pod.name": null,
    "proc.aname[2]": null,
    "proc.aname[3]": null,
    "proc.aname[4]": null,
    "proc.cmdline": "velero server --uploader-type=kopia --log-format=json",
    "proc.exepath": "",
    "proc.name": "velero",
    "proc.pname": "<NA>",
    "proc.tty": 0,
    "user.loginuid": -1,
    "user.name": "<NA>",
    "user.uid": 65532
  }
}
```
Hey @Caroline132, statistically, what percentage of containers in the logs are null? Could you add the container.duration output field?
[Please note that if a container just starts we need to make an API call, which takes at least 500ms, hence in these conditions it is expected that the event has no container information as we do not halt the main kernel event processing thread.]
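To answer the percentage question, something along these lines could tally nulls from Falco's JSON output (one event per line); the function and file layout here are my own illustration, not part of Falco:

```python
import json

def null_image_ratio(lines):
    """Return (null_count, total) for container.image.repository across
    Falco JSON events that carry that key in output_fields."""
    null_count = total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = json.loads(line).get("output_fields", {})
        if "container.image.repository" in fields:
            total += 1
            if fields["container.image.repository"] is None:
                null_count += 1
    return null_count, total

# Tiny self-contained demo with two synthetic events:
demo = [
    '{"output_fields": {"container.image.repository": null}}',
    '{"output_fields": {"container.image.repository": "velero/velero"}}',
]
print(null_image_ratio(demo))  # (1, 2)
```

In practice you would feed it the alert log file instead of the demo list, e.g. `null_image_ratio(open("falco.json"))`.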
Hi @incertum, I ran Falco over the weekend and for Redirect STDOUT/STDIN to Network Connection in Container all of the events have null values (even with the duration > 500ms). The null values seem to only be associated with certain rules (while others have no null values, with container image information filled in).
@Caroline132 thanks for reporting back. I am currently unsure why you see null values for all container info fields in only a subset of rules, and I don't know how to debug this issue. You would expect these imperfections to appear across all rules, with a certain percentage of logs having all null fields, distributed a bit more uniformly across all container rules.
We can and will try to improve the container engine even more, but unsure if it would fix the issues you are seeing.
One more thought: maybe also worth a try exporting k8s.pod.sandbox_id; if it's the same as the container.id, then the rules in question trigger on sandbox containers, which do not have an image. In those cases all fields are expected to be null.
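The sandbox check described above can be scripted against the JSON output; a sketch assuming both fields are exported in the rule's output (the helper name is mine):

```python
import json

def is_sandbox_event(event_json):
    """Heuristic from the discussion: if the pod sandbox id equals the
    container id, the event likely fired in the 'pause' (sandbox) container,
    which has no image -- so null container fields would be expected."""
    f = json.loads(event_json).get("output_fields", {})
    sandbox_id = f.get("k8s.pod.sandbox_id")
    return sandbox_id is not None and sandbox_id == f.get("container.id")

demo = '{"output_fields": {"container.id": "5c9a8da546b2", "k8s.pod.sandbox_id": "5c9a8da546b2"}}'
print(is_sandbox_event(demo))  # True
```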
@incertum k8s.pod.sandbox_id is also null.
To add further details: if we use the container.id, we can find the related container image:
I am not aware of anything about the velero image that would prevent fetching its information.
Thanks for the additional info. It seems all requests are just failing for that container and you hit this code block: https://github.com/falcosecurity/libs/blob/74725244659e556ced587c2f0bec7bbd42d39b96/userspace/libsinsp/cri.hpp#L775-L778. Which is extremely strange, I admit.
I'm even wondering if we should emit a few more metrics around containers within the metrics framework (https://falco.org/docs/metrics/falco-metrics/). Within the state_counters_enabled category we currently only emit a current snapshot of the number of cached containers (falco.n_missing_container_images, falco.n_containers, ...); perhaps it would be useful to also keep a counter of completely unsuccessful lookups. Maybe you have more ideas?
Describe the bug

We keep getting alerts that have fields with NA and null values. Specifically, this causes false positives for the Non sudo setuid and Redirect STDOUT /STDIN to Network Connection in Container alerts, which is due to the container.image.repository field being left null. Here are example logs of the alerts we are getting:

How to reproduce it
Trigger a rule that filters based on container.image.repository.

Expected behaviour
The fields should be populated.
Screenshots
![image](https://github.com/falcosecurity/falco/assets/20731423/b694e907-a218-4d88-a146-152579aee472)
Environment
Falco version:
0.35.1
Cloud provider or hardware configuration:
Azure AKS
OS:
Kernel:
Installation method:
ArgoCD (Helm chart) to Kubernetes
Additional context
We already tried to pass the --disable-cri-async flag to Falco, and our path to the CRI socket for container metadata, --cri <path>, is properly set.
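For completeness, with the Helm chart these settings are typically wired through the values file rather than raw flags. A sketch assuming the falcosecurity/falco chart's collectors and extra.args values (names and the default containerd socket path are from my reading of the chart and may differ on AKS or across chart versions):

```yaml
# values.yaml (fragment) -- hypothetical wiring of the settings mentioned above.
collectors:
  containerd:
    enabled: true
    socket: /run/containerd/containerd.sock  # mounted and passed to --cri

extra:
  args:
    - --disable-cri-async  # make container metadata lookups synchronous
```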