falcosecurity / falco

Cloud Native Runtime Security
https://falco.org
Apache License 2.0

NA and null fields despite --disable-cri-async #2700

Open Caroline132 opened 11 months ago

Caroline132 commented 11 months ago

Describe the bug

We keep getting alerts that have fields with NA and null values. Specifically, this causes false positives for the Non sudo setuid and Redirect STDOUT/STDIN to Network Connection in Container alerts, due to the container.image.repository field being left null. Here are example logs of the alerts we are getting:

{"hostname":"falco-t5qqr","output":"16:36:38.125465102: Notice Unexpected setuid call by no
n-sudo, non-root program (user= user_loginuid=-1 cur_uid=4294967295 parent=<NA> command=<NA
> pid=1591192 uid=<NA> container_id=host image=<NA>) k8s.ns=<NA> k8s.pod=<NA> container=hos
t","priority":"Notice","rule":"Non sudo setuid","source":"syscall","tags":["T1548.001","con
tainer","host","mitre_privilege_escalation","users"],"time":"2023-07-26T16:36:38.125465102Z
", "output_fields": {"container.id":"host","container.image.repository":null,"evt.arg.uid":
"<NA>","evt.time":1690389398125465102,"k8s.ns.name":null,"k8s.pod.name":null,"proc.cmdline"
:"<NA>","proc.pid":1591192,"proc.pname":null,"user.loginuid":-1,"user.name":"","user.uid":4
294967295}}
{"hostname":"falco-t5qqr","output":"15:41:47.009203551: Notice Redirect stdout/stdin to net
work connection (user=root user_loginuid=-1 k8s.ns=calico-system k8s.pod=calico-node-8klhg 
container=ba1fb44b8f96 process=calico-node parent=calico-node cmdline=calico-node -felix pi
d=1535895 terminal=0 container_id=ba1fb44b8f96 image=<NA> fd.name=127.0.0.1:50382->127.0.0.
1:9099 fd.num=0 fd.type=ipv4 fd.sip=127.0.0.1)","priority":"Notice","rule":"Redirect STDOUT
/STDIN to Network Connection in Container","source":"syscall","tags":["T1059","container","
mitre_discovery","mitre_execution","network","process"],"time":"2023-07-26T15:41:47.0092035
51Z", "output_fields": {"container.id":"ba1fb44b8f96","container.image.repository":null,"ev
t.time":1690386107009203551,"fd.name":"127.0.0.1:50382->127.0.0.1:9099","fd.num":0,"fd.sip"
:"127.0.0.1","fd.type":"ipv4","k8s.ns.name":"calico-system","k8s.pod.name":"calico-node-8kl
hg","proc.cmdline":"calico-node -felix","proc.name":"calico-node","proc.pid":1535895,"proc.
pname":"calico-node","proc.tty":0,"user.loginuid":-1,"user.name":"root"}}

How to reproduce it

Trigger a rule that filters based on container.image.repository.
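For illustration, here is a minimal, hypothetical rule of the shape that exposes the problem (the rule name, condition, and allowlisted image are made up for this sketch): when container.image.repository resolves to null, the allowlist check never matches and the rule fires as a false positive.

```yaml
# Hypothetical repro rule (not a shipped Falco rule): an image-allowlist
# check of the kind that produces the false positives described above.
- rule: Repro Null Image Repository
  desc: Fires on execs in containers whose image repo is not allowlisted (sketch)
  condition: >
    evt.type in (execve, execveat) and container.id != host
    and not container.image.repository in ("docker.io/calico/node")
  output: >
    repro alert (container_id=%container.id
    image=%container.image.repository duration=%container.duration)
  priority: NOTICE
```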

Expected behaviour

The fields should be populated.

Screenshots

[Two screenshots attached in the original issue.]

Environment

  • Falco version: 3.3.0

Additional context

We already tried passing the --disable-cri-async flag to Falco, and our path to the CRI socket for container metadata, --cri <path>, is properly set.
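For reference, a minimal sketch of how we pass these flags via the Helm chart, assuming the chart's extra.args value and the common containerd socket path (both may differ per setup):

```yaml
# values.yaml sketch (assumes the falco chart exposes extra.args;
# the socket path shown is the containerd default and may differ).
extra:
  args:
    - --disable-cri-async
    - --cri
    - /run/containerd/containerd.sock
```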

leogr commented 11 months ago
  • Falco version: 3.3.0

Do you mean 0.33? 🤔

Issif commented 11 months ago
  • Falco version: 3.3.0

Do you mean 0.33? 🤔

I guess it's the Helm chart version.

Caroline132 commented 11 months ago
  • Falco version: 3.3.0

Do you mean 0.33? 🤔

My bad, yes: 3.3.0 is the Helm chart version. The Falco version is 0.35.1.

incertum commented 11 months ago

@Caroline132 just double-checking: is it always null for any rule that triggers in a container workload, or just sometimes null?

If it is always null, something is wrong. If it is sometimes null, it's because things are never perfect in production ...

The container runtime is containerd, I suppose?

Caroline132 commented 11 months ago

@Caroline132 just double-checking: is it always null for any rule that triggers in a container workload, or just sometimes null?

If it is always null, something is wrong. If it is sometimes null, it's because things are never perfect in production ...

The container runtime is containerd, I suppose?

The container.image.repository field is always null for those specific rules (i.e., Non sudo setuid and Redirect STDOUT/STDIN to Network Connection in Container). However, there are other rules that trigger with a non-null container.image.repository field.

And yes, the container runtime is containerd.

incertum commented 11 months ago

Thank you for reporting back. Likely it's not related to the specific rule.

I just opened a new ticket, https://github.com/falcosecurity/falco/issues/2708, to track re-auditing the container engine; improving it has been on my mind for a while now.

Out of curiosity, would you be able to add %container.duration to all of your rules and see whether it fails more often for events closer to container start, or whether there are patterns showing it can happen at any time in the container's lifetime? Thank you. https://falco.org/docs/reference/rules/supported-fields/#field-class-container
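For anyone trying this, here is a sketch of one way to append the field, assuming a Falco release with rule-override support; the output string below is illustrative, not the shipped one:

```yaml
# Local rules file sketch: replace the shipped rule's output so it also
# reports how long the container had been running when the event fired.
- rule: Redirect STDOUT/STDIN to Network Connection in Container
  output: >
    Redirect stdout/stdin to network connection
    (container_id=%container.id image=%container.image.repository
    container_duration=%container.duration)
  override:
    output: replace
```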

Caroline132 commented 11 months ago

Thank you for reporting back. Likely it's not related to the specific rule.

I just opened a new ticket, #2708, to track re-auditing the container engine; improving it has been on my mind for a while now.

Out of curiosity, would you be able to add %container.duration to all of your rules and see whether it fails more often for events closer to container start, or whether there are patterns showing it can happen at any time in the container's lifetime? Thank you. https://falco.org/docs/reference/rules/supported-fields/#field-class-container

Thanks for the update and for opening the new ticket! I'll add %container.duration to my rules and monitor the results. I'll keep you posted.

Caroline132 commented 11 months ago

Hi @incertum, for the Non sudo setuid alert, the container.duration value is always null. But for the Redirect STDOUT/STDIN to Network Connection in Container alert, the events do not seem to occur close to container start (the durations look pretty random). For example, some of the values obtained were: 1920672687890 ns, 2790773807835 ns, and 18446744063141830482 ns.

incertum commented 11 months ago

@Caroline132 thank you for reporting back. We will start investigating what could be done. First we need to run more thorough debugging to understand the circumstances under which this happens. It will likely take some time and caution. Once we know more, we can post an ETA.

As mentioned above we will track this in https://github.com/falcosecurity/falco/issues/2708

poiana commented 7 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

incertum commented 7 months ago

/remove-lifecycle stale

We are still on it and just added new libsinsp state metrics, also around the container engine. Let's see what the data reveal in the next two weeks.

poiana commented 4 months ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

jemag commented 4 months ago

/remove-lifecycle stale

Still relevant and happening.

incertum commented 4 months ago

@jemag perfect timing: we just merged a PR aimed at improving things; see my comment here: https://github.com/falcosecurity/falco/issues/2708#issuecomment-1969575503. I hope you will be able to benefit from these improvements starting with Falco 0.38.0.

Longer term, we have identified more improvement opportunities; however, they will take more time.

CC @leogr for awareness.

leogr commented 4 months ago

cc @therealbobo @jasondellaluce @LucaGuerra

poiana commented 1 month ago

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

incertum commented 1 month ago

/remove-lifecycle stale

LucaGuerra commented 1 month ago

/milestone 0.39.0

incertum commented 1 month ago

@Caroline132 would it be possible to give Falco 0.38.0 a try and see if things have slightly improved?

We never expect it to be perfect, since API lookups take a few milliseconds, but we will continue to refactor and improve the container engine for 0.39.0.

Caroline132 commented 1 month ago

Hi @incertum, we ran Falco 0.38.0 overnight on two clusters and found that there are still null fields. For example, this instance of Redirect STDOUT/STDIN to Network Connection in Container has the container image information missing, despite the fact that we know it's coming from Velero:

{"hostname":"aks-zone2-18206000-vmss000000","output":"10:52:25.628489682: Notice Redirect stdout/stdin to network connection (gparent=<NA> ggparent=<NA> gggparent=<NA> fd.sip=10.176.16.179 connection=10.176.20.4:34
814->10.176.16.179:8085 lport=8085 rport=34814 fd_type=ipv4 fd_proto=fd.l4proto evt_type=dup3 user=<NA> user_uid=65532 user_loginuid=-1 process=velero proc_exepath= parent=<NA> command=velero server --uploader-type
=kopia --log-format=json terminal=0 container_id=5c9a8da546b2 container_image=<NA> container_image_tag=<NA> container_name=<NA> k8s_ns=<NA> k8s_pod_name=<NA>)","priority":"Notice","rule":"Redirect STDOUT/STDIN to N
etwork Connection in Container","source":"syscall","tags":["T1059","container","maturity_stable","mitre_execution","network","process"],"time":"2024-05-31T10:52:25.628489682Z", "output_fields": {"container.id":"5c9
a8da546b2","container.image.repository":null,"container.image.tag":null,"container.name":null,"evt.time":1717152745628489682,"evt.type":"dup3","fd.lport":8085,"fd.name":"10.176.20.4:34814->10.176.16.179:8085","fd.r
port":34814,"fd.sip":"10.176.16.179","fd.type":"ipv4","k8s.ns.name":null,"k8s.pod.name":null,"proc.aname[2]":null,"proc.aname[3]":null,"proc.aname[4]":null,"proc.cmdline":"velero server --uploader-type=kopia --log-
format=json","proc.exepath":"","proc.name":"velero","proc.pname":"<NA>","proc.tty":0,"user.loginuid":-1,"user.name":"<NA>","user.uid":65532}}
incertum commented 1 month ago

Hey @Caroline132, statistically, what percentage of containers in the logs are null? Could you add the container.duration output field?

[Please note that if a container has just started, we need to make an API call, which takes at least 500 ms; under these conditions it is expected that the event has no container information, as we do not halt the main kernel event processing thread.]

Caroline132 commented 1 month ago

Hi @incertum, I ran Falco over the weekend, and for Redirect STDOUT/STDIN to Network Connection in Container all of the events have null values (even with duration > 500 ms). The null values seem to be associated only with certain rules, while others have no null values and the container image information filled in. [Screenshot attached in the original issue.]

incertum commented 1 month ago

@Caroline132 thanks for reporting back. I am currently unsure why you see null values for all container info fields in only a subset of rules, and I don't know how to debug this issue. You would expect these imperfections to appear across all rules, with a certain percentage of logs having all null fields distributed more uniformly across all container rules.

We can and will try to improve the container engine even more, but I am unsure whether that would fix the issues you are seeing.

incertum commented 1 month ago

One more thought: it may also be worth exporting k8s.pod.sandbox_id; if it is the same as the container.id, then the rules in question trigger on sandbox containers, which do not have an image. In those cases all fields are expected to be null.
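A sketch of how that could be surfaced, reusing the same override approach as above (output text again illustrative):

```yaml
# Sketch: print the pod sandbox id next to the container id; if the two
# values match, the event came from a sandbox container that has no image.
- rule: Redirect STDOUT/STDIN to Network Connection in Container
  output: >
    Redirect stdout/stdin to network connection
    (container_id=%container.id sandbox_id=%k8s.pod.sandbox_id
    image=%container.image.repository)
  override:
    output: replace
```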

ctdfo commented 1 month ago

@incertum k8s.pod.sandbox_id is also null. [Screenshot attached in the original issue.]

jemag commented 1 month ago

To add further details: if we use the container.id, we can find the related container image (see the screenshot attached in the original issue). I am not aware of anything about the Velero image that would prevent fetching its information.

incertum commented 1 month ago

Thanks for the additional info. It seems like all requests are simply failing for that container, and you are hitting this code block: https://github.com/falcosecurity/libs/blob/74725244659e556ced587c2f0bec7bbd42d39b96/userspace/libsinsp/cri.hpp#L775-L778. Which is extremely strange, I admit.

I am even wondering if we should emit a few more metrics around containers within the metrics framework (https://falco.org/docs/metrics/falco-metrics/). Within the state_counters_enabled category we currently only emit a point-in-time snapshot of the number of cached containers (falco.n_containers) and of missing container images (falco.n_missing_container_images); perhaps it would be useful to also keep a counter of completely unsuccessful lookups. Maybe you have more ideas?
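For reference, the relevant falco.yaml knobs look roughly like this (a sketch; see the metrics docs linked above for the authoritative key names):

```yaml
# falco.yaml sketch: periodically emit internal metrics, including the
# state counters that carry falco.n_containers and
# falco.n_missing_container_images.
metrics:
  enabled: true
  interval: 15m
  output_rule: true            # emit the snapshot as a Falco log event
  state_counters_enabled: true
```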